Sunday, May 31, 2009

Testing various Regular Expression solutions, Part 1

Edit: Please check also Part 1.5 for some further analysis.

For AdKiller to work properly, Regular expression matching should be supported. There are various RegExp solutions for the iPhoneOS:
  1. regex.h
  2. ICU's regular expression engine
  3. PCRE
  4. RegexKitLite, which is an ObjC wrapper of the ICU engine.
  5. JavaScriptCore's regular expression engine.
  6. Do it yourself!

For Ad blocking, we only need to test if the Regular Expression matches, so I've limited the test case to a simple RegExp.test only.

In this part, I'll test for the AdBlockPlus "wildcard rules", where a * matches anything, and optional | at the beginning or end to indicate an anchor. Because of the simplicity, we can actually do the test without using RegExp (e.g. using CFStringFind).

From the complexity of the RegExp engines, we would expect the time taken follow this order:
CFStringFind << regex.h << JSRegExp < ICU < RegexKitLite < PCRE



But somewhat surprisingly, the actual time taken is like this:
ICU       PCRE      regex.h   RKLite    JSRegExp  CFStrFind
-----------------------------------------------------------
0.253456s 0.069918s 1.726371s 1.651787s 0.139437s 0.078097s
0.275680s 0.067150s 1.715131s 1.652074s 0.112813s 0.075818s
0.251526s 0.067994s 1.716483s 1.637005s 0.113267s 0.075716s
0.250806s 0.067551s 1.721203s 1.626927s 0.113274s 0.075726s
0.251498s 0.067168s 1.716587s 1.651783s 0.113605s 0.076064s


So the ordering would be:
PCRE < CFStringFind < JSRegExp < ICU << RegexKitLite < regex.h


The most complex engine is the fastest! CFStringFind ranks the second probably because my code is not optimized, but this shows the PCRE engine is really efficient. JSRegExp was originally a PCRE with everything irrelevant to Javascript stripped out, but that is slower than PCRE by 2 times. The WebKit devs should think of what's wrong. RegexKitLite's performance is pretty disappointing. Although it is advertised as very efficient, it is still 6~7 times slower than bare-bone ICU. The problem is probably directly using NSString as the RegExp holder, instead of a dedicated RegExp class. And regex.h's result is really a *WTF*.


Edit: I have redone the test to eliminate some inconsistency (case sensitivity) and convert the regex.h's syntax form BRE to ERE. The result is impressive: ERE is faster than BRE by 7 fold, make in in par with ICU. The new results are:
Methods
ICU PCRE regex.h RKLite JSRegExp CFStrFind
-----------------------------------------------------------
0.253736s 0.069918s 0.238111s 1.651787s 0.139437s 0.070956s
0.254308s 0.067150s 0.235963s 1.652074s 0.112813s 0.068302s
0.252483s 0.067994s 0.235714s 1.637005s 0.113267s 0.068710s
0.251647s 0.067551s 0.234380s 1.626927s 0.113274s 0.068353s
0.253102s 0.067168s 0.235367s 1.651783s 0.113605s 0.068280s

and the new ordering is:
PCRE ≈ CFStringFind < JSRegExp < regex.h < ICU << RegexKitLite

So PCRE is still the fastest, but with a smaller margin. And RegexKitLite is now win the title of the slowest RegEx implementation.

To conclude this part, we should use PCRE for wildcard testing, if speed and clarity are important.

The source code of the test can be found in http://pastie.org/495990. Basically, 323 URLs collected from 3 different sources are matched against 16 wildcard rules. The file is compiled with:
/Developer/Platforms/iPhoneOS.platform/Developer/usr/bin/gcc-4.0 -arch armv6 -std=c99 -isysroot /Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS2.2.1.sdk -I/Developer/Platforms/iPhoneOS.platform/Developer/usr/include/gcc/darwin/default -I/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS2.2.1.sdk/usr/lib/gcc/arm-apple-darwin9/4.0.1/include -L/usr/local/lib -Wall -mcpu=arm1176jzf-s -O2 -L. -framework CoreFoundation -framework Foundation -licucore -lpcre main.m RegexKitLite.m ; ldid -S a.out
and run on a jailbroken iPod Touch 1G, firmware version 2.2.1.

2 comments:

  1. I checked the code you pasted at pastie.org for RegexKitLite as to why it was performing so slow. I did my tests on a MacBook Pro, not an iPhone, though.

    Using -DRKL_DTRACE, it was apparent that the regex cache size was too small. Since the test iterates over all the regular expressions sequentially for each URL, this is going to cause performance problems as compiled regexes are constantly going to be evicted from the cache, only to have to recompile them again a short time later. Compiling a regular expression is an extremely expensive operation.

    There are also a lot of run time debugging assertions that can be disabled with the standard NS_BLOCK_ASSERTIONS. Most developers set this compile time flag for production / shipping code.

    I'd recommend adding: -DNS_BLOCK_ASSERTIONS -DRKL_CACHE_SIZE=307

    Using these settings I was able to get a time of 0.013204 for RegexKitLite vs. 0.036467, which is quite a bit of improvement. The straight ICU test clocked in at 0.012532. This makes RegexKitLite just 5% slower than "raw, direct" ICU usage. ICU, and therefore RegexKitLite, are at a disadvantage in these kinds of tests because ICU only works on UTF-16 encoded strings. That means that the simple 7bit ASCII URLs being matched all need to be converted to UTF-16 before they can be searched.

    I'd also recommend that you consider writing your own regex engine if you're going to be doing "AdBlock Plus" like pattern matching of URLs to see which ones should be blocked. Due to the core engine design of the regex engines listed here, you are essentially forced to evaluate each regex once for each URL of interest. Each regex that you want to check increases the amount of time it takes to perform a "block check".

    The 'ideal' block checking regex engine would concatenate all the regular expressions in to one large regex. Something like '(?:R1|R2|R3)'. From experience, using such a regex with the engines listed above is slower than evaluating them individually.

    http://swtch.com/~rsc/regexp/regexp1.html

    Using a "Thompson NFA" based regex engine allows you to evaluate the concatenated regex very efficiently, essentially ~O(n), where n is the number of characters in the subject string. This means that for all the regexes you want to use to see if they match requires only a single execution of the concatenated regex. See http://swtch.com/~rsc/regexp/ for some simple Thompson NFA implementations. Since all you need is simple "Does this regex match this string?" functionality, I'd bet you could get one of the implementations up and running and it would obliterate the times posted by the regex engines here.

    ReplyDelete
  2. johne,

    Thanks for comment! I have added -DNS_BLOCK_ASSERTIONS=1 -DRKL_CACHE_SIZE=307 -DRKL_FAST_MUTABLE_CHECK=1 and is able to bring down the time to 0.7s, and further increasing the cache size to 709 (chosen arbitrarily) reduces it to 0.38s. It is still slower than ICU's 0.25s by 50% but it's more reasonable now.

    Anyway, I've reimplemented the wildcard testing part using pure C and is able to get down to 0.02s. I'll check the Thompson NFA when I have time.

    ReplyDelete