Monday, June 1, 2009

Testing various Regular Expression solutions, Part 1.5

One thing I really like about blogs is that you can exchange ideas and learn something you naturally won't know for a lifetime because you are not an expert. (And I hate you, Xanga Friends Lock users.)

Firstly, I have added the 3 options -DNS_BLOCK_ASSERTIONS=1 -DRKL_CACHE_SIZE=307 -DRKL_FAST_MUTABLE_CHECK=1 as instructed and able to reduce the running time from 1.7s to 0.7s, but still quite slow compared with ICU at 0.25s. Then I increased the cache size to 709, and the running time is now 0.4s. Further increasing to 1619 starts make the running time longer probably due to excessive memory use.

RKL,307 RKL,709
0.691226s 0.384044s
0.692440s 0.381194s
0.684274s 0.381550s
0.689568s 0.382489s
0.683985s 0.384499s

Then I consider rolling all 16 pattern into one big RegExp like RE1|RE2|RE3|.... One problem with this is each pattern must not have back-references, but this is fine for wildcard rules since back-references do not appear at all. But another nasty thing is that AdBlock Plus rules has the so called filter options which prevents simple merging to take place. Anyway, the running time after merging are:

ICU       PCRE      regex.h   RKLite    JSRegExp
0.260404s 0.189028s 2.526485s 0.268121s 0.141875s AllInOne
0.256532s 0.160290s 2.526872s 0.264368s 0.141586s
0.257263s 0.160135s 2.536937s 0.266934s 0.142434s
0.256900s 0.160222s 2.532003s 0.265274s 0.141467s
0.256744s 0.160232s 2.528987s 0.264851s 0.142299s
0.253102s 0.067551s 0.235714s 0.382489s 0.113274s SeveralInst (median of 5)

So while RegexKitLite does become faster, all other methods are dragged down, esp. the POSIX ERE. This confirms johne's statement.

There are some other developments: (1) I've found the fnmatch function that does wildcard matching, but the running time is 0.8s, probably due to the use of recursion and implicit calls on locales. (2) Then I've written a specific routine that checks using pure C string, and the result is dramatic -- the function only takes 0.02 seconds! This restored some of my logical beliefs :D.


The new source code can be found in:


  1. A few more observations..

    I found another bug in RegexKitLite. There is a bug in the UTF-16 conversion cache such that it is not re-using a previous UTF-16 conversion under some circumstances, causing a string to be matched to be converted to UTF-16 more times than necessary. I'll fix this in the next release of RegexKitLite, but as a quick hack / point patch, you can replace the line in rkl_setCacheSlotToString(), around line 676 (in 3.1):

    if((cacheSlot->setToUniChar != NULL) && ((cacheSlot->setToString == buffer->string) || ((cacheSlot->setToLength == buffer->length) && (cacheSlot->setToHash == buffer->hash)))) { rkl_dtrace_addLookupFlag(lookupResultFlags, RKLCacheHitLookupFlag); goto setRegexText; }


    if(((cacheSlot->setToString == buffer->string) || ((cacheSlot->setToLength == buffer->length) && (cacheSlot->setToHash == buffer->hash)))) { rkl_dtrace_addLookupFlag(lookupResultFlags, RKLCacheHitLookupFlag); cacheSlot->setToUniChar = buffer->uniChar; goto setRegexText; }

    A better tested and more thought out fix will appear in the next release, but this should do for now.

    As the author of RegexKitLite, my opinions of it are obvious biased, but for a lot of regex uses, it's nearly as fast as using the native ICU API. Except for some special cases, it's always going to be at least as slow as using ICU directly. Being an Objective-C wrapper, there's always going to be some overhead involved. Speed, in this case, is pretty much the difference between the wrapper overhead relative to if you had written a 'hand optimized' direct ICU API version. For a lot of uses, RegexKitLite gets within a few percent of a 'hand optimized' direct ICU API version, but there are definitely corner cases which will cause problems. Having a large number of regexes and using 'the next' regex after each regex operation is one of those corner cases. That kind of usage will tend to exceed the caches working set size, causing lots of cache thrashing. Same thing for strings that require UTF-16 conversion- constantly changing the subject string between regex operations will causes the (very limited) UTF-16 conversion cache to thrash.

    Safari AdBlock uses RegexKit.framework as its regex engine. AFAIK, CocoaMug's AdBlock is essentially a port of Safari AdBlock for the iPhone. I wrote functionality in to RegexKit.framework specifically for Safari AdBlock, so here's a few of the lessons learned:

    Once a block pattern "matches", you do not need to evaluate the remaining regexes, since you've already got your answer.

    Some block patterns tend to match more often than others. Since the order in which you check the regexes doesn't really matter, it can be a big win to keep the order in which you evaluate the regular expressions in a sorted list, based on the number of times a pattern has matched. If there is going to be a block match hit, this means that it's much more likely to happen in the first few checks, and once it hits, you can stop evaluating the remaining regexes.

    The most expensive operation is when none of the regexes match, i.e., the URL is not blocked.

    For web pages, the same URL often shows up more than once.

    Based on the above two points, it is a huge win to have a small "miss cache". You can take advantage of this by having something like NSString *missCache[23]; Then if([stringToCheck isEqual:missCache[[stringToCheck hash] % 23]]) { /* already checked, it's a miss. No need to check! */ }. When a string misses, you add it to the cache with something like "NSUInteger missHash = [stringToCheck hash]; [missCache[missHash] release]; missCache[missHash] = [stringToCheck copy];".

    Hope this helps.

  2. Looking at the times for the "AllInOne" vs. the "Individual", RegexKitLite and ICU are essentially equal in the AllInOne times, but RegexKitLite lags in the Individual times. I'm willing to bet the time discrepancy for the Individual times is due to the UTF-16 cache conversion bug I mentioned above. I'm guessing that the nature of the AllInOne test only needs to convert a subject string to UTF-16 once, and once only. The Individual tests are going to hit the bug, needlessly performing a UTF-16 conversion of the subject string again and again. If this is true, then I'd bet that applying the fix above will cause the individual times to drop down to just above the ICU times.

    Picking a cache size is a bit of an art. The cache function is [string hash] % cacheSize. If the regexes you want to use are completely static (which I don't think would be the case for your application), you can spend some time and find the smallest prime such that no two strings are congruent modulo the chosen prime. That is to say, no two strings "[string hash] % cacheSize" are the same, that there are no 'hash collisions'.

    If the regexes are dynamic, then you sort of have to 'bet on the odds', and a larger cacheSize increases the odds that no two regex string hashes will be congruent to each other. Larger cache sizes quickly run in to diminishing returns, though.

    FYI, a typical cache slot in RegexKitLite takes up 64 bytes. A cache size of 23, the default, takes up 1472 bytes. A cache size of 256 would take up 16384 (16K). Here's a list of primes less than 256:

    29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167, 173, 179, 181, 191, 193, 197, 199, 211, 223, 227, 229, 233, 239, 241

    Small steps in the cache size can make a huge difference, but whether or not a given cache size is better or worse is essentially random for all practical purposes. Also, a strings hash value is really only guaranteed to remain constant for the duration of a programs execution. A software update could change the hashing function, which just by chance might cause a huge number of collisions for your chosen cache size when the older hash function had no collisions. :( Like I said, it's an art, and primes generally provide the best performance (fewest collisions) for hash values that are 'essentially random'. As a very rough rule of thumb, a prime that is between 3 to 5 times the expected working set will provide collision free performance. So, if expect to have around 400 regexes in very active use at once, you'd need a cache size of somewhere between 400 * 3 to 400 * 5. For example, you could use the primes 1207 (* 3), 1601 (* 4), or 2003 (* 5).

    For RegexKitLite, the default cache size of 23 allows for a "working set" of about 4 to 7 regexes without collisions. Working set being roughly defined as "used repeatedly in close temporal proximity."