2013/01/24
Searching multibyte strings
While integrating a patch in string.c
I rememberd a couple of
optimizations in string-scan
I had planned to add but long forgotten.
The current code uses Boyer-Moore if both source string and search string are single-byte, but falls back to a naive search method when either one is multibyte. That's because, if the internal encoding is EUC-JP or Shift_JIS, bytewise scan could lead a spurious match (matching with incorrect character boundaries). However, utf-8 is designed in a way that such spurious match is impossible. I just made it always use Boyer-Moore search.
Another possible optimization is when the length of search string is 1. It may seem that you'd rather search the character instead of searching a string of length 1 in such a case. Unfortunately, one character may occupy multiple bytes, so the current "search a character in a string" naively converts it to search of length 1 string. I just changed it to handle single-byte search specially.
The optimization is pretty effective. The following simple benchmark shows searching multibyte and single-byte string within a long multibyte strings. (It shows slowdown when multibyte match is at the beginning of the source string, since the optimized one has overhead of setting up Boyer-Moore skip table. But in such cases the the search returns quickly and degradation won't won't be a problem.)
Optimized:
last-mb #<time-result 100000 times/ 17.202 real/ 17.170 user/ 0.040 sys> middle-mb #<time-result 100000 times/ 8.700 real/ 8.700 user/ 0.000 sys> first-mb #<time-result 100000 times/ 0.023 real/ 0.020 user/ 0.000 sys> last-sb #<time-result 100000 times/ 8.827 real/ 8.830 user/ 0.000 sys> middle-sb #<time-result 100000 times/ 4.530 real/ 4.520 user/ 0.000 sys> first-sb #<time-result 100000 times/ 0.011 real/ 0.020 user/ 0.000 sys>
Original:
last-mb #<time-result 100000 times/ 29.267 real/ 29.240 user/ 0.020 sys> middle-mb #<time-result 100000 times/ 14.604 real/ 14.590 user/ 0.010 sys> first-mb #<time-result 100000 times/ 0.014 real/ 0.020 user/ 0.000 sys> last-sb #<time-result 100000 times/ 26.303 real/ 26.290 user/ 0.020 sys> middle-sb #<time-result 100000 times/ 13.149 real/ 13.140 user/ 0.000 sys> first-sb #<time-result 100000 times/ 0.013 real/ 0.010 user/ 0.000 sys>
Tags: string-scan, Strings
Post a comment