Gauche Devlog

< Fun with primes | Curse of extended floating point arithmetic >

2013/01/24

Searching multibyte strings

While integrating a patch in string.c I rememberd a couple of optimizations in string-scan I had planned to add but long forgotten.

The current code uses Boyer-Moore if both source string and search string are single-byte, but falls back to a naive search method when either one is multibyte. That's because, if the internal encoding is EUC-JP or Shift_JIS, bytewise scan could lead a spurious match (matching with incorrect character boundaries). However, utf-8 is designed in a way that such spurious match is impossible. I just made it always use Boyer-Moore search.

Another possible optimization is when the length of search string is 1. It may seem that you'd rather search the character instead of searching a string of length 1 in such a case. Unfortunately, one character may occupy multiple bytes, so the current "search a character in a string" naively converts it to search of length 1 string. I just changed it to handle single-byte search specially.

The optimization is pretty effective. The following simple benchmark shows searching multibyte and single-byte string within a long multibyte strings. (It shows slowdown when multibyte match is at the beginning of the source string, since the optimized one has overhead of setting up Boyer-Moore skip table. But in such cases the the search returns quickly and degradation won't won't be a problem.)

Optimized:

last-mb    #<time-result 100000 times/ 17.202 real/ 17.170 user/  0.040 sys>
middle-mb  #<time-result 100000 times/  8.700 real/  8.700 user/  0.000 sys>
first-mb   #<time-result 100000 times/  0.023 real/  0.020 user/  0.000 sys>
last-sb    #<time-result 100000 times/  8.827 real/  8.830 user/  0.000 sys>
middle-sb  #<time-result 100000 times/  4.530 real/  4.520 user/  0.000 sys>
first-sb   #<time-result 100000 times/  0.011 real/  0.020 user/  0.000 sys>

Original:

last-mb    #<time-result 100000 times/ 29.267 real/ 29.240 user/  0.020 sys>
middle-mb  #<time-result 100000 times/ 14.604 real/ 14.590 user/  0.010 sys>
first-mb   #<time-result 100000 times/  0.014 real/  0.020 user/  0.000 sys>
last-sb    #<time-result 100000 times/ 26.303 real/ 26.290 user/  0.020 sys>
middle-sb  #<time-result 100000 times/ 13.149 real/ 13.140 user/  0.000 sys>
first-sb   #<time-result 100000 times/  0.013 real/  0.010 user/  0.000 sys>

Tags: string-scan, Strings

Post a comment

Name: