2013/02/04
Building a list with index---which way is faster?
We often need to create a list of length N whose k-th element is calculated from k, e.g. (f k).
The shortest way is to map f over a sequence of indices:
(map f (iota N))
If it's a throwaway script I just do this. But in reusable code, creating a temporary list with (iota N) looks kind of wasteful. Is there a better way?
Using a lazy list for the indices avoids creating the full intermediate list up front. However, since we realize the entire list anyway, the overhead of lazy evaluation can be a drag.
(map f (liota N))
If f is a somewhat complicated expression, I like to use eager comprehensions, which avoid the intermediate list entirely. The only downside is that sometimes I don't want to depend on the srfi-42 module (there's nothing wrong with it, but there are times when pulling in all of srfi-42 just for this small expression seems overkill).
(list-ec (: i n) (f i))
What else? The comprehensive srfi-1 provides universal list builders, unfold and unfold-right.
(unfold (cut = n <>) f (cut + <> 1) 0)
(unfold-right negative? f (cut - <> 1) (- n 1))
And this is the freshman's code.
(do ([i 0 (+ i 1)]
     [r '() (cons (f i) r)])
    [(= i n) (reverse r)])
Now, time to benchmark! To measure the performance differences among these looping constructs, I just pass - as f.
(define (timeit n)
  (time-these/report '(cpu 3)
    `((eager   . ,(^[] (map - (iota n))))
      (lazy    . ,(^[] (map - (liota n))))
      (ec      . ,(^[] (list-ec (: i n) (- i))))
      (unfold  . ,(^[] (unfold (cut = n <>) - (cut + <> 1) 0)))
      (unfoldr . ,(^[] (unfold-right negative? - (cut - <> 1) (- n 1))))
      (do      . ,(^[] (do ([i 0 (+ i 1)] [r '() (cons (- i) r)])
                           [(= i n) (reverse r)]))))))
And the winner is... (drum)
gosh> (timeit 10)
Benchmark: ran eager, lazy, ec, unfold, unfoldr, do, each for at least 3 cpu seconds.
  eager: 3.000 real, 3.030 cpu (3.030 user + 0.000 sys)@275029.70/s n=833340
   lazy: 3.265 real, 3.280 cpu (3.270 user + 0.010 sys)@139736.89/s n=458337
     ec: 3.100 real, 3.080 cpu (3.080 user + 0.000 sys)@243506.49/s n=750000
 unfold: 3.072 real, 3.070 cpu (3.060 user + 0.010 sys)@274837.13/s n=843750
unfoldr: 3.275 real, 3.280 cpu (3.280 user + 0.000 sys)@474576.22/s n=1556610
     do: 3.302 real, 3.270 cpu (3.260 user + 0.010 sys)@663932.42/s n=2171059

             Rate  eager   lazy     ec unfold unfoldr     do
  eager  275030/s     --  1.968  1.129  1.001   0.580  0.414
   lazy  139737/s  0.508     --  0.574  0.508   0.294  0.210
     ec  243506/s  0.885  1.743     --  0.886   0.513  0.367
 unfold  274837/s  0.999  1.967  1.129     --   0.579  0.414
unfoldr  474576/s  1.726  3.396  1.949  1.727      --  0.715
     do  663932/s  2.414  4.751  2.727  2.416   1.399     --
#<undef>
gosh> (timeit 100)
Benchmark: ran eager, lazy, ec, unfold, unfoldr, do, each for at least 3 cpu seconds.
  eager: 3.266 real, 3.260 cpu (3.240 user + 0.020 sys)@36154.91/s n=117865
   lazy: 3.034 real, 3.030 cpu (3.030 user + 0.000 sys)@19042.90/s n=57700
     ec: 3.072 real, 3.070 cpu (3.060 user + 0.010 sys)@40719.87/s n=125010
 unfold: 3.167 real, 3.160 cpu (3.150 user + 0.010 sys)@37299.05/s n=117865
unfoldr: 3.221 real, 3.230 cpu (3.230 user + 0.000 sys)@53212.07/s n=171875
     do: 3.220 real, 3.220 cpu (3.220 user + 0.000 sys)@71172.05/s n=229174

             Rate  eager   lazy     ec unfold unfoldr     do
  eager   36155/s     --  1.899  0.888  0.969   0.679  0.508
   lazy   19043/s  0.527     --  0.468  0.511   0.358  0.268
     ec   40720/s  1.126  2.138     --  1.092   0.765  0.572
 unfold   37299/s  1.032  1.959  0.916     --   0.701  0.524
unfoldr   53212/s  1.472  2.794  1.307  1.427      --  0.748
     do   71172/s  1.969  3.737  1.748  1.908   1.338     --
#<undef>
gosh> (timeit 1000)
Benchmark: ran eager, lazy, ec, unfold, unfoldr, do, each for at least 3 cpu seconds.
  eager: 3.264 real, 3.250 cpu (3.250 user + 0.000 sys)@3846.15/s n=12500
   lazy: 3.024 real, 3.020 cpu (3.020 user + 0.000 sys)@1986.75/s n=6000
     ec: 3.080 real, 3.090 cpu (3.090 user + 0.000 sys)@4414.24/s n=13640
 unfold: 3.127 real, 3.130 cpu (3.130 user + 0.000 sys)@2284.35/s n=7150
unfoldr: 3.277 real, 3.290 cpu (3.290 user + 0.000 sys)@5453.19/s n=17941
     do: 3.221 real, 3.220 cpu (3.220 user + 0.000 sys)@7320.81/s n=23573

             Rate  eager   lazy     ec unfold unfoldr     do
  eager    3846/s     --  1.936  0.871  1.684   0.705  0.525
   lazy    1987/s  0.517     --  0.450  0.870   0.364  0.271
     ec    4414/s  1.148  2.222     --  1.932   0.809  0.603
 unfold    2284/s  0.594  1.150  0.517     --   0.419  0.312
unfoldr    5453/s  1.418  2.745  1.235  2.387      --  0.745
     do    7321/s  1.903  3.685  1.658  3.205   1.342     --
#<undef>
The freshman's code using do is the clear winner. (I'm using git HEAD.)
This is a bit disappointing, for my goal is to make the most concise and clear code run the fastest. I don't like being forced into lower-level abstractions just for performance; the code eventually gets cluttered with such "hand optimizations", and it points to a deficiency in the implementation.
For this goal, I guess I should optimize the eager comprehension---it is a macro, so it ought to generate code as efficient as a hand-coded do loop in such a simple case!
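In principle nothing stops the macro from expanding into exactly such a loop. A hypothetical ideal expansion of (list-ec (: i n) (f i))---not what srfi-42 actually emits today---would be:

(do ([i 0 (+ i 1)]
     [r '() (cons (f i) r)])
    [(= i n) (reverse r)])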
The performance of unfold-right is notable. I don't use the unfold family much, just because I always forget the order of the arguments and need to look up the manual. Maybe I should make myself more familiar with them.
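As a memory aid, here's how srfi-1's unfold lines up with the index-building use above (annotations mine):

;; (unfold p f g seed) conses up (f seed), (f (g seed)), (f (g (g seed))) ...
;; until (p seed) becomes true.  Building (0 -1 -2 ... -(n-1)):
(unfold (cut = n <>)   ; p: stop when the index reaches n
        -              ; f: element = (- index)
        (cut + <> 1)   ; g: advance the index
        0)             ; initial seed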
For the readers: Don't take this entry as "you should use a do loop for performance". It's fast in the current versions of Gauche, but I'll improve it over time so that the concise code can run as fast as the expanded do loop.
★ ★ ★
(Edit: 2013/02/05 10:32:47 UTC): Peter Bex pointed me to list-tabulate, which is exactly for this kind of work. Aha!
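If you haven't seen it, list-tabulate comes from srfi-1 and takes the length and the element-generating procedure directly; for example:

(list-tabulate 5 -)   ; => (0 -1 -2 -3 -4)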
Here's the updated benchmark. Yes, list-tabulate is the winner!
(define (timeit n)
  (time-these/report '(cpu 3)
    `((eager    . ,(^[] (map - (iota n))))
      (lazy     . ,(^[] (map - (liota n))))
      (ec       . ,(^[] (list-ec (: i n) (- i))))
      (unfold   . ,(^[] (unfold (cut = n <>) - (cut + <> 1) 0)))
      (unfoldr  . ,(^[] (unfold-right negative? - (cut - <> 1) (- n 1))))
      (do       . ,(^[] (do ([i 0 (+ i 1)] [r '() (cons (- i) r)])
                            [(= i n) (reverse r)])))
      (tabulate . ,(^[] (list-tabulate n -))))))
gosh> (timeit 1000)
Benchmark: ran eager, lazy, ec, unfold, unfoldr, do, tabulate, each for at least 3 cpu seconds.
   eager: 3.312 real, 3.300 cpu (3.300 user + 0.000 sys)@3846.67/s n=12694
    lazy: 3.056 real, 3.060 cpu (3.060 user + 0.000 sys)@1960.78/s n=6000
      ec: 3.138 real, 3.140 cpu (3.140 user + 0.000 sys)@4777.07/s n=15000
  unfold: 3.045 real, 3.050 cpu (3.050 user + 0.000 sys)@2459.02/s n=7500
 unfoldr: 3.044 real, 3.040 cpu (3.040 user + 0.000 sys)@5550.99/s n=16875
      do: 3.076 real, 3.090 cpu (3.090 user + 0.000 sys)@7281.55/s n=22500
tabulate: 3.030 real, 3.040 cpu (3.040 user + 0.000 sys)@8509.87/s n=25870

              Rate  eager   lazy     ec unfold unfoldr     do tabulate
   eager   3847/s      --  1.962  0.805  1.564   0.693  0.528    0.452
    lazy   1961/s   0.510     --  0.410  0.797   0.353  0.269    0.230
      ec   4777/s   1.242  2.436     --  1.943   0.861  0.656    0.561
  unfold   2459/s   0.639  1.254  0.515     --   0.443  0.338    0.289
 unfoldr   5551/s   1.443  2.831  1.162  2.257      --  0.762    0.652
      do   7282/s   1.893  3.714  1.524  2.961   1.312     --    0.856
tabulate   8510/s   2.212  4.340  1.781  3.461   1.533  1.169       --
#<undef>
Tags: Performance, unfold, srfi-42
2013/01/29
New REPL
Upcoming 0.9.4 will have a slightly improved REPL---eval result history is available via the variables *1 etc., and it also shows the current module in the prompt if you switch from the default user module. (See Better REPL? for the history feature. Other features listed there haven't been ported.)
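For example (a made-up session to illustrate; the exact prompt format may differ in the release):

gosh> (expt 2 10)
1024
gosh> (+ *1 1)    ; *1 holds the previous result
1025
gosh> (select-module gauche.interactive)
gosh[gauche.interactive]>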
I admit it's not much. The point is that we now have a scaffold that can easily be extended later to try out new ideas. The new REPL is written in Scheme (in the gauche.interactive module) and loaded automatically when gosh starts in interactive mode.
Here I'd like to write about the implementation rather than the features.
Writing a REPL in Scheme is trivial. There's a catch, though.
We use gosh as a script interpreter as well as an interactive shell. When it runs as a script interpreter, we don't need the REPL and we don't want to carry the extra baggage. (Currently the extended REPL is pretty simple, but it'll get bigger as we throw in more features.)
This leads to the idea of implementing the REPL in a separate module that is loaded only when gosh is in interactive mode. We already have such a module: gauche.interactive.
However, during development of Gauche itself, it is often the case that gosh fails to load gauche.interactive due to a bug. Writing a REPL is like fixing a roof while you're standing on it; you can get stuck, or fall, if you make a mistake. Even in such a state we need some sort of basic REPL to investigate and fix the problem.
And we don't want to duplicate similar loop code. The difference between the basic REPL and the extended one is in the evaluator, printer, etc., but the core loop construct should be shared.
So here's what I did.
- The core libgauche implements a built-in REPL, with the default reader, evaluator, printer and prompter as the minimal defaults. This is what we have in libeval.scm now; it is compatible with the one we had until 0.9.3, except that one was written in C.

  (define-in-module gauche (read-eval-print-loop :optional (reader #f)
                                                 (evaluator #f)
                                                 (printer #f)
                                                 (prompter #f))
    (let ([reader    (or reader read)]
          [evaluator (or evaluator eval)]
          [printer   (or printer %repl-print)]
          [prompter  (or prompter %repl-prompt)])
      (let loop1 ()
        (and (with-error-handler
                 (^e (report-error e) #t)
               (^[] (let loop2 ()
                      (prompter)
                      (let1 exp (reader)
                        (and (not (eof-object? exp))
                             (receive results (evaluator exp (vm-current-module))
                               (apply printer results)
                               (loop2)))))))
             (loop1))))))
- The gauche.interactive module implements an extended evaluator (%evaluator) and prompter (%prompter). In the future, we'll also extend the reader and printer. It then defines its own read-eval-print-loop, wrapping the built-in REPL with the extended handlers:

  (define (read-eval-print-loop :optional (reader #f) (evaluator #f)
                                (printer #f) (prompter #f))
    (let ([evaluator (or evaluator %evaluator)]
          [prompter  (or prompter %prompter)])
      ((with-module gauche read-eval-print-loop)
       reader evaluator printer prompter)))
- The main.c of gosh imports gauche.interactive into the user module when it is invoked in interactive mode, then evaluates (read-eval-print-loop). If gauche.interactive has been loaded successfully, the extended version is called, because it shadows the built-in version. If gauche.interactive can't be loaded, however, the built-in version is called.
Tags: 0.9.4, repl, gauche.interactive
2013/01/28
Curse of extended floating point arithmetic
In hindsight, Intel's decision to use 80-bit extended-precision arithmetic internally in the x87 was unfortunate. 64-bit double precision became pervasive, and the curse of double rounding has bitten a number of careless programmers.
And it finally bit me while I was implementing a more reasonable flonum-to-rational conversion. To be precise, the issue had been there all along, but that feature gave me a concrete example that exhibits it.
★ ★ ★
The new exact procedure tries to find the simplest rational number within the range that a flonum covers. With that scheme, 1.0000000000000002 (the "next" floating-point number above 1.0 in double precision) becomes 3002399751580332/3002399751580331.
gosh> (exact 1.0000000000000002)
3002399751580332/3002399751580331
However, users on i686 systems reported that converting the rational back to a flonum yielded 1.0000000000000004---that is, one unit greater than the original number.
Let's see exactly what was happening. First, save the big numerator and denominator in variables for convenience.
gosh> (define x 3002399751580332)
x
gosh> (define y 3002399751580331)
y
When we divide x by y with double precision arithmetic, what will we get?
gosh> (quotient&remainder (* x (expt 2 52)) y)
4503599627370497
1501199875790165
gosh> (* 1501199875790165 2)
3002399751580330   ; smaller than y=3002399751580331
Since twice the remainder is slightly less than the divisor, we round the result down, and the mantissa becomes 4503599627370497, or, in binary:

1000...0001
          ^ 53rd bit
In a normalized double-precision number, the leading 1 (the MSB) is hidden, and the remaining 52 bits fit into the 52-bit mantissa field.
The fact that twice the remainder is so close to the divisor makes this division susceptible to rounding errors. Let's look at the extended-precision division. The 80-bit extended floating-point format does not use a hidden bit, and its fraction part is 63 bits wide.
gosh> (quotient&remainder (* x (expt 2 63)) y)
9223372036854778879
3002399751579307
The quotient, in binary notation, is as follows:

1000...000101111111111
          ^ 53rd bit
However, the remainder is greater than half of the divisor, so the value is rounded up in the FPU register:

1000...000110000000000
          ^ 53rd bit
When this extended-precision number is stored into a double-precision variable, the bits beyond the 53rd are rounded off again. Since the value falls exactly at the midpoint between two double-precision numbers, the "round-to-even" rule applies, and the result becomes...

1000...0010
          ^ 53rd bit
Alas, the trap of double-rounding.
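The whole scenario can be replayed in exact integer arithmetic, with no FPU involved. Here's a small sketch (round-quotient is a throwaway helper of mine, not a Gauche procedure):

;; Round x*2^k/y to an integer: round-to-nearest, ties-to-even.
(define (round-quotient x y k)
  (receive (q r) (quotient&remainder (* x (expt 2 k)) y)
    (cond [(> (* 2 r) y) (+ q 1)]
          [(< (* 2 r) y) q]
          [else (if (even? q) q (+ q 1))])))

;; Rounding x/y once to 53 significant bits gives the correct ...0001:
(round-quotient x y 52)                                  ; => 4503599627370497

;; Rounding to 64 bits first (x87 extended), then to 53, gives ...0010:
(round-quotient (round-quotient x y 63) (expt 2 63) 52)  ; => 4503599627370498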
★ ★ ★
We can set up the control register to force the FPU to round to double precision on every operation. The way to do so depends on the OS, though, so I had been avoiding it. This time there seemed to be no choice but to use it---I could have carried out the division in bignums, as shown above, and manually fit the calculated 52 bits into the flonum's mantissa field, but that would have been a huge performance loss, given that the FPU yields correct results in almost all cases.
I tried to find a cheap test for whether an input falls into this kind of special case---if I could do that, I'd apply the test first and do the expensive calculation only when needed---but I couldn't find one. I ended up adding special C preprocessor macros in src/gauche/float.h to set up the FPU registers depending on the OS.
2013/01/24
Searching multibyte strings
While integrating a patch into string.c, I remembered a couple of optimizations to string-scan that I had planned to add but had long forgotten.
The current code uses Boyer-Moore if both the source string and the search string are single-byte, but falls back to a naive search when either one is multibyte. That's because, if the internal encoding is EUC-JP or Shift_JIS, a bytewise scan could lead to a spurious match (one matching at incorrect character boundaries). However, utf-8 is designed so that such a spurious match is impossible---lead bytes and continuation bytes occupy disjoint ranges---so I just made it always use Boyer-Moore search.
Another possible optimization applies when the search string has length 1. You might think you'd search for the character directly rather than a length-1 string in that case. Unfortunately, one character may occupy multiple bytes, so the current "search a character in a string" naively converts it into a search for a length-1 string. I just changed it to handle the single-byte case specially.
The optimization is pretty effective. The following simple benchmark searches for a multibyte and a single-byte string within a long multibyte string. (It shows a slowdown when the multibyte match is at the beginning of the source string, since the optimized version has the overhead of setting up the Boyer-Moore skip table. But in such cases the search returns quickly anyway, so the degradation won't be a problem.)
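The driver code isn't shown; here's a rough reconstruction of what it might have looked like (the haystack, needle and label are my guesses):

(use gauche.time)

;; A long multibyte haystack with the needle at the very end ("last");
;; the "middle" and "first" variants would place it accordingly.
(define haystack (string-append (make-string 100000 #\あ) "終"))

(define (bench label needle)
  (format #t "~a ~s\n" label
          (time-this 100000 (^[] (string-scan haystack needle)))))

(bench 'last-mb "終")   ; multibyte needle, match at the end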
Optimized:
last-mb   #<time-result 100000 times/ 17.202 real/ 17.170 user/ 0.040 sys>
middle-mb #<time-result 100000 times/  8.700 real/  8.700 user/ 0.000 sys>
first-mb  #<time-result 100000 times/  0.023 real/  0.020 user/ 0.000 sys>
last-sb   #<time-result 100000 times/  8.827 real/  8.830 user/ 0.000 sys>
middle-sb #<time-result 100000 times/  4.530 real/  4.520 user/ 0.000 sys>
first-sb  #<time-result 100000 times/  0.011 real/  0.020 user/ 0.000 sys>
Original:
last-mb   #<time-result 100000 times/ 29.267 real/ 29.240 user/ 0.020 sys>
middle-mb #<time-result 100000 times/ 14.604 real/ 14.590 user/ 0.010 sys>
first-mb  #<time-result 100000 times/  0.014 real/  0.020 user/ 0.000 sys>
last-sb   #<time-result 100000 times/ 26.303 real/ 26.290 user/ 0.020 sys>
middle-sb #<time-result 100000 times/ 13.149 real/ 13.140 user/ 0.000 sys>
first-sb  #<time-result 100000 times/  0.013 real/  0.010 user/ 0.000 sys>
Tags: string-scan, Strings
2013/01/23
Fun with primes
Every year, during the New Year's holidays, I tend to play with writing algorithms that aren't directly related to work. (In Japanese tradition, we do calligraphy at the beginning of each year, called 書き初め (first writing). Mine is a bit like that---first coding.)
Last year I played with procedural modeling using Markov-chain Monte Carlo (WiLiKi:Gauche:MetropolisProceduralModeling). Initially I planned to pick some graphics-related papers again. However, @cddddr wrote a nice entry on generating an infinite prime sequence taking advantage of Gauche's generators, and I thought that'd be something nice to have in Gauche, so I decided to add a math.prime module and throw in some utilities related to prime numbers.
The first thing the module offers is an infinite lazy sequence of prime numbers, *primes*. I optimized the original code by @cddddr a bit, notably to avoid using continuations; currently it can generate the first 10^7 primes in 14 seconds on a 2.4GHz Core2 machine.
gosh> (time (list-ref *primes* #e1e7))
;(time (list-ref *primes* 10000000))
; real  14.297
; user  14.060
; sys    0.240
179424691
Primality tests also belong in the math.prime module, right? I only knew the Miller-Rabin probabilistic primality test, so I googled a bit to hunt for other options. I found a few interesting things.
- Miller-Rabin tests don't need to be probabilistic. Up to certain bounds, the test is known to give a definite answer with selected "witness" values. Wikipedia lists a bound of 341550071728321---for a number below that, we can say definitively whether or not it's prime, using only the first 7 primes (2 to 17) as witnesses; a sketch of such a test follows this list. Increasing the number of witnesses pushes the bound higher, though I found that the following BPSW test becomes faster than 8 or more rounds of Miller-Rabin.
- The Baillie-PSW primality test, like Miller-Rabin, may report a pseudoprime as a prime, but it has been verified that no such pseudoprime exists below 2^64, according to the page (though the page's author says an independent check is needed). So up to that range it can be used as a deterministic test.
- Then there's the AKS primality test, a deterministic polynomial-time primality test.
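Here's the deterministic Miller-Rabin variant mentioned in the first item, as a self-contained sketch (my own illustration; not the actual math.prime code):

(use srfi-1)                            ; every

;; Split n-1 into d * 2^s with d odd.
(define (decompose n)
  (let loop ([d (- n 1)] [s 0])
    (if (odd? d) (values d s) (loop (quotient d 2) (+ s 1)))))

;; #t iff witness a fails to prove n composite.
(define (mr-pass? n a)
  (receive (d s) (decompose n)
    (let1 x (expt-mod a d n)
      (or (= x 1)
          (let loop ([x x] [i 0])
            (cond [(= x (- n 1)) #t]
                  [(>= i (- s 1)) #f]
                  [else (loop (expt-mod x 2 n) (+ i 1))]))))))

;; Deterministic for n below 341550071728321, per the bound cited above.
(define (deterministic-prime? n)
  (cond [(< n 2) #f]
        [(even? n) (= n 2)]
        [else (every (^a (or (>= a n) (mr-pass? n a)))
                     '(2 3 5 7 11 13 17))]))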
I implemented the Miller-Rabin and BPSW tests but haven't done AKS; I ran out of time.
What else would be nice to have? How about prime factorization? This, again, is a very interesting topic with a variety of algorithms to explore. I found this answer on Stack Overflow and used it as a starting point. I ended up writing Monte-Carlo factorization (Brent's modification of Pollard's rho algorithm), as well as the naive "divide by every prime" approach.
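For flavor, here's the core idea in its simplest form---plain Pollard's rho with Floyd's cycle detection, rather than the Brent variant the module uses (a sketch of mine, not the math.prime code):

;; Try to find a nontrivial factor of composite n using the iteration
;; x -> x^2 + c mod n; returns #f if the walk degenerates, in which
;; case the caller should retry with a different constant c.
(define (rho-factor n c)
  (define (f x) (modulo (+ (* x x) c) n))
  (let loop ([slow 2] [fast 2])
    (let* ([slow (f slow)]
           [fast (f (f fast))]
           [d (gcd (abs (- slow fast)) n)])
      (cond [(= d n) #f]
            [(> d 1) d]
            [else (loop slow fast)]))))

(rho-factor 19740274219868223167 1)  ; => a factor such as 2971215073, or #f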
There's something fascinating in prime factorization. I found myself blankly typing random numbers to factorize, again and again...
gosh> (mc-factorize 435728572856238754559)
(104579 1018217 4091958413)
gosh> (mc-factorize 984597945623658375937)
(79 12463265134476688303)
When the number gets to around 20 decimal digits, Monte-Carlo factorization sometimes takes a very long time. Factoring 19740274219868223167 took several minutes.
gosh> (mc-factorize 19740274219868223167)
(2971215073 6643838879)
I ran mc-factorize on the 7th Fermat number (2^128+1) for more than a day, but it couldn't find the factors. I thought I'd stop here, but when I tried WolframAlpha it factorized those big numbers without breaking a sweat, in seconds. OK, I'll proceed to elliptic curve factorization... some day...
The module also provides the Jacobi symbol (used in the BPSW test) and Euler's totient function (φ(n), the number of integers from 1 to n that are coprime to n). Following Wikipedia links reveals more and more interesting functions (the Möbius function? What a cool name!), but it'd be an endless process, so I had to stop, for now.
I have no prospect of using this module for any practical work so far---if you need serious calculation in these fields, there already exist more general and optimized packages like PARI/GP, so I think the math.prime module has little practical importance. Nonetheless, it was fun to make these things run---and who knows, someday it might come in handy.
Tags: math.prime, Prime