2013/03/12
And here comes random data generators
I just checked in data.random, a collection of random
data generators and their combinators. The names of API functions
are not yet fixed, but I think it's in good shape overall.
(Since 0.9.4 is overdue, I might release it without making data.random official. I'm not sure yet.)
Here's the code: http://gauche.git.sourceforge.net/git/gitweb.cgi?p=gauche/Gauche;a=blob;f=lib/data/random.scm;hb=HEAD
It provides a bunch of primitive random generators such as the following.
- uniform distribution
  - (integer size :optional (start 0)) returns a generator that produces random integers between start and start+size-1, uniformly.
  - (integer-between lo hi) returns a generator that produces random integers between lo and hi (both inclusive).
  - int8, uint8 etc. are preset generators that produce the ranges their names suggest.
  - (char :optional cset) returns a generator of random characters from a character set. When cset is omitted, we use #[A-Za-z0-9] as the default character set.
  - We also have boolean, real and real-between.
  - We want to have exact rational generators and complex generators, but I wonder how the range and distribution should be specified.
- nonuniform distribution
  - For discrete sampling, we have the geometric and poisson distributions.
  - For continuous sampling, we have the normal and exponential distributions.
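For illustration, here's a rough sketch of how I expect these to be used (the names follow the descriptions above and may still change):

(use data.random)
(use gauche.generator)

(define d6 (integer-between 1 6))  ; a generator is a thunk; each call yields a fresh value
(d6)                               ; => e.g. 4
(generator->list d6 3)             ; => e.g. (2 6 1)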
Then, those generators can be combined to make more complex generators.
- random choice
  - (one-of generators) returns a generator that randomly picks one generator from generators to produce each value.
  - (weighted-sample weight&generators) allows you to specify the selection probability (weight) for each generator.
- aggregate data
  - (pair-of gen1 gen2), (tuple-of gen ...)
  - list-of, vector-of, string-of: these combinators can be called in two different forms, e.g.
    - (list-of sizer item-gen): sizer can be an integer, or an integer-generator, to give the length of the resulting list. item-gen is a generator to produce elements.
    - (list-of item-gen): If sizer is omitted, we use some default generator to determine the length of the resulting list. Currently I use (poisson 4) provisionally.
I also have permutation-of and combination-of, which take
a list of items (not item generators).
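A quick sketch of how these compose (again using the names above; the sample outputs are made up):

(use data.random)
(use gauche.generator)

;; pairs of a lowercase letter and a digit
(define kv (pair-of (char #[a-z]) (char #[0-9])))
(generator->list kv 2)             ; => e.g. ((#\q . #\7) (#\c . #\0))

;; lists of die rolls, with the list length drawn from (integer 5 1)
(define rolls (list-of (integer 5 1) (integer-between 1 6)))
(generator->list rolls 2)          ; => e.g. ((3 1 6) (2 5))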
What I like about the current shape is that these generators can also be
combined using the gauche.generator framework; e.g. you
can get a series of sums of two dice rolls by:
(gmap + (integer-between 1 6) (integer-between 1 6))
or apply a filter:
(gfilter (cut < 0 <> 1) (exponential 1))
or take some values into a list:
(generator->list (poisson 5) 10)
Here are some API points I'm still pondering:
- We have procedures that create a generator (e.g. integer, real, char) and pre-created generators (e.g. fixnum, int8). Without static typing support, these two layers could be confusing. Shall we use some naming convention to distinguish them?
- There's an idea rolling in my head to provide plural names as aliases, e.g. chars for char. It plays nicely with the combinators, e.g. (list-of fixnums) or (string-of 5 (chars)). But I also feel this is just a superficial convenience; we'd double the number of exported names without adding anything functionally.
- The handling of the omitted argument of list-of etc. is also different from Gauche's convention for optional arguments.
If you have data generator ideas to throw into this module, let me know.
Now I'm writing a generative test framework, using this module as the data generator.
Tags: data.random, Generators
2013/03/11
Math fun
Recently I added a few math functions to the core of Gauche, since they were needed to implement some more domain-specific functions.
Just committed are gamma and lgamma:
the Gamma function and the natural logarithm of
the absolute value of the Gamma function.
(What prompted me to add them was the random variable of the
Poisson distribution; the algorithm I found needed log(n!).
And the reason I was doing the Poisson distribution thing was
that I was writing a general random data generator module and
I wanted to cover typical distributions. And the reason I
was writing such a module was that it was needed for a
generative test framework. It's like a domino effect; I'm writing
X, then find X needs Y, then Y needs Z, etc...)
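For example, since n! = Γ(n+1), log(n!) is a single lgamma call; a quick check in gosh (values approximate):

(lgamma 11)      ; log(10!), about 15.1044
(log 3628800)    ; 10! = 3628800, so this gives the same value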
Some time ago I also added expt-mod, which calculates modular
exponentiation.
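For example (a small check; I assume the argument order is base, exponent, modulus):

(expt-mod 3 5 7)       ; => 5, same as (modulo (expt 3 5) 7)
(expt-mod 2 1000 97)   ; works without building the huge intermediate 2^1000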
I used to worry about adding too much stuff to the core library, and tried to gather functions into a module whenever I could attach a suitable name to the bunch. However, some functions are fundamental enough that they appear in different fields now and then, and the core library may be the right place for them after all.
2013/02/23
Land of Lisp on Gauche
I've been working on translating Conrad Barski's "Land of Lisp" to Japanese, and it finally hit the bookstores in Japan (published by O'Reilly Japan). To celebrate that, I ported the code in the book to Gauche.
Since Gauche adopts quite a few Common Lisp idioms, the port is pretty straightforward. The code may serve as an example to show how to translate CL-style code into Gauche.
Here are some observations.
- Most of CL's loop use cases can be written using srfi-42.
  - The trick is that, in CL, you think "OK, I loop over this (list, vector, integer, etc.) and I want to get this (list, sum, etc.)", while in srfi-42, you first think about what you want (OK, I want a list, so I use list-ec; OK, I want to stop iteration when any of the elements satisfies a condition, so I use any?-ec), then you think about what to iterate over.
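  For example, collecting values and testing a condition with srfi-42 look like this (a small illustration; roughly what CL's loop would do with collect or thereis):

    (use srfi-42)
    (list-ec (: i 10) (* i i))               ; => (0 1 4 9 16 25 36 49 64 81)
    (any?-ec (: x '(1 3 5 6 7)) (even? x))   ; => #t, stops as soon as it hits 6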
- CL has many idioms that take advantage of () being boolean false, and of car/cdr of nil being nil. It is not a good idea to carry over those idioms to Scheme, or you'll be frustrated, feeling that Scheme's distinction between #f and () gets in your way. It just needs a different mindset (complaining about it is like a C programmer complaining that 0 isn't false in some languages).
  - For example, in CL, you can access the value associated with a key in an alist by the following code, without worrying about the case where the key isn't found:

    (cdr (assoc key alist))

    You'll be annoyed that in Scheme, assoc returns #f when the key isn't found, and cdr barfs. In Gauche, the recommended way is as follows:

    (assoc-ref alist key :optional default)
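    For instance:

    (define alist '((a . 1) (b . 2)))
    (assoc-ref alist 'b)     ; => 2
    (assoc-ref alist 'z 0)   ; => 0, the default, instead of an error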
- Another common CL idiom is to build a list conditionally:
    (append (list up down)
            (unless (zerop (mod pos *board-size*))
              (list (1- up) (1- pos)))
            (unless (zerop (mod (1+ pos) *board-size*))
              (list (1+ pos) (1+ down))))

  This idiom uses the fact that unless returns nil when the condition is satisfied. In Gauche, unless returns #<undef> so this doesn't work. Instead of falling back to more verbose if, you can use cond-list.

    (cond-list [#t @ (list up down)]
               [(not (zero? (mod pos *board-size*))) @ (list (- up 1) (- pos 1))]
               [(not (zero? (mod (+ pos 1) *board-size*))) @ (list (+ pos 1) (+ down 1))])
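  To see cond-list on its own: each clause whose test is true contributes its value, and @ splices a list into the result (a small illustration):

    (cond-list [#t 1]
               [#f 2]
               [#t @ '(3 4)])   ; => (1 3 4)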
- The lazy evaluation stuff built in Chapter 18 (lazy.lisp) comes free in Gauche, as the util.stream module. The lazy.scm file in the Gauche version only shows the correspondence between the two APIs.
- I have a plan to make format CL-compatible. It's not done yet, so I had to rewrite the code that used advanced formatting directives in plain Scheme.
- CL's defstruct code is translated to define-record-type straightforwardly, except that our record type can't set a default value when a constructor argument is omitted. (R6RS records can do that using the protocol mechanism. I prefer an easier way, though.)
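As a rough sketch of that translation (field defaults aside), a CL (defstruct monster health) becomes something like the following; the monster type here is made up for illustration:

(use gauche.record)   ; define-record-type

(define-record-type monster
  (make-monster health)
  monster?
  (health monster-health set-monster-health!))

(monster-health (make-monster 10))   ; => 10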
Tag: CommonLisp
2013/02/04
Building a list with index---which way is faster?
It is often necessary to create a list of length N whose k-th
element is calculated from k, e.g. (f k).
The shortest way is to map f over a sequence of indices:
(map f (iota N))
If it's a throwaway script I just do this. But for
reusable code, creating a temporary list with (iota N) looks
kind of wasteful. Is there a better way?
Using a lazy list for the indices avoids creating the full intermediate list at once. However, we end up realizing the entire list anyway, so the overhead of lazy calculation can be a drag.
(map f (liota N))
If f is a somewhat complicated expression, I like to use eager
comprehensions. This avoids the intermediate list entirely.
The only downside is when I don't want to depend on
the srfi-42 module (there's nothing wrong with it, but
sometimes pulling in the entire srfi-42 just for this small
expression seems overkill).
(list-ec (: i n) (f i))
What else? The comprehensive srfi-1 provides the universal
list builders unfold and unfold-right.
(unfold (cut = n <>) f (cut + <> 1) 0)
(unfold-right negative? f (cut - <> 1) (- n 1))
And this is the freshman's code.
(do ([i 0 (+ i 1)]
     [r '() (cons (f i) r)])
    [(= i n) (reverse r)])
Now, time to benchmark! To measure the difference in
performance of these looping constructs, I just pass -
as f.
(define (timeit n)
(time-these/report '(cpu 3)
`((eager . ,(^[] (map - (iota n))))
(lazy . ,(^[] (map - (liota n))))
(ec . ,(^[] (list-ec (: i n) (- i))))
(unfold . ,(^[] (unfold (cut = n <>) -
(cut + <> 1)
0)))
(unfoldr . ,(^[] (unfold-right negative? -
(cut - <> 1)
(- n 1))))
(do . ,(^[] (do ([i 0 (+ i 1)]
[r '() (cons (- i) r)])
[(= i n) (reverse r)]))))))
And the winner is... (drum)
gosh> (timeit 10)
Benchmark: ran eager, lazy, ec, unfold, unfoldr, do, each for at least 3 cpu seconds.
eager: 3.000 real, 3.030 cpu (3.030 user + 0.000 sys)@275029.70/s n=833340
lazy: 3.265 real, 3.280 cpu (3.270 user + 0.010 sys)@139736.89/s n=458337
ec: 3.100 real, 3.080 cpu (3.080 user + 0.000 sys)@243506.49/s n=750000
unfold: 3.072 real, 3.070 cpu (3.060 user + 0.010 sys)@274837.13/s n=843750
unfoldr: 3.275 real, 3.280 cpu (3.280 user + 0.000 sys)@474576.22/s n=1556610
do: 3.302 real, 3.270 cpu (3.260 user + 0.010 sys)@663932.42/s n=2171059
Rate eager lazy ec unfold unfoldr do
eager 275030/s -- 1.968 1.129 1.001 0.580 0.414
lazy 139737/s 0.508 -- 0.574 0.508 0.294 0.210
ec 243506/s 0.885 1.743 -- 0.886 0.513 0.367
unfold 274837/s 0.999 1.967 1.129 -- 0.579 0.414
unfoldr 474576/s 1.726 3.396 1.949 1.727 -- 0.715
do 663932/s 2.414 4.751 2.727 2.416 1.399 --
#<undef>
gosh> (timeit 100)
Benchmark: ran eager, lazy, ec, unfold, unfoldr, do, each for at least 3 cpu seconds.
eager: 3.266 real, 3.260 cpu (3.240 user + 0.020 sys)@36154.91/s n=117865
lazy: 3.034 real, 3.030 cpu (3.030 user + 0.000 sys)@19042.90/s n=57700
ec: 3.072 real, 3.070 cpu (3.060 user + 0.010 sys)@40719.87/s n=125010
unfold: 3.167 real, 3.160 cpu (3.150 user + 0.010 sys)@37299.05/s n=117865
unfoldr: 3.221 real, 3.230 cpu (3.230 user + 0.000 sys)@53212.07/s n=171875
do: 3.220 real, 3.220 cpu (3.220 user + 0.000 sys)@71172.05/s n=229174
Rate eager lazy ec unfold unfoldr do
eager 36155/s -- 1.899 0.888 0.969 0.679 0.508
lazy 19043/s 0.527 -- 0.468 0.511 0.358 0.268
ec 40720/s 1.126 2.138 -- 1.092 0.765 0.572
unfold 37299/s 1.032 1.959 0.916 -- 0.701 0.524
unfoldr 53212/s 1.472 2.794 1.307 1.427 -- 0.748
do 71172/s 1.969 3.737 1.748 1.908 1.338 --
#<undef>
gosh> (timeit 1000)
Benchmark: ran eager, lazy, ec, unfold, unfoldr, do, each for at least 3 cpu seconds.
eager: 3.264 real, 3.250 cpu (3.250 user + 0.000 sys)@3846.15/s n=12500
lazy: 3.024 real, 3.020 cpu (3.020 user + 0.000 sys)@1986.75/s n=6000
ec: 3.080 real, 3.090 cpu (3.090 user + 0.000 sys)@4414.24/s n=13640
unfold: 3.127 real, 3.130 cpu (3.130 user + 0.000 sys)@2284.35/s n=7150
unfoldr: 3.277 real, 3.290 cpu (3.290 user + 0.000 sys)@5453.19/s n=17941
do: 3.221 real, 3.220 cpu (3.220 user + 0.000 sys)@7320.81/s n=23573
Rate eager lazy ec unfold unfoldr do
eager 3846/s -- 1.936 0.871 1.684 0.705 0.525
lazy 1987/s 0.517 -- 0.450 0.870 0.364 0.271
ec 4414/s 1.148 2.222 -- 1.932 0.809 0.603
unfold 2284/s 0.594 1.150 0.517 -- 0.419 0.312
unfoldr 5453/s 1.418 2.745 1.235 2.387 -- 0.745
do 7321/s 1.903 3.685 1.658 3.205 1.342 --
#<undef>
The freshman's code using do is the clear winner.
(I'm using git HEAD).
This is a bit disappointing, for my goal is to make the most concise and clear code run the fastest. I don't like being forced to use lower-level abstractions just for performance, since the code eventually gets cluttered with those "hand optimizations". It points to a shortcoming of the implementation.
For this goal, I guess I should optimize the eager comprehension---
it is a macro, so it should be able to generate code as efficient
as a hand-coded do loop in such a simple case!
The performance of unfold-right is notable.
I don't use the unfold family much, just because I always
forget the order of the arguments and need to look them up in the manual.
Maybe I should make myself more familiar with it.
For the readers: don't take this entry as "you should use a do
loop for performance". It's fast in the current
version of Gauche, but I'll improve things over time so that
the concise code can run as fast as the expanded do loop.
★ ★ ★
(Edit: 2013/02/05 10:32:47 UTC): Peter Bex pointed me to list-tabulate, which
is exactly for this kind of work. Aha! Here's the updated
benchmark. Yes, list-tabulate is the winner!
(define (timeit n)
(time-these/report '(cpu 3)
`((eager . ,(^[] (map - (iota n))))
(lazy . ,(^[] (map - (liota n))))
(ec . ,(^[] (list-ec (: i n) (- i))))
(unfold . ,(^[] (unfold (cut = n <>) -
(cut + <> 1)
0)))
(unfoldr . ,(^[] (unfold-right negative? -
(cut - <> 1)
(- n 1))))
(do . ,(^[] (do ([i 0 (+ i 1)]
[r '() (cons (- i) r)])
[(= i n) (reverse r)])))
(tabulate . ,(^[] (list-tabulate n -))))))
gosh> (timeit 1000)
Benchmark: ran eager, lazy, ec, unfold, unfoldr, do, tabulate, each for at least 3 cpu seconds.
eager: 3.312 real, 3.300 cpu (3.300 user + 0.000 sys)@3846.67/s n=12694
lazy: 3.056 real, 3.060 cpu (3.060 user + 0.000 sys)@1960.78/s n=6000
ec: 3.138 real, 3.140 cpu (3.140 user + 0.000 sys)@4777.07/s n=15000
unfold: 3.045 real, 3.050 cpu (3.050 user + 0.000 sys)@2459.02/s n=7500
unfoldr: 3.044 real, 3.040 cpu (3.040 user + 0.000 sys)@5550.99/s n=16875
do: 3.076 real, 3.090 cpu (3.090 user + 0.000 sys)@7281.55/s n=22500
tabulate: 3.030 real, 3.040 cpu (3.040 user + 0.000 sys)@8509.87/s n=25870
Rate eager lazy ec unfold unfoldr do tabulate
eager 3847/s -- 1.962 0.805 1.564 0.693 0.528 0.452
lazy 1961/s 0.510 -- 0.410 0.797 0.353 0.269 0.230
ec 4777/s 1.242 2.436 -- 1.943 0.861 0.656 0.561
unfold 2459/s 0.639 1.254 0.515 -- 0.443 0.338 0.289
unfoldr 5551/s 1.443 2.831 1.162 2.257 -- 0.762 0.652
do 7282/s 1.893 3.714 1.524 2.961 1.312 -- 0.856
tabulate 8510/s 2.212 4.340 1.781 3.461 1.533 1.169 --
#<undef>
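For reference, (list-tabulate n f) from srfi-1 simply builds ((f 0) (f 1) ... (f (- n 1))) without any intermediate index list:

(list-tabulate 5 -)              ; => (0 -1 -2 -3 -4)
(list-tabulate 5 (^i (* i i)))   ; => (0 1 4 9 16)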
Tags: Performance, unfold, srfi-42
2013/01/29
New REPL
The upcoming 0.9.4 will have a slightly improved REPL---eval result history
is available via the variables *1 etc., and it also shows the current module in the prompt if you switch from the default user module.
(See Better REPL? for the history feature. Other features listed there haven't been ported.)
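A tiny made-up session to illustrate the history feature:

gosh> (expt 2 16)
65536
gosh> (/ *1 2)
32768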
I admit it's not much. The point is that we now have a scaffold that can easily be extended later to try out new ideas. The new REPL is written in Scheme (in the gauche.interactive module) and loaded automatically when gosh starts in interactive mode.
Here I'd like to write about the implementation rather than the features.
Writing a REPL in Scheme is trivial. There's a catch, though.
We use gosh as a script interpreter as well as an interactive shell. When it runs as a script interpreter, we don't need a REPL and we don't want to carry the extra baggage. (Currently the extended REPL is pretty simple, but it'll get bigger as we throw in more features.)
This leads to the idea of implementing the REPL in a separate module that is loaded only when gosh is in interactive mode. We already have such a module: gauche.interactive.
However, during development of Gauche itself, it is often the case that gosh fails to load gauche.interactive due to a bug. Writing a REPL is like fixing a roof while you're standing on it; you can get stuck or fall if you make a mistake. Even in such a state we need some sort of basic REPL to investigate and fix the problem.
And we don't want to duplicate similar loop code. The basic REPL and the extended one differ in the evaluator, printer, etc., but the core loop construct should be shared.
So here's what I did.
- The core libgauche implements a built-in REPL, with the default reader, evaluator, printer and prompter as the minimal defaults. This is what we have in libeval.scm now, which is compatible with the one we had until 0.9.3, except that one was written in C.

  (define-in-module gauche (read-eval-print-loop :optional (reader #f)
                                                           (evaluator #f)
                                                           (printer #f)
                                                           (prompter #f))
    (let ([reader    (or reader read)]
          [evaluator (or evaluator eval)]
          [printer   (or printer %repl-print)]
          [prompter  (or prompter %repl-prompt)])
      (let loop1 ()
        (and (with-error-handler
                 (^e (report-error e) #t)
               (^[] (let loop2 ()
                      (prompter)
                      (let1 exp (reader)
                        (and (not (eof-object? exp))
                             (receive results (evaluator exp (vm-current-module))
                               (apply printer results)
                               (loop2)))))))
             (loop1)))))
- The gauche.interactive module implements an extended evaluator (%evaluator) and prompter (%prompter). In the future, we'll also extend the reader and printer. Then it defines its own read-eval-print-loop, wrapping the built-in REPL with the extended handlers:

  (define (read-eval-print-loop :optional (reader #f) (evaluator #f)
                                          (printer #f) (prompter #f))
    (let ([evaluator (or evaluator %evaluator)]
          [prompter  (or prompter %prompter)])
      ((with-module gauche read-eval-print-loop)
       reader evaluator printer prompter)))
- The main.c of gosh imports gauche.interactive into the user module when gosh is invoked in interactive mode, then evaluates (read-eval-print-loop). If gauche.interactive has been loaded successfully, the extended version is called, because it shadows the built-in version. If gauche.interactive can't be loaded, however, the built-in version is called.
Tags: 0.9.4, repl, gauche.interactive
