Gauche Devlog



And here come random data generators

I just checked in data.random, a collection of random data generators and their combinators. The names of the API functions are not yet fixed, but I think overall it's in good shape. (Since 0.9.4 is overdue, I might release it without making data.random official. I'm not sure yet.)

Here's the code: lib/data/random.scm (at HEAD in the Gauche repository)

It provides a bunch of primitive random generators, such as the following.

  • uniform distribution
    • (integer size :optional (start 0)) returns a generator that produces random integers between start and start+size-1, uniformly.
    • (integer-between lo hi) returns a generator that produces random integers between lo and hi (both inclusive).
    • int8, uint8 etc. are preset generators that produce the range their names suggest.
    • (char :optional cset) returns a generator of random characters from a character set. When cset is omitted, we use #[A-Za-z0-9] as the default character set.
    • We also have boolean, real, real-between.
    • We want to have exact rational generators and complex generators, but I wonder how the range and distribution should be specified.
  • nonuniform distribution
    • For discrete sampling, we have geometric and poisson distributions.
    • For continuous sampling, we have normal and exponential distributions.
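
To make the two uniform integer entries above concrete, here is a minimal sketch of how such generators could be built. It is not the module's actual code: the sketch- names are hypothetical, and using srfi-27 as the random source is an assumption. The one fact it relies on is that a generator in gauche.generator is just a procedure of no arguments that yields a new value on each call.

```scheme
(use srfi-27)            ; random-integer
(use gauche.generator)   ; generator->list

;; Hypothetical stand-in for (integer size :optional (start 0)):
;; yields integers in [start, start+size-1], uniformly.
(define (sketch-integer size :optional (start 0))
  (lambda () (+ start (random-integer size))))

;; Hypothetical stand-in for (integer-between lo hi), both ends inclusive.
;; It is the same generator with the size computed from the bounds.
(define (sketch-integer-between lo hi)
  (sketch-integer (+ (- hi lo) 1) lo))

;; Ten rolls of a six-sided die, collected into a list.
(generator->list (sketch-integer-between 1 6) 10)
```

Because the result is an ordinary thunk, it can be handed to anything in gauche.generator without adaptation.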

Then, those generators can be combined to make more complex generators.

  • random choice
    • (one-of generators) returns a generator that randomly picks one generator from generators to produce the next value.
    • (weighted-sample weight&generators) lets you specify a selection weight for each generator.
  • aggregate data
    • (pair-of gen1 gen2), (tuple-of gen ...)
    • list-of, vector-of, string-of - these combinators can be called in two different forms, e.g.
      • (list-of sizer item-gen): sizer can be an integer, or an integer-generator, to give the length of the resulting list. item-gen is a generator to produce elements.
      • (list-of item-gen): If sizer is omitted, we use some default generator to determine the length of the resulting list. Currently I use (poisson 4) provisionally.
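
The weighted choice and the two calling forms of list-of can be sketched roughly as follows. This is an illustration under assumptions, not the module's implementation: the sketch- names are hypothetical, the argument of the weighted combinator is assumed to be a list of (weight . generator) pairs with positive weights, and the Poisson sampler uses Knuth's multiplication method, which is only reasonable for a small mean.

```scheme
(use srfi-27)            ; random-integer, random-real
(use gauche.generator)   ; generator->list

;; Hypothetical stand-in for weighted-sample.  The argument shape, a list
;; of (weight . generator) pairs, is an assumption of this sketch.
(define (sketch-weighted-sample weight&gens)
  (let1 total (apply + (map car weight&gens))
    (lambda ()
      (let loop ([r (* total (random-real))] [wgs weight&gens])
        (let1 w (caar wgs)
          (if (or (< r w) (null? (cdr wgs)))
            ((cdar wgs))                   ; call the chosen generator
            (loop (- r w) (cdr wgs))))))))

;; Poisson sampler (Knuth's method), used here only as the default sizer.
(define (sketch-poisson mean)
  (let1 limit (exp (- mean))
    (lambda ()
      (let loop ([k 0] [p (random-real)])
        (if (> p limit)
          (loop (+ k 1) (* p (random-real)))
          k)))))

;; Hypothetical stand-in for list-of, callable as
;; (sketch-list-of sizer item-gen) or (sketch-list-of item-gen).
(define sketch-list-of
  (case-lambda
    [(item-gen) (sketch-list-of (sketch-poisson 4) item-gen)]
    [(sizer item-gen)
     (lambda ()
       ;; sizer is either a fixed length or a generator of lengths
       (let1 len (if (integer? sizer) sizer (sizer))
         (generator->list item-gen len)))]))

(define die (lambda () (+ 1 (random-integer 6))))
((sketch-list-of 3 die))   ; a list of exactly three rolls
((sketch-list-of die))     ; length drawn from (sketch-poisson 4)
```

Dispatching on the argument count is what lets the leading sizer be dropped; a trailing :optional parameter could not express an omitted first argument.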

I also have permutation-of and combination-of, which take a list of items (not item generators).
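
As an illustration of a combinator that takes plain items rather than item generators, here is one way a permutation generator could be written. It is only a sketch: sketch-permutation-of is a hypothetical name, and the choice of the Fisher-Yates shuffle is mine, not necessarily what the module does.

```scheme
(use srfi-27)            ; random-integer
(use gauche.generator)   ; generator->list

;; Hypothetical stand-in for permutation-of: each call shuffles a fresh
;; copy of the items with the Fisher-Yates algorithm.
(define (sketch-permutation-of items)
  (lambda ()
    (let1 v (list->vector items)
      (do ([i (- (vector-length v) 1) (- i 1)])
          [(< i 1) (vector->list v)]
        (let* ([j (random-integer (+ i 1))]   ; 0 <= j <= i
               [tmp (vector-ref v i)])
          (vector-set! v i (vector-ref v j))
          (vector-set! v j tmp))))))

;; Three independent random orderings of the same four items.
(generator->list (sketch-permutation-of '(a b c d)) 3)
```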

What I like about the current shape is that those generators can be combined using the gauche.generator framework as well; e.g. you can get a series of sums of two dice rolls with:

    (gmap + (integer-between 1 6) (integer-between 1 6))

or apply a filter:

    (gfilter (cut < 0 <> 1) (exponential 1))

or take some values into a list:

    (generator->list (poisson 5) 10)

Here are some aspects of the API I'm still pondering:

  • We have procedures that create a generator (e.g. integer, real, char) and pre-created generators (e.g. fixnum, int8). Without static typing support, this kind of layering could be confusing. Shall we use some naming convention to distinguish these two layers?
  • There's an idea rolling around in my head to provide plural names as aliases, e.g. chars for char. It plays nicely with the combinators, e.g. (list-of fixnums) or (string-of 5 (chars)). But I also feel this is just a superficial convenience; we double the number of exported names to get nothing added functionally.
  • The handling of the omitted argument of list-of etc. also differs from Gauche's convention for optional arguments.

If you have ideas for data generators to throw into this module, let me know.

Now I'm writing a generative test framework, using this module as the data generator.

Tags: data.random, Generators
