Gauche Devlog


2016/11/25

Splitting generators

(This entry is inspired by a private email exchange.)

Generators (ref:gauche.generator - Generators, srfi:121) are a very handy abstraction for constructing a data-flow pipeline. You can put together small parts, each of which modifies the data stream locally, to carry out the whole computation. You can also concatenate (ref:gappend, ref:gconcatenate, ref:gflatten) or merge (ref:gmerge) generators.
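
For instance, you can pile up a few of these small parts into a pipeline; here's a toy example (not from the original discussion) that takes the first five even squares from an infinite range:

(use gauche.generator)

;; A toy pipeline: square an infinite range, keep the even ones,
;; and take the first five.
(generator->list
 (gtake (gfilter even? (gmap (^n (* n n)) (grange 0))) 5))
  ;; => (0 4 16 36 64)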

But how about splitting? If you think in terms of data flow, you sometimes want to split a stream; say, you have an incoming stream of integers and want two output streams, one with the odd numbers and another with the even ones. Why don't we have gpartition?

Once you start writing it, it becomes immediately apparent that you need unbounded buffering. Suppose you split a generator of integers into odds and evens, and you start reading from the odds. What if the input generator yields even numbers consecutively? You have to save them up so that they can be retrieved later when you read from the "even" generator.

It's doable, using a pair of queues for example, but it'll be a bit messy.
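
For comparison, here's a rough sketch of what such a queue-based version could look like, using util.queue (gpartition/queues is just a name I made up for this sketch):

(use util.queue)

;; Each output generator first drains its own queue; if the queue is
;; empty, it pulls from the input, stashing non-matching items into
;; the other queue for the other output to pick up later.
(define (gpartition/queues pred input-gen)
  (define yes-q (make-queue))
  (define no-q  (make-queue))
  (define (make-output my-q other-q match?)
    (^[] (if (queue-empty? my-q)
           (let loop ()
             (let1 v (input-gen)
               (cond [(eof-object? v) v]
                     [(match? v) v]
                     [else (enqueue! other-q v) (loop)])))
           (dequeue! my-q))))
  (values (make-output yes-q no-q pred)
          (make-output no-q yes-q (complement pred))))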

In fact, there's a much simpler solution.

(use gauche.lazy)
(use gauche.generator)

(define (gpartition pred input-gen)
  (let1 input-lseq (generator->lseq input-gen)
    (values (list->generator (lfilter pred input-lseq))
            (list->generator (lfilter (complement pred) input-lseq)))))

That is, we convert the input generator to a lazy sequence (lseq), create two filtered lazy sequences from it, and convert them back to generators. The buffering is handled implicitly as the realized portion of the lazy sequence.

Let's split a stream of integers into odds and evens:

gosh> (define-values (o e) (gpartition odd? (grange 0)))
e
gosh> (o)
1
gosh> (o)
3
gosh> (o)
5
gosh> (o)
7
gosh> (o)
9
gosh> (e)
0
gosh> (e)
2
gosh> (o)
11

However, if you go this route, you might as well build the whole network with lseqs, without bothering to convert back and forth between generators and lseqs.
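
In that case the partitioning itself becomes even shorter; something like the following sketch (lpartition is a made-up name), which just stays in the lseq world:

(use gauche.lazy)

;; Keep everything as lazy sequences and skip the generator conversion.
(define (lpartition pred lseq)
  (values (lfilter pred lseq)
          (lfilter (complement pred) lseq)))

;; e.g. (define-values (o e) (lpartition odd? (lrange 0)))
;;      (car o) => 1, (cadr o) => 3, (car e) => 0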

This is the sort of unfortunate situation where you have to choose. In a language like Haskell, there's no question that you'd go with lazy sequences in the first place. In Gauche, and in Scheme in general, lazy streams incur a certain overhead compared to eager solutions, so you need to guess, before you start writing code, which way will strike a good balance between code clarity and performance.

Fully lazy streams (like srfi:40 and srfi:41) need to allocate a thunk for each element, unless the implementation has special optimizations. Generators can eliminate per-item allocation altogether. Lightweight lseqs (such as Gauche's ref:Lazy sequences and srfi:127) can be a good compromise, since they allocate just one pair per item, and pair allocation is likely to be optimized in a wider range of implementations. At this moment, I'd suggest going with lseqs as the first choice, and switching to a full-generator design if you can't afford the allocations at all.
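
To make the allocation difference concrete, compare a fully lazy stream (util.stream in Gauche) with an lseq; integers-from below is just a throwaway helper for this sketch:

(use util.stream)   ; fully lazy, srfi-40/41 style streams
(use gauche.lazy)   ; lightweight lazy sequences

;; Every element of the stream is suspended in a promise...
(define (integers-from n) (stream-cons n (integers-from (+ n 1))))
(stream-car (integers-from 0))  ; => 0

;; ...while the realized part of an lseq is just ordinary pairs.
(car (lrange 0))   ; => 0
(cadr (lrange 0))  ; => 1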

Btw, with srfi-127, gpartition can be written as follows:

(use srfi-127)

;; Same idea: lseq-remove keeps the elements that do not satisfy pred.
(define (gpartition pred input-gen)
  (let1 input-lseq (generator->lseq input-gen)
    (values (lseq->generator (lseq-filter pred input-lseq))
            (lseq->generator (lseq-remove pred input-lseq)))))

Gauche has srfi-127 in the development HEAD.

Tags: generator, lseq, srfi-121, srfi-127
