2010/09/19
Code is soft, interface is hard
I was expecting to push 0.9.1 out of the door sometime in June or July, which has long past. The development trunk has 0.9.1_pre1 version tag for months, and I'm using new features are actively without a problem. So what's pulling back the release?
There are couple of new API sets that I'm not sure are right. I like an API that are simple but powerful; easier to do typical things but allowing to do complicated things if you want.
Thanks to the flexibility of Scheme it is relatively easy to have simple calls that covers easy use cases, and later adding options that augment them to handle atypical situations.
However, that approach is susceptible to creeping featurism; ending up tons of options that modifies the behaviour of API in various ways, sometimes incompatibly each other. It forces a programmer to look up references constantly, and when a new atypical situation arises, it is likely that he can't deal with it with existing options and end up modifying the library code to support a new option.
So, it is often the case that difficult challenges are not introducing the first version of a new module that supports the most basic functionalities, but releasing a second version to support complicated use cases.
And while code is malleable, interface is not. Once you release a public interface and other people's writes code on top of that, it becomes very difficult to introduce incompatible changes. The difficulty is that, an interface design needs to be evaluated by the actual applications that uses it, but to have enough applications you have to make the interface widely available, hence making the change more difficult.
At this moment, the biggest issue I'm stuck at is in rfc.http. The original API covers simple use cases and works well for the job. But it has shown its shortcomings as I write more applications; especially HTTP becomes a substrate of more involved protocols.
What I want in the revised rfc.http module:
- Persistent connections: This hasn't been implemented, but the infrastructure is there, and I expect it can be introduced as a natural extension of the exiting API. The current API will be extended to take <http-connection> object in place of the 'server' argument.
- Secure connections: There are practical needs for this, so I hacked up an easy solution, though it may be extended later. Right now I call "stunnel" as external process and delegates SSL handling to it; at API level, a new keyword argument :secure switches the use of SSL. In future, if Gauche supports SSL as in-process extension, we can just switch to it. This option is orthogonal to other options, so it's acceptable.
- Parameterized http method: Current API has different functions for each http method, e.g. http-get, http-post etc. This is good if I just want to retrieve an html page or post an html form. However, when some other protocol uses http as a lower layer, it gets a bit awkward; it is more natural for the code to give http method as a parameter to a single http handling function. This motivates me to create a single point of entry (in the trunk, it is http-request), and redefine existing http-get etc. as convenience procedures on top of it.
- Flexible handling of message body: This is the biggest issue, and I haven't resolved it yet. I'll discuss it in a separate entry.
2010/06/18
div, mod, div0, mod0
In the process of implementing R6RS compatibility functions, I'm trying to represent R6RS's integer division div, mod, div0 and mod0 in terms of R5RS's integer division quotient, modulo and remainder, but I'm always confused with their behaviors when the divisor and/or the divident is/are negative. So I visualize them.
R5RS integer divisions:
R6RS integer divisions:
I have a mixed feeling towards R6RS but I like this particular change; div/mod looks simpler and easier to explain (besides, not restricting arguments to integers is an interesting generalization).
To represent div/mod/div0/mod0 in terms of quotient/modulo/remainder, it looks that I have to branch for each case of 4 combinations of signs of divisor and dividend.
2010/06/11
Constructing circular structures
Since I brought up circular structures in the last entry, I describe another related issue that hasn't solved yet.
Using #n= and #n# notation (srfi:38), you can give an external representation to a circular structure. When read in, you'll get a structure with the same topology as DAG. It is pretty handy in practice.
A simple way to reconstruct a circular structure is that you allocate the container first, and fills its content later; e.g. if you see #1=(#1#), you allocate a cons cell and register it with label '1' first; then when you encounter #1# you set-car! the pointer to the cons cell to the car of the cons cell.
As trivial as it sounds, this scheme doesn't play well with functional style, where once an object is constructed it is immutable. Actually it is more than a matter of style.
We have srfi:10 that allows user-defined structures to have a readable extrenal representation. Which is also pretty handy in practice, and I've used the feature in production works number of times (it's too bad that R6RS conflicts with srfi-10; srfi-10 preceded years before R6RS.)
Unfortunately, srfi-10 requires all the initialization arguments must be present before constructing the object. So, there seems no clear way to mix circular reference and srfi-10, such as the following:
#1=#,(my-record (x y #1#))
Gauche has a very ad-hoc solution for this. In fact, it's too ugly that I haven't made it official, although it's been in the code for years. In the current implementation, when a reader encounters a shared reference that hasn't been realized (#1# in the above example), it creates a special read-reference object to pass to the constructor. When a constructor sees a read-reference object, it can register a callback procedure which will be called when the reference has a concrete value. The callback procedure then replaces read references for the given values.
I don't like this solution since it doesn't have enough abstraction barrier; if a user-registered read-time constructor doesn't register the callback, for example, the read object contains read-reference objects, which we don't want to expose to the rest of the program.
There are several ways to address this issue.
Implicit pointer swapping
This automates the callback process; if we can know all the pointers in the objects created by read-time constructor, we can trace the objects after all the shared labels are realized, and replaces pointers to the read references for pointers to the real objects.
This would be trivial with precise GC. With Gauche's conservative GC, however, we need to ask user-defined structures with read-time constructors to register pointer layout of the objects. Besides, the structure created by read can be arbitrary large, and I wonder if it can be efficient enough.
Implicit pointer forwarding
If we have something like implicit future, we can wrap the to-be-constructed shared reference by a future to pass to the constructor. Implicit means the existence of the future is transparent to the program; if the code tries to use the future as a value, the system implicitly returns wrapped value. (It somewhat resembles with implicit forcing of promises, but the realization of the value is done by the time read returns, without explicitly forcing it, so calling it a future seems more appropriate. Cf. http://www.ps.uni-saarland.de/alice/manual/futures.html ).
This is probably cleanest solution since side effects are nicely abstracted, and also more general, for implicit futures can be used in various other places.
The obstacle is that we have to insert checks to see whether the given Scheme object is a future everywhere. For Scheme programs it is not a big deal since VM handles it all, but we also have to consider C code that deal with ScmObjs. For primitive aggregate types such as pairs and vectors, C API exposes accessor macros so we can insert future check transparently; but other type of C structs that contains ScmObjs.... I'm not sure we can cover every cases.
Two-phase construction
This idea is to split read-time constructor to allocate phase and initialize phase, like CLOS. It encapsulates side effects into initialize method, which is good. Srfi-10 style read-time constructors can be understood as a combination of allocate and initialize, and we can raise an error if circular reference is involved in structures that don't offer separate allocate and initialize methods.
An issue is that there are structures that adopt single-phase construction protocol. Some built-in objects do, and also the record system (srfi:9, srfi:99) assumes it.
This would work if we can say all objects should be able to be constructed in two phases in Gauche; otherwise it would be very confusing.
★ ★ ★
I'm currently swinging between implicit future and two-phase construction. Is there any other ways?
2010/06/09
(cdr '#1='#1#) => ?
What should (cdr '#1='#1#) print?
This started from a blog post by zick (btw, zick is the author of Manga guide to Lisp.)
He reported Gauche segfaults after printing large number of single quotes. Here are results of other lisp/schemes.
- CLISP (2.42) : stack overflow
- Allegro CL (8.2) : (#1='#1#)
- SBCL (1.0.11) : (#1='#1#)
- MzScheme (v372) : #0=((quote . #0#))
I just fixed Gauche at r7170 so it now works just like MzScheme:
gosh> (cdr '#1='#1#) #0=((quote . #0#))
The bug was trivial in hindsight. The write/ss uses two-pass algorithm: first it scans the structure to find shared objects, then it walks the structure again to print. In the second pass it detects the pattern such as (quote X) and using abbreviated notation:
(let/cc return (if (obj-is-shared? obj) (if (first-time-to-print? obj) (begin ... display #n= ...) (begin ... display #n# ... (return)))) (if (and (pair? obj) (null? (cdr obj)) (eq? (car obj) 'quote)) (begin (display #\') (write/ss (cadr obj))) (begin ...general print... ))
When Gauche tries to write cdr of #1=(quote #1#), the first pass marks the cell whose cdr is '() and whose car is a cell #1=(quote #1#) as a shared object (see the figure below). So it starts writing #0=(, then it recognizes the (quote X) pattern; the bug is that it just recurse to (cadr obj), skipping the check of whether (cdr obj) is shared object or not.
The fix is trivial that just add an extra check in the pattern recognition that cdr of (quote X) is not a shared object, which produces #0=((quote . #0#)).
However, Allegro and SBCL prints the structure differently (aside from their number counting from one): Are they just another notation of the same structure? As shown in the figure in zick's blog (I reproduce the figure below), they should be different.
I'm curious that what algorithm produces Allegro's result. I can't imagine one-pass algorithm, so my guess is that instead of having a separate code path for pass 1 and 2, they may use a single code walker (that handles (quote X) pattern detection) and calls it twice with different modes, once for shared-structure detection and again for output. Then it may fail to recognize that the initial object is the same as cdr of (quote X), because recursing into (cadr obj) will skip that particular cell.
Well, I can check it by looking at SBCL but this is so minor problem so I put it off for someday...
Tag: srfi-38
2010/05/14
Supporting general transformers: Step 1
The development of Gauche is a kind of streched bootstrapping process. Initially, the entire VM, compiler, and most of builtin procedures are written in C (with a bit help of STk to generate stub code). That's because I needed a reasonable performance from the beginning to use Gauche in the production; the initial VM was not well tuned, and if I had written a compiler in Gauche from the beginning, the rudimental version would've been too slow for my purpose.
Gradually I rewrote VM (still in C, but most part is written in a sort of DSL using S-expressions), then the compiler (almost entirely written in Gauche now). The initial optimization target of the compiler was the compiler itself, and it worked well. I'm also in the process of gradually rewriting builtin procedures in Scheme, whenever doing so doesn't affect overall performance.
So the VM and the compiler has been rewritten. But there's one component left untouched: The syntax-rules expander. It's a nasty spaghetti of C code I did't dare to touch. There were portable Scheme expanders, but I was afraid that they were not optimal to run on Gauche---I need a fast macro expander, since Gauche needs to compile on the fly. I planned to write an expander from scratch tuned to take advantage of Gauche's runtime.
The change I committed today is the very first step of rewriting macro subsystem. It still uses old syntax-rules expander, but it decouples the expander from the compiler. The compiler used to recognize syntax-rules as a part of define-syntax etc. That is, the syntax-rules form alone didn't mean anything to the compiler:
gosh> (syntax-rules () [(_ x) 'x]) *** ERROR: unbound variable: x
Now syntax-rules itself is a macro, which evaluates to a macro transformer.
gosh> (syntax-rules () [(_ x) 'x]) #<macro #f>
Syntactic bindings such as define-syntax, let-syntax and letrec-syntax are changed so that it evaluates rhs in the compile-time environment (which is supposed to yield a macro transformer) and creates a syntactic binding to the given name.
★ ★ ★
An interesting outcome is that this change officially supports aliasing syntactic/macro keywords.
Gauche doesn't separate compile-time global bindings and runtime global bindings, so evaluating a syntax/macro keyword reveals the syntax handlers or macro transformers as a first-class value.
gosh> if #<syntax if> gosh> let1 #<macro let1>
And you can rebind those transformers to another global variable as if they are runtime bindings:
gosh> (define xif if) ; don't do this! xif gosh> (xif #f (error "oops") 'ok) ; works like if ok
This is an unintended artifact, relying on the fact that Gauche compiles each toplevel form right before executing it in normal mode of operation. This hasn't been encouraged, however, since it mixes phases and will break unexpectedly when the timing of compilation and execution is changed. In fact, it doesn't work if both form is enclosed in a single toplevel form:
gosh> (begin (define yif if) (yif #f (error "oops") 'ok)) *** ERROR: oops
It also doesn't work if the file with those forms are precompiled. (Precompilation is not officially documented, for there are still unreliable behaviors in general cases. But quite a few built-in procedures and the compiler itself are precompiled into arrays of VM instructions).
With today's change, the rhs of define-syntax etc. can be any Scheme expression as far as it yields a macro transformer (or a syntactic handler; I won't go into details of difference of two for now.) Now, this is a proper way to give an alias to a syntactic/macro keyword:
gosh> (define-syntax zif if)
★ ★ ★
I'm still pondering the interface of macro transformers.
Internally, the current implementation uses a procedure that receives a source form and the compile-time enviornment, and returns an expanded form with possible syntactic annotations. But the compile-time enviornment is a private structure to the compiler and I don't want to expose such internal guts to the programmers, so I want to wrap the transformer with some nice abstraction.
A possible choice is the model defined in R6RS--- a macro transformer is a procedure that receives a syntax object and returns a syntax object. Which itself is ok (I can put the source form and compiler environment together into an opaque object). But the more I read the R6RS, the less I want to follow the spec... There are various ways the transformer is called (whether it receives entire form or just a keyword, or whether the keyword can appear in lhs of set!) and the rules to detemine which one is used looks somewhat ad-hoc. I understand those variations are needed to support identifier macros and variable transformers. I just hope they are designed as utility APIs on top of a simpler axiom, something you can say in one sentence.
Actually, it is not very clear to me that R6RS intends those transformer API as the API, or just one of possible APIs. The definition of transformer procedure is not in the main report at all, but rather in the library report. It's as though this particular definition of transformers are attached to (rnrs syntax-case (6)) library. Is an implementation allowed to provide different transformer interface, e.g. explicit renaming one, if the implementation imports other library? It doesn't look straightforward, since define-syntax etc. need to switch the interpretation of transformer procedures depeding on which library is imported. It would be more reasonable that the transfomer library also provides syntactic binding forms, or provide a intermediate macro to translate whatever system-internal transfomer API into the explicit transformer API:
(define-syntax foo (sc-transformer (lambda (expr) (syntax-case expr ....))))
I think MIT-scheme have something like this, and some other implementations too.
Probably I'll take this approach. My current plan is to provide an explicit renaming transformer as a basis, and build a syntax-case one on top of it. Alex Shinn's Chibi-Scheme does that, I believe. The only question is if I can get it fast enough, and that's what I need to implement it to see.
Comment (1)