Gauche Devlog

2010/05/01

Records and util.match

This is not a new feature for 0.9.1; it's been there for long time. But I rediscover it today with a pleasant surprise, so it may be worth to note about it, for I expect to use records more, as srfi-99 is supported in 0.9.1 (cf. ERR5RS Records and beyond).

The pattern matcher util.match can be used to match records. The feature already existed in the original Andrew Wright's match, and I believe it's available for many other Scheme implementations.

(use gauche.record)
(use util.match)
(use math.const)  ; for pi

(define-record-type circle #t #t radius)

(define-record-type rectangle #t #t width height)

(define (area shape)
  (match shape
    [($ circle r) (* pi r r)]
    [($ rectangle w h) (* w h)]))

gosh> (area (make-circle 5))
78.53981633974483
gosh> (area (make-rectangle 4 8))
32

($ class var ...) matches the instance of class, with binding slot values to var ... in order.

This pattern can be used for any instances, nevertheless I haven't used it much before. I care less about the order of slots when I think in terms of classes. Slots are always accessed by thier name, and I freely change the order of slots in class definitions, so it's a bit cumbersome to have patterns that depend on the order of slots.

However, records' slots feel more positional---maybe because their default constructor takes positional arguments to initialize slots. Also their emphasis of immutability reminds me the style of other functional languages.

If I find myself using this pattern more and more, it'll be worth to optimize it. Currently it looks up n-th slot name in the list of slots of the matching object, then uses slot-ref to obtain a value. For general objects we need to do that, since associations of the position and the slot can be changed by class redefinition. But for records we don't need to worry about class redefinition, and internally it's much faster to access records' slot via its position instead of its name.

Tags: gauche.record, srfi-99, util.match

2010/04/30

Extended formals

One thing I miss most when I hop back to Scheme from CL is CL's lambda list for optional and keyword arguments. I feel the CL's spec is too complicated for my taste (CLHS specifies 10 different kind of lambda lists), nevertheless I know it is useful.

Gauche has been providing argument parsing utility macros, let-optionals*, let-keywords and let-keywords* (ref:let-optionals*, ref:let-keywords*). These cover enough functionalities to deal with optional and keyword arguments. Yet they are different from being able to specify those arguments directly in the formals. One thing is that those macros makes code longer. Another thing is that I feel specifying them directly in the formals somewhat makes them more a part of the public contract of the procedure. They stand out in the source code, claiming that the procedure takes two required argments and three optional arguments, something like that. It makes easier to read the source. Which is precious.

So I've been secretly experimenting the CL-like extended formal list for almost two years. Now I'm convinced that it has enough advantages to be in officially.

Actually, there's a SRFI for extended formals (srfi:89). It introduces new forms, define* and lambda*, that recognize extended syntax for optional and named (keyword) arguments.

Providing new forms is a polite way---it leaves original Scheme intact, and won't step on existing code accidentally. It is highly desirable for a portable library, and understandable that srfi-89 took that path.

However, for this feature, I rather opted to extend define and lambda. Support of extended formals is an upper compatible change (I mean, even with this extention, proper R5RS programs runs just fine), and having different forms only to be polite makes the language unnecessarily complex. Another important advantage is that, by extending existing forms, extended formals will be available for the macros that expands into defines or lambdas. If we were to have different forms, we'd need to change such macros around to make the extended feature available.

This is a kind of decision highly depends on the target of the implementation. I think it is bad to conflate standard syntax if the implementation is for education; students could confuse the language itself and the implementation's specifics. But that's not the target of Gauche.

For what's worth, this extension is attached to define and lambda of the gauche module. The original semantics is still kept if you import null module (corresponds to (null-environment)) or scheme module (corresponds to (scheme-report-enviornment 5)). In other words, Gauche's lambda has different syntactic binding from R5RS's lambda.

★ ★ ★

Gauche's extended formal syntax is similar to Common Lisp's, but I use keywords :optional, :key, :allow-other-keys and :rest instead of CL's &optional, &key, &allow-other-keys and &rest. I saw no point adding extra reserved symbols.

(define (foo x :optional  (y 0) (z 1)) (list x y z))

gosh> (foo 9)
(9 0 1)
gosh> (foo 9 10)
(9 10 1)
gosh> (foo 9 10 11)
(9 10 11)
gosh> (foo 9 10 11 12)
*** ERROR: too many arguments for (lambda (x :optional (y 0) (z 1)) (list x y z))
(define (foo x :key (y 0) (z 1)) (list x y z))

gosh> (foo 9 :z -1)
(9 0 -1)
gosh> (foo 9 :z -1 :zz 3)
*** ERROR: unknown keyword :zz
(define (foo :key (x 0) :allow-other-keys) x)

gosh> (foo :z 9)
0
gosh> (foo :x 8 :y 9)
8

If no default value is given, the variable is bound to #<undef>. (which can be tested with undefined?, but that hasn't been documented. I'll make it public in 0.9.1, too.)

(define (foo :key x) x)

gosh> (foo)
#<undef>

#<undef> is first-class value, so we can't be sure if the argument isn't provided, or the argument is provided but its value happens to be #<undef>. CL solves this problem by allowing extra parameter, supplied-p-parameter, that binds to a boolean value indicating whether the argument is provided. Gauche doesn't support that feature yet.

Internally, these lambdas with extended formals are expanded into the base lambdas and let-optionals*/let-keywords*.

One twist I added is an optional parameter after :allow-other-keys.

(define (foo :key x y :allow-other-keys others) 
   (list x y others))

The parameter others is bound to a keyword-value list that didn't consumed by :key parameters.

gosh> (foo :w 0 :x 1 :y 2 :z 3)
(1 2 (:z 3 :w 0))

It is handy to a procedure that wraps another procedure, and that wants to filter out whatever extra keyword argument it gets.

★ ★ ★

After implementing extended formal list, I rewrote argument parsing macros in my code with this feature. Interestingly, I found let-keywords* etc. weren't obsoleted competely by extended formals. They are still useful when you factor out common option processing:

(define (some-api x y z . options)
  (check-common-options options)
  ...)

(define (another-api a b . options)
  (check-common-options options)
  ...)

(define (check-common-options options)
  (let-keywords* options ((key1 init1) (key2 init2) ...)
    ...))

In CL, I would use destructuring-bind. But lack of that, it has a merit to have argument parsing feature separately from lambda syntax.

Tags: 0.9.1, formals, define, lambda

2010/04/29

Regexp read-write invariance

This is one of those things that appear trivial, but when you try to implement them you find a vine is attached that leads to unexpected places.

Gauche adopts a dedicated syntax for regexp. #/regexp/ is read as a literal regexp, that is, a regexp object is contructed by the reader. Then it is only natural that, when a regexp object is written out, it is represented as #/regexp/. In Gauche 0.9 and before, however, it is not guaranteed for a regexp object to have external representation. Such regexp objects are written as #<regexp>.

You may not have noticed that there are such regexps, since regexp objects created by the normal means have been written out in regexp syntax.

gosh> #/(\d\d):(\d\d):(\d\d)/
#/(\d\d):(\d\d):(\d\d)/
gosh> (string->regexp "(\\d\\d):(\\d\\d):(\\d\\d)")
#/(\d\d):(\d\d):(\d\d)/

This is done trivially by saving the source string in a regexp object. You can access this string by regexp->string.

gosh> (regexp->string #/(\d\d):(\d\d):(\d\d)/)
"(\\d\\d):(\\d\\d):(\\d\\d)"

However, there's an undocumented way to create a regexp object procedurally, without composing a string representation. And some of Gauche internals have used that interface, yielding unprintable regexps.

The construction of regexp objects is not an atomic blackbox, but handled in three steps: regexp-parse parses a string representation of regexp and creates an S-expression representing the AST. regexp-optimize transforms an AST, doing some rudimental optimizations. And finally regexp-compile takes an AST and generates an regexp object, which contains a VM instruction sequence that executes NFA of the regexp. (This VM is a specialized one for regexp, not the same one as Gauche's main VM. I'm musing a vague idea to integrate them for some time.)

If a regexp object is created directly from S-expression, it doesn't have a source string, hence it doesn't have an external representation in 0.9.

I finally hit an occasion that I was annoyed with unprintable regexp, and ended up making a procedure regexp-unparse, that takes AST and constructs a string representation of regrexp. This close the circle. Now regexp->string checks if the regexp object has a source string, and if not, it calls regexp-unparse to construct one. Another API regexp-ast returns an AST of the given regexp object.

★ ★ ★

The original plan was to provide alternative ways to denote regexps, specifically, SRE was one of those alternative syntax I had in my mind. The ways to plug in more sophisiticated regexp optimizer written in Scheme was another plan.

The current AST format is something I hacked up quickly and I've been hoping to come back to redesign it. That's why I've procrastinated documenting regexp-parse etc.

But after finishing regexp-unparse and looking back, it seems to me that the current AST is good enough, although not perfect. Providing the API set to manipulate regexps may open up interesting possibilities for users. So I decided to make it public in 0.9.1.

I'm working on the documentation, but here's a summary of regexp "under-the-hood" API.

  • regexp-parse string :key case-fold => ast
  • regexp-optimize ast => ast
  • regexp-compile ast => regexp
  • regexp-ast regexp => ast
  • regexp-unparse ast :key (on-error :error) => string

Tags: 0.9.1, regexp

2010/04/28

Shorter names

People have different opinions on terse code. I generally agree with Paul Graham's "Succinctness is Power", but I'm less bold than him and tend to take more cautious steps; I'm afraid that optimizing for small domain of applications may put us into a local maximum. For example, APL may be at such a local maximum; probably no other language can beat it on succinctness in the field APL is for, but extending it to broader applications looks almost impossible. (Disclaimer: I've never programmed in APL, just skimmed through several documentations. I may be wrong.)

There are certain things I can introduce to make code shorter without deviating from standard Scheme syntax, and actually I have been trying out some. For 0.9.1, I chose (at this moment) two of such ideas to support officially.

~ : a universal accessor.

One of the pervasive verbosities I see in my Lisp/Scheme code is accessors. Lisp tends to give accessors descriptive names, resulting code that looks like this:

  (slot-ref (vector-ref (hash-table-get pvhash 'prop) 0) 'name)

It has a virtue that the reader can guess pvhash is a hashtable that contains vectors of objects, without type declaration or anything. But... I often feel envy to other languages in which I may be able to write something like this:

  pvhash['prop'][0].name

Gauche already has a generic function ref defined on most aggregate types, so the above expression can be written as follows:

   (ref (ref (ref pvhash 'prop) 0) 'name)

It is shorter, but it doesn't feel much shorter. Maybe it's because the number of nodes or nesting levels remains the same.

Issac Trotts has suggested ref*, which allows chaining. (ref* a b c) = (ref* (ref* a b) c) etc. It hasn't been officially documented but in Gauche since 2006. Using ref*, the expression gets even shorter.

   (ref* pvhash 'prop 0 'name)

However, I found myself not using ref* much. Most of the places where I use accessor, it is not chained, so ref is shorter. And I felt it cumbersome to switch to a ref* when I found I need chanining.

So I looked for shorter abbreviation of ref*. (. x y) will be easier to understand from the convention of C family, but that required changing S-expression syntax (Clojure took this notation, though). (-> x y) and (.. x y) are also easily understandable, but those requires two characters. Ideally I want to cut it down to a single character. I tried (@ x y) for several projects, but ultimately found it stood out too much.

After all, I settled down to ~. There are not many one-character punctuations that are usable, so it is rather chosen by elimination. I think it not bad once you get used to, but if you find it doesn't look good, well, you can always use ref. The above example comes down to this:

   (~ pvhash 'prop 0 'name)

Which is only two characters wider than my hypothetical example:

   pvhash['prop'][0].name

For slot accesses, now I tend to drop the space before the quote of the slot name:

   (~ obj'slot1'slot2'slot3) 

Which I think is not bad compared to these:

   obj.slot1.slot2.slot3
   obj->slot1->slot2->slot3

Oh, and note also that generalized setter works.

   (set!  (~ pvhash 'prop 0 'name) "newname")

This comes very handy when you have to write OO-ish code (I mean, network of mutable objects).

^: an alias of lambda.

Quack has a nice feature to display lambda as λ, but that's an illusion. Your code still contains six-character lambda and opening the code with other editors reveals that. Seasoned Lisper/Schemers are trained to recognize these six characters as one chunk, so it may not be much a problem for reading, but still it takes screen estate. I'm an old type who feels awkward when code doesn't fits in 80-columns. Having six characters for lambda tends to make my line longer which makes me insert more line breaks, which resulting vertically streched code. Arrrggg.

Gauche doesn't have a problem to treat λ as an alias of lambda, but typing λ is not very convenient, probably except for Greeks.

So again, I looked for punctuation characters I can use. This time ^ seemed a good choice. Actually, the use of λ for functional abstraction originally came from a caret, I was told.

Combined with ~, I found I could cram more logic to a single line.

(fold (^(block sum) (+ (or (~ block'recv-total) 0) sum)) 0 (~ db'blocks))

I also define ^a, ^b, ... ^z, ^_ as a macro for the case of single argument function. That is, (^p (string? (cdr p))) is the same as (lambda (p) (string? (cdr p))). (Note that in this case I can't use cut (ref:cut)).

I've been testing these for, well, probably more than two years, and I feel they have a merit. They can also be implemented relatively easily in portable Scheme, so they don't break portability significantly (compared to introducing a new syntax, such as Clojure's use of vectors as shorthand notation of closures).

Tags: 0.9.1, ~, ^

2010/04/27

Blowfish password hashing

Ever since I read this article on password hashing, which was more than two years ago, I wanted to bundle a proper password hashing library to Gauche. I finally managed to put bcrypt in, as crypt.bcrypt module. (ref:crypt.bcrypt : This link will be valid after 0.9.1 release).

The bcrypt library does password encryption with adaptive hashing, using blowfish algorithm. (Do not confuse it with another program also called bcrypt, which also uses blowfish but it encrypts files.) They say it is the default password hashing used on OpenBSD. Most popular lightweight languages have an extension package to use bcrypt password hashing, e.g. bcrypt-ruby for Ruby, py-bcrypt for Python, or Crypt::Eksblowfish::Bcrypt for Perl.

I decided to include crypt.bcrypt in the Gauche distribution instead of making it an optional extension package. Gauche has sys-crypt (ref:sys-crypt) which is a direct interface to system's crypt(3) function, and also a few popular hash functions such as MD5 and SHA1/SHA2 (ref:rfc.md5, ref:rfc.sha). If crypt.bcrypt is something you have to install separately, it would be very tempting to use one of those coming with Gauche by default instead. I confess I have used MD5 before for an web app.

I'm totally convinced by the Thomas Ptacek's article I linked at the top of this post; original unix crypt(3) or even MD5 or SHA shouldn't be used for password hashing anymore. It's not so much because MD5 is a weak hash; it's because password hashing requires different property than data hashing schemes such as SHA. See the article for more details. But even if somebody doesn't agree or doesn't care, there's not much reason to use weaker schemes if the proper hashing library also comes by default.

I took bcrypt code from http://www.openwall.com/crypt/ , which is provided in public domain. So it doesn't add any external dependency.

Using bcrypt is very easy. You only need to remember just one function, bcrypt-hashpw. To calculate a password hash, call it with the password string:

  (bcrypt-hashpw password-string)  => hashed-string

The library automatically takes care of salting. To check if the given password is correct, pass the given password and hashed string. If returned string matches the provided hashed string, the password is correct.

  (bcrypt-hashpw password-string hashed-string) => hashed-string

There's also a function that you can calcurate a secure salt if you want.

Tags: 0.9.1, crypt.bcrypt

More entries ...