Gauche Devlog

< Shorter names | Extended formals >

2010/04/29

Regexp read-write invariance

This is one of those things that appear trivial, but when you try to implement them you find a vine is attached that leads to unexpected places.

Gauche adopts a dedicated syntax for regexp. #/regexp/ is read as a literal regexp, that is, a regexp object is contructed by the reader. Then it is only natural that, when a regexp object is written out, it is represented as #/regexp/. In Gauche 0.9 and before, however, it is not guaranteed for a regexp object to have external representation. Such regexp objects are written as #<regexp>.

You may not have noticed that there are such regexps, since regexp objects created by the normal means have been written out in regexp syntax.

gosh> #/(\d\d):(\d\d):(\d\d)/
#/(\d\d):(\d\d):(\d\d)/
gosh> (string->regexp "(\\d\\d):(\\d\\d):(\\d\\d)")
#/(\d\d):(\d\d):(\d\d)/

This is done trivially by saving the source string in a regexp object. You can access this string by regexp->string.

gosh> (regexp->string #/(\d\d):(\d\d):(\d\d)/)
"(\\d\\d):(\\d\\d):(\\d\\d)"

However, there's an undocumented way to create a regexp object procedurally, without composing a string representation. And some of Gauche internals have used that interface, yielding unprintable regexps.

The construction of regexp objects is not an atomic blackbox, but handled in three steps: regexp-parse parses a string representation of regexp and creates an S-expression representing the AST. regexp-optimize transforms an AST, doing some rudimental optimizations. And finally regexp-compile takes an AST and generates an regexp object, which contains a VM instruction sequence that executes NFA of the regexp. (This VM is a specialized one for regexp, not the same one as Gauche's main VM. I'm musing a vague idea to integrate them for some time.)

If a regexp object is created directly from S-expression, it doesn't have a source string, hence it doesn't have an external representation in 0.9.

I finally hit an occasion that I was annoyed with unprintable regexp, and ended up making a procedure regexp-unparse, that takes AST and constructs a string representation of regrexp. This close the circle. Now regexp->string checks if the regexp object has a source string, and if not, it calls regexp-unparse to construct one. Another API regexp-ast returns an AST of the given regexp object.

★ ★ ★

The original plan was to provide alternative ways to denote regexps, specifically, SRE was one of those alternative syntax I had in my mind. The ways to plug in more sophisiticated regexp optimizer written in Scheme was another plan.

The current AST format is something I hacked up quickly and I've been hoping to come back to redesign it. That's why I've procrastinated documenting regexp-parse etc.

But after finishing regexp-unparse and looking back, it seems to me that the current AST is good enough, although not perfect. Providing the API set to manipulate regexps may open up interesting possibilities for users. So I decided to make it public in 0.9.1.

I'm working on the documentation, but here's a summary of regexp "under-the-hood" API.

  • regexp-parse string :key case-fold => ast
  • regexp-optimize ast => ast
  • regexp-compile ast => regexp
  • regexp-ast regexp => ast
  • regexp-unparse ast :key (on-error :error) => string

Tags: 0.9.1, regexp

Post a comment

Name: