2010/04/29

Regexp read-write invariance

This is one of those things that appear trivial, but when you try to implement them you find a vine is attached that leads to unexpected places.

Gauche adopts a dedicated syntax for regexp. #/regexp/ is read as a literal regexp, that is, a regexp object is contructed by the reader. Then it is only natural that, when a regexp object is written out, it is represented as #/regexp/. In Gauche 0.9 and before, however, it is not guaranteed for a regexp object to have external representation. Such regexp objects are written as #<regexp>.

You may not have noticed that there are such regexps, since regexp objects created by the normal means have been written out in regexp syntax.

gosh> #/(\d\d):(\d\d):(\d\d)/
#/(\d\d):(\d\d):(\d\d)/
gosh> (string->regexp "(\\d\\d):(\\d\\d):(\\d\\d)")
#/(\d\d):(\d\d):(\d\d)/

This is done trivially by saving the source string in a regexp object. You can access this string by regexp->string.

gosh> (regexp->string #/(\d\d):(\d\d):(\d\d)/)
"(\\d\\d):(\\d\\d):(\\d\\d)"

However, there's an undocumented way to create a regexp object procedurally, without composing a string representation. And some of Gauche internals have used that interface, yielding unprintable regexps.

The construction of regexp objects is not an atomic blackbox, but handled in three steps: regexp-parse parses a string representation of regexp and creates an S-expression representing the AST. regexp-optimize transforms an AST, doing some rudimental optimizations. And finally regexp-compile takes an AST and generates an regexp object, which contains a VM instruction sequence that executes NFA of the regexp. (This VM is a specialized one for regexp, not the same one as Gauche's main VM. I'm musing a vague idea to integrate them for some time.)

If a regexp object is created directly from S-expression, it doesn't have a source string, hence it doesn't have an external representation in 0.9.

I finally hit an occasion that I was annoyed with unprintable regexp, and ended up making a procedure regexp-unparse, that takes AST and constructs a string representation of regrexp. This close the circle. Now regexp->string checks if the regexp object has a source string, and if not, it calls regexp-unparse to construct one. Another API regexp-ast returns an AST of the given regexp object.

★ ★ ★

The original plan was to provide alternative ways to denote regexps, specifically, SRE was one of those alternative syntax I had in my mind. The ways to plug in more sophisiticated regexp optimizer written in Scheme was another plan.

The current AST format is something I hacked up quickly and I've been hoping to come back to redesign it. That's why I've procrastinated documenting regexp-parse etc.

But after finishing regexp-unparse and looking back, it seems to me that the current AST is good enough, although not perfect. Providing the API set to manipulate regexps may open up interesting possibilities for users. So I decided to make it public in 0.9.1.

I'm working on the documentation, but here's a summary of regexp "under-the-hood" API.

regexp-parse string :key case-fold => ast
regexp-optimize ast => ast
regexp-compile ast => regexp
regexp-ast regexp => ast
regexp-unparse ast :key (on-error :error) => string

Tags: 0.9.1, regexp

Comments (0)

2010/04/28

#permalink

Shorter names

People have different opinions on terse code. I generally agree with Paul Graham's "Succinctness is Power", but I'm less bold than him and tend to take more cautious steps; I'm afraid that optimizing for small domain of applications may put us into a local maximum. For example, APL may be at such a local maximum; probably no other language can beat it on succinctness in the field APL is for, but extending it to broader applications looks almost impossible. (Disclaimer: I've never programmed in APL, just skimmed through several documentations. I may be wrong.)

There are certain things I can introduce to make code shorter without deviating from standard Scheme syntax, and actually I have been trying out some. For 0.9.1, I chose (at this moment) two of such ideas to support officially.

~ : a universal accessor.

One of the pervasive verbosities I see in my Lisp/Scheme code is accessors. Lisp tends to give accessors descriptive names, resulting code that looks like this:

  (slot-ref (vector-ref (hash-table-get pvhash 'prop) 0) 'name)

It has a virtue that the reader can guess pvhash is a hashtable that contains vectors of objects, without type declaration or anything. But... I often feel envy to other languages in which I may be able to write something like this:

  pvhash['prop'][0].name

Gauche already has a generic function ref defined on most aggregate types, so the above expression can be written as follows:

   (ref (ref (ref pvhash 'prop) 0) 'name)

It is shorter, but it doesn't feel much shorter. Maybe it's because the number of nodes or nesting levels remains the same.

Issac Trotts has suggested ref*, which allows chaining. (ref* a b c) = (ref* (ref* a b) c) etc. It hasn't been officially documented but in Gauche since 2006. Using ref*, the expression gets even shorter.

   (ref* pvhash 'prop 0 'name)

However, I found myself not using ref* much. Most of the places where I use accessor, it is not chained, so ref is shorter. And I felt it cumbersome to switch to a ref* when I found I need chanining.

So I looked for shorter abbreviation of ref*. (. x y) will be easier to understand from the convention of C family, but that required changing S-expression syntax (Clojure took this notation, though). (-> x y) and (.. x y) are also easily understandable, but those requires two characters. Ideally I want to cut it down to a single character. I tried (@ x y) for several projects, but ultimately found it stood out too much.

After all, I settled down to ~. There are not many one-character punctuations that are usable, so it is rather chosen by elimination. I think it not bad once you get used to, but if you find it doesn't look good, well, you can always use ref. The above example comes down to this:

   (~ pvhash 'prop 0 'name)

Which is only two characters wider than my hypothetical example:

   pvhash['prop'][0].name

For slot accesses, now I tend to drop the space before the quote of the slot name:

   (~ obj'slot1'slot2'slot3)

Which I think is not bad compared to these:

   obj.slot1.slot2.slot3
   obj->slot1->slot2->slot3

Oh, and note also that generalized setter works.

   (set!  (~ pvhash 'prop 0 'name) "newname")

This comes very handy when you have to write OO-ish code (I mean, network of mutable objects).

^: an alias of lambda.

Quack has a nice feature to display lambda as λ, but that's an illusion. Your code still contains six-character lambda and opening the code with other editors reveals that. Seasoned Lisper/Schemers are trained to recognize these six characters as one chunk, so it may not be much a problem for reading, but still it takes screen estate. I'm an old type who feels awkward when code doesn't fits in 80-columns. Having six characters for lambda tends to make my line longer which makes me insert more line breaks, which resulting vertically streched code. Arrrggg.

Gauche doesn't have a problem to treat λ as an alias of lambda, but typing λ is not very convenient, probably except for Greeks.

So again, I looked for punctuation characters I can use. This time ^ seemed a good choice. Actually, the use of λ for functional abstraction originally came from a caret, I was told.

Combined with ~, I found I could cram more logic to a single line.

(fold (^(block sum) (+ (or (~ block'recv-total) 0) sum)) 0 (~ db'blocks))

I also define ^a, ^b, ... ^z, ^_ as a macro for the case of single argument function. That is, (^p (string? (cdr p))) is the same as (lambda (p) (string? (cdr p))). (Note that in this case I can't use cut (ref:cut)).

I've been testing these for, well, probably more than two years, and I feel they have a merit. They can also be implemented relatively easily in portable Scheme, so they don't break portability significantly (compared to introducing a new syntax, such as Clojure's use of vectors as shorthand notation of closures).

Tags: 0.9.1, ~, ^

Comments (8)

2010/04/27

#permalink

Blowfish password hashing

Ever since I read this article on password hashing, which was more than two years ago, I wanted to bundle a proper password hashing library to Gauche. I finally managed to put bcrypt in, as crypt.bcrypt module. (ref:crypt.bcrypt : This link will be valid after 0.9.1 release).

The bcrypt library does password encryption with adaptive hashing, using blowfish algorithm. (Do not confuse it with another program also called bcrypt, which also uses blowfish but it encrypts files.) They say it is the default password hashing used on OpenBSD. Most popular lightweight languages have an extension package to use bcrypt password hashing, e.g. bcrypt-ruby for Ruby, py-bcrypt for Python, or Crypt::Eksblowfish::Bcrypt for Perl.

I decided to include crypt.bcrypt in the Gauche distribution instead of making it an optional extension package. Gauche has sys-crypt (ref:sys-crypt) which is a direct interface to system's crypt(3) function, and also a few popular hash functions such as MD5 and SHA1/SHA2 (ref:rfc.md5, ref:rfc.sha). If crypt.bcrypt is something you have to install separately, it would be very tempting to use one of those coming with Gauche by default instead. I confess I have used MD5 before for an web app.

I'm totally convinced by the Thomas Ptacek's article I linked at the top of this post; original unix crypt(3) or even MD5 or SHA shouldn't be used for password hashing anymore. It's not so much because MD5 is a weak hash; it's because password hashing requires different property than data hashing schemes such as SHA. See the article for more details. But even if somebody doesn't agree or doesn't care, there's not much reason to use weaker schemes if the proper hashing library also comes by default.

I took bcrypt code from http://www.openwall.com/crypt/ , which is provided in public domain. So it doesn't add any external dependency.

Using bcrypt is very easy. You only need to remember just one function, bcrypt-hashpw. To calculate a password hash, call it with the password string:

  (bcrypt-hashpw password-string)  => hashed-string

The library automatically takes care of salting. To check if the given password is correct, pass the given password and hashed string. If returned string matches the provided hashed string, the password is correct.

  (bcrypt-hashpw password-string hashed-string) => hashed-string

There's also a function that you can calcurate a secure salt if you want.

Tags: 0.9.1, crypt.bcrypt

Comments (0)

2010/04/26

#permalink

Enhanced queue

util.queue module (ref:util.queue) has been completely rewritten for 0.9.1.

The main motivation is to support thread-safe queue (mtqueue). Last couple of years I had quite a few projects that needed to fully utilize multiple cores, and for almost every one of them I ended up writing some kind of thread-safe queue. It is a simple but robust way as a synchronization primitive. Naturally the feature is a strong candidate for Gauche to provide natively.

Writing a thread-safe queue is seemingly a simple task. Yet it took some time for me to come up an API I feel right.

Just protecting access to a queue by a mutex makes the queue thread-safe, but it doesn't make the queue usable for synchronization; we want a consumer thread to block when it finds the queue is empty, and to resume automatically as soon as some data is available in the queue.

Providing a simple-minded thread-safety, that just protects access by a mutex, isn't a good strategy. It doesn't help much to build synchronization operations, since the latter needs a condition variable to cooperate with the mutex of the queue. Such a strategy is a dead-end of abstraction; it does abstract something, but does not useful to build more abstractions on top of it.

So, we use a condition variable and let a consumer thread block on an empty queue. Is that enough? Not really.

If a producer thread fails for some reason, the consumer thread would wait indefinitely. It is usually unacceptable in the production code; the consumer thread must detect abnormality and try to recover, report to supervisor, or shut itself down gracefully. There's also a situation that a consumer thread fails; in such case a producer thread would just keep pushing the data into the queue until it exhausts memory, which is also undesirable.

I explored some options. In one option, a queue had low and high water marks, and a callback was called when the length goes below the low mark or above the high mark, and it adjusted the pace of either a consumer or a producer threads. It worked for the project, but I felt it too complicated for general use.

Finally I ended up on a simple concept of timeouts, plus optional upper-bound of the queue length. The API is parallel to other timeout-able thread operations so that it's easy to remember, and I think it is general enough so that more sophisticated stuff can be built on top of it.

Here's an outline of synchronizing mtqueue operations. See the manual in svn trunk if you are curious to know the details.

make-mtqueue :key max-length
Returns a thread-safe queue, with optional upper bound of the queue length.
dequeue/wait! mtqueue :optional (timeout #f) (timeout-val #f)
Pop an item from mtqueue, or block if the queue is empty, with optional timeout. The semantics of timeout and timeout-val are the same as thread-join! etc. (ref:thread-join!).
enqueue/wait! mtqueue obj :optional (timeout #f) (timeout-val #f)
Push obj into the queue, or block if the queue has maximum number of elements, with optional timeout.

Note that there are also dequeue! and enqueue! that do not block. They work on both mtqueue and normal (non-thread-safe, but cheap) queue. dequeue! throws an error by default if the queue is empty, though you can suppress it by providing a fallback value.

Tags: 0.9.1, util.queue

Comments (0)

2010/04/26

#permalink

Better REPL?

Gosh's default REPL provides bare minimum features. I almost always run gosh inside Emacs and usually it's fine, but sometimes I do feel it can be better.

So I'm experimenting ideas here: http://gauche.svn.sourceforge.net/viewvc/gauche/Gauche-scripts/trunk/xrepl/

Here are some stuff I have. I'll use it for a while to see if it's worth to be included into gosh or not. Other ideas and feedbacks are welcome.

Prettyprinting the result

It has rudimental pretty printer (which is a bit dumb; it may take exponential time for large S-expr that I need to fix). Ideally pretty printer should be a part of Gauche, integrated with built-in format etc.

gosh> (call-with-input-string
          (values-ref (http-get "blog.practical-scheme.net" "/gauche") 2)
        (cut ssax:xml->sxml <> '()))
(*TOP*
 (html
  (head
   (title "Gauche Devlog")
   (base (|@| (href "http://blog.practical-scheme.net/gauche")))
   (link
    (|@| (type "application/rss+xml") (title "RSS") (rel "alternate")
     (href "http://blog.practical-scheme.net/gauche?c=rss")))
   (link (|@| (type "text/css") (rel "stylesheet") (href "blog.css")))
   (link (|@| (type "text/css") (rel "stylesheet") (href "gauche-devlog.css"))))
  (body
...

History

Accessing recent evaluation results. This feature has been sitting long in my wishlist, but I couldn't make up my mind of what symbols I would use; CL uses *, **, ***, ... but in Scheme * would conflict with the built-in function. I finally decided to adapt Clojure's convention; *1 refers to the last evaluated result, *2 for the second last, etc.

gosh> *1
(*TOP*
 (html
  (head
   (title "Gauche Devlog")
...
gosh> (length (cadr *1))
3

The procedure *history prints the list of history.

*1, *2, ... only keeps the first value of the result(s). In case if we have more than one values, *1+, *2+, ... keeps the list of values.

If the last evaluation ends as an error, the condition object is bound to *e.

gosh> (car 3)
*** ERROR: pair required, but got 3
Stack Trace:
_______________________________________
  0  (with-error-handler (lambda (e) (let ((e e)) (%guard-rec e e (else ...
        [unknown location]
gosh> *e
#<error "pair required, but got 3">

Shell-like commands

You can type shell commands by leading it with a comma:

gosh> ,cd ~/src/Gauche
/home/shiro/src/Gauche
gosh> ,ls
AUTHORS       README          configure.ac       ltmain.sh
COPYING       VERSION         configure.ac.orig  m4/
ChangeLog     acinclude.m4    doc/               missing*
DIST*         aclocal.m4      examples/          mkinstalldirs*
DIST_EXCLUDE  config.guess*   ext/               rpmfiles-common.txt
Gauche.spec   config.log      framework.sh*      rpmfiles-encoding.txt
HACKING       config.rpath*   gc/                rpmfiles-gdbm.txt
INSTALL.in    config.status*  gc.diff            src/
Makefile      config.sub*     install-sh*        test/
Makefile.in   config.threads  lib/               test.record
NEWS          configure*      libsrc/            winnt/
gosh> ,make -j
for d in gc src lib ext doc; do (cd $d; make all); done
make[1]: Entering directory `/home/shiro/src/Gauche/gc'
...

Sometimes I'm too lazy to switch from *scheme* buffer to *shell* buffer (or, more like that I forget to switch and am annoyed by getting 'unbound variable: ls').

It would be nicer if I can easily grab the shell output into Scheme variable, and feed the content of Scheme variable into shell commands.

Smarter (or verbose) error handling

I'm not sure I'll keep this one, but giving a try.

"." isn't in *load-path* by default, for the obvious security reasons. But sometimes I forget to type "./foo.scm" and being annoyed, for I know foo.scm in the current dir is safe. So, here's a little help...

gosh> (load "t.scm")
*** REPL: Cannot load file "t.scm" in *load-path* ("../lib" "../libsrc" "../src" "/usr/share/gauche/site/lib" "/usr/share/gauche/0.9.1_pre1/lib" "/usr/share/gauche/0.9/lib" "/home/shiro/src/Gauche-scripts/xrepl/")
But you have the file under the current directory.
Do you want to add "." in *load-path*? (y/n):

This is actually implemented in more general "error filter" framework. When REPL gets an error, a series of registered tests are run, and if there's a match, an associated action is invoked.

This kind of thing may be too annoying (in fact, I don't even like CL implementations entering the debugger on error by default.) But I'll see.

Tag: repl

Comments (0)

More entries ...

Gauche Devlog

About

Author

Recent Entries

Recent Comments

Recent Releases

2010/04/29

Regexp read-write invariance

2010/04/28

Shorter names

~ : a universal accessor.

^: an alias of lambda.

2010/04/27

Blowfish password hashing

2010/04/26

Enhanced queue

2010/04/26

Better REPL?

Prettyprinting the result

History

Shell-like commands

Smarter (or verbose) error handling