2011/11/20

Two-argument reverse

Again I kept silence for over 4 months. I released 0.9.2 in August. Since then, quite a few changes are made into the development HEAD, and I'd like to start explaining them as I approach to release 0.9.3.

Some of the changes are small in the code, but can have great impact---meaning, it may change how you code in Gauche. However, they take a lot of words to explain; I need some more time.

For today, I pick a rather small topic to get up to the speed. It may even seem too trivial to mention, but somehow, it matters to me. Maybe because it is related to how I approach to a program.

★ ★ ★

A common pattern to transform a list is to tail-recurse the input with consing the results, and at the end reverse the result list to return:

  (let loop ([xs xs] [rs '()])
    (if (null? xs)
      (reverse rs)
      (loop (cdr xs) (cons (do-something (car xs)) rs))))

It is not necessarily a good style. Olin Shivers commented in his srfi-1 reference implementation that, this style is just trading execution stack for cons cells, and in stack-based system you should rather use a natural (non-tail) recursion:

  (let rec ([xs xs])
    (if (null? xs)
      '()
      (cons (do-something (car xs)) (rec (cdr xs)))))

In Gauche, the latter is just as fast as the former to a list up to about a thousand elements. However, if the input exceeds the limit and recursion gets too deep, Gauche starts moving the stack frames to the heap (a kind of stack GC), so it gets worse than using intermediate lists and reverse at the end. So I typically use cons-then-reverse idiom in Gauche.

Now, it is also common to write a function that adds the result to the existing pile of data. If we do non-tail recursive calls, it is trivial; I just replace () with the existing pile of data:

  (let rec ([xs xs])
    (if (null? xs)
      existing-pile-of-data
      (cons (do-something (car xs)) (rec (cdr xs)))))

If we use tail-recursive call and reverse, however, we need to reverse the accumulated results, then append the existing pile of data. Srfi-1 has a procedure specifically for this pattern, append-reverse:

  (let loop ([xs xs] [rs '()])
    (if (null? xs)
      (append-reverse rs previous-results)
      (loop (cdr xs) (cons (do-something (car xs)) rs))))

There's nothing wrong with this code; the append-reverse procedure is meant to avoid unnecessary copying. Except that I always feel something awkward in the name: I feel it somewhat miss the point by thinking the operation as "reverse, then append".

Look at this:

(reverse xs)             == (fold cons '() xs)
(append-reverse xs init) == (fold cons init xs)

It looks like reverse is really a special case of append-reverse with the init to be (). Thinking the operation as "reverse, then append" is actually upside-down. It may be more natural to think the operation represented by append-reverse as a fundamental one, and reverse is a shorthand for the special case.

Then, wouldn't it be reasonable to extend reverse this way?

(define (reverse xs :optional (tail '()))
  (fold cons tail xs))

If you stick to the idea that reverse is an operation to reverse the order of the given list, this extension seems contrived. However, how about thinking reverse as a list constructor, which happens to take initial content from xs but in the reverse order?

Actually, 0.9.2's reverse is already extended to take an optional tail argument, though it's not documented.

Tradition has great inertia, so I don't expect this extension gets widely accepted; after all, you can use srfi-1's append-reverse if you want this functionality. But I did get a little piece of mind with this extension.

Tags: reverse, srfi-1, append-reverse

Comments (0)

2011/07/18

#permalink

Unicode Case-mapping Support and Character Set Independence

In my local branch, I've been working on enhancing case-mapping procedures such as char-upcase and string-downcase to cover the entire character range. I expect I can merge it into the master branch soon.

As of 0.9.1 Gauche's case mapping procedures only handles ASCII range, and pass through any characters beyond it. But R6RS and R7RS (draft) both define those case-mapping procedures in terms of Unicode definitions, so it's about time to modernize Gauche's character handling as well.

It isn't much trouble to generate tables from Unicode character database, although there's room to optimize them for speed and/or space (I'll post another entry to explain the strategy I took).

The issues

The trouble is that Gauche is designed to be independent from the internal character encoding scheme as much as possible. While the internal encoding needs to be chosen at compile-time, the code that depends on a specific encoding is small and mostly confined in gauche/char_*.h files.

Case-mapping adds nontrivial amount of code; it's not just a simple table of character-to-character mapping. In string case-mapping (e.g. string-downcase) each character's mapping is context-dependent, and to properly map the cases, we need to find word boundaries. That means we have to implement text segmentation, which is defined in UAX #29. That, in turn, depends on the character property database.

(The complication of this issue may be illustrated by this:

gosh> (string-downcase "ΧΑΟΣΧΑΟΣ.ΧΑΟΣ. Σ.")
"χαοσχαοσ.χαος. σ."

According to UAX#29 and R6RS, only the third sigma should take the final form.)

If I want to keep encoding-dependent code separate, I need to prepare those tables for each supported encodings and choose one at the compile time according to the configuration option of internal encoding. The tables can be generated mechanically, but having more code that selectively used by compiler options means more testing and maintenance headache. I don't like it.

Furthermore, full support of Unicode-defined algorithms such as text segmentation and normalization are useful even if Gauche's internal encoding is not utf-8. For example, if Gauche's internal encoding is euc-jp, then only a subset of Unicode characters are available for Scheme characters and strings, but such program can apply Unicode algorithm on a sequence of unicode codepoints (e.g. vector of integers). If I implement the algorithm anyway, there's not much point to limit access to it just because the characters aren't encoded in Unicode.

So, there are two issues regarding Gauche's character set independence and Unicode algorithms:

How much of the code should be separated as 'encoding-dependent' part? If I make it completely separated, it's architectually clean, but may become maintenance burden.
Should I make the algorithm itself available regardless of Gauche's internal encoding? There's no reason not to, but doing so means I need to implement algorithms independent from Gauche's internal character representation.

The compromise

I haven't found a clear-cut solution. So far my implementation became hybrid; I lost clear separation of dependence of internal character set. I had to add #ifdefs here and there to handle special cases for specific encodings, mostly because of optimization. That is, the efficient table configurations differ depending on the encoding, so the lookup code needs to be changed as well.

I also chose to generate some tables only for Unicode codepoints. If the internal encoding differ from Unicode, I convert the internal character code to Unicode codepoint to look up the tables. It penalizes EUC-JP and Shift_JIS encodings, but it has a merit of having one table and (mostly same) lookup code. I can also use the table for procedures that work on Unicode codepoint, not on Scheme characters; so that the algorithms can be used regardless of Gauche's internal encoding.

So far, the character general category table is the only table I generate for each internal encoding. The parser (read) needs to look it up, since the symbol constituent characters are defined in terms of character general categories (at least in R6RS). I don't want the symbol reader to convert each character just to determine the boundary.

For the rest of lookup tables, including case mapping, I only generate tables specialized to be looked up by Unicode codepoints. If the internal encoding is EUC-JP or Shift_JIS, the case mapping functions does conversion. The case conversion table has a certain structure that highly depends on the Unicode character distribution, so it needs some work to regenerate it for different encodings. If the performance becomes an issue I'll consider using separate tables.

(NB: The ASCII range are the same on all the supported encodings, so I shortcut if the given character is in the range. The encoding conversion only occur when the given character is outside of it. I might further improve the performance by excluding the range of the characters that is known to have no cases from looking up the table, for the case conversion routines is identity for such characters.)

CSI---is it still worth?

It would have been a lot simpler if I fixed the internal encoding to utf-8 (or other Unicode encodings). Single set of tables, and no separate API for characters and codepoints.

Unicode is not just a coded character set (CCS). It comes with a set of definitions of higher-level operations such as case conversions, text segmentations, normalizations, etc., defined on top of the CCS. More and more applications assume these high-level operations available.

Originally I decided to support choices of internal encodings since I wanted to keep options open; that is, I didn't want to be bound to a specific internal encodings. It's just a hunch that there might be some areas where alternative encodings would be beneficial. In fact it was useful, back in 2001, when almost all files I had, and my Emacs shell buffer, were in EUC-JP. And I was having headaches when any one of tools in the toolchain didn't handle Unicode well and caused wreak havoc.

These days most of tools just works with Unicode. That imaginary benefit margin of alternative encodings gets thinner, and if I penalize them by performance of higher-level character and string operations, whatever remaining margin would go away---it'd be just simpler and faster to convert everything to Unicode internally.

Well, many pro-Unicode people would say "I told you so." Yeah, I admit it. If I keep supporting the alternative encodings, at some point I'll need to reorganize the code so that it allows me to implement various Unicode algorithms in codeset-independent way. Probably by translating the unicode character database fully into the target encoding, or something. At the moment I'll go with the current half-baked solution, for it satisfy the immediate needs. Let's see how it works out in the actual applications.

So far almost all language implementations seems to adopt Unicode. A notable exception is CRuby which allows string objects with different encodings coexists at the runtime. I wonder how CRuby developers will support, say, Unicode text segmentation, normalization, bidi, etc. on their multi character-set strings.

Tags: text.unicode, Unicode, r6rs, r7rs, encoding

Comments (0)

2011/07/11

#permalink

Generic file locking---is it really about files?

Gauche provides POSIX file locking through gauche.fcntl module, but the feature isn't available on all platforms. Providing a locking API that's always available has been on my TODO list for years.

And it's finally in. In file.util module, now you have with-lock-file. It can be used like this:

  (with-lock-file lock-file-name
    (lambda () ... do something exclusively ...))

It uses a classic inter-process locking relying on the exclusive file creation. Optionally you can use atomicity of mkdir.

The function make sure if you throw an exception in the thunk it unlocks the file, but if the application dies unexpectedly, it leaves the lock file (or directory). If it finds a lock file that's very old, you may allow it to steal the lock, assuming that the creator of the lock file died without cleaning it up. It takes quite a few parameters to configure how to handle various situations.

★ ★ ★

Originally I planned to provide a lock API that abstracts underlying lock mecanism---either POSIX fcntl file locking or using lock files (or directory/symlink/...). In fact, I attempted to write one several times. However, the more I think about it, the less the idea become appealing. POSIX fcntl locking and lockfile-based locking differ so much and attempts to hide the difference always lead to nowhere.

The biggest obstacle was that hiding POSIX file locking under the hood turned out to be an extremely difficult job, if not impossible. That is, if a locking library is using POSIX file lock down below, the user of the library must be very aware of it.

That's because, to use POSIX lock robustly, you have to make sure to meet two conditions at least: (1) Making sure the fcntl lock is actually supported on a particular file, and (2) making sure that the file to lock won't be accessed by other libraries/threads in the same process. (The latter condition came from the fact that POSIX fcntl lock is per process, not per file descriptor. See, for example, A tale of two standards and On the Brokenness of File Locking for the detailed discussions.)

These conditions aren't necessarily a show-stopper for applications---at an application level, you may be able to choose a file to lock that's likely to satisfy them. At a library level, however, it is not trivial to check if the given file is really on a filesystem that supports POSIX lock, and it is probably impossible to make sure the file won't be accessed by other parts of the program.

There are also a conceptual difference between POSIX fcntl lock and lock-file based locking---the former is to lock a file, but the latter is not really about a file. It's a way to mutex multiple processes. Of course you can use shared file lock as inter-process mutex and vice versa, but conflating these two makes me feel uneasy.

So, with-lock-file ended up to be based on the lock-file/directory. It's not very fast (involves multiple syscalls and file access) but I expect it to cover most of trivial cases to avoid race between processes.

Tags: file.util, gauche.fcntl, FileSystem, InterProcessCommunication

Comments (0)

2011/07/10

#permalink

Quasiquoting

As of 0.9.1 Gauche doesn't handle quasiquote patterns something like this:

gosh> ``(,,@(list 1 2))
`(,,@(list 1 2))

It may be arguable a bug, or an unspecified behavior in R5RS. The outer-level quasiquote should expand unquote-splicing, and that should produce (unquote 1 2); however, quasiquote syntax in R5RS does not allow such pattern, so the resulting form wouldn't be a valid quasiquotation. It doesn't explain the above result, but if it's an error in R5RS term, anything can happen. (In fact, the result wasn't intended at all; it was just an outcome of failing recognition of the pattern.)

Still, it is allowed to have (unquote 1 2) as a pure datum, it may be more reasonable that ``(,,@(list 1 2)) expands into `((unquote 1 2)).

R6RS solved this ambiguity by allowing unquote to have multiple arguments. Since it is upper compatible to R5RS, I decided to implement R6RS behavior in Gauche. Now you get this:

gosh> ``(,,@(list 1 2))
`((unquote 1 2))

And this:

gosh> `((unquote 1 2))
(1 2)

★ ★ ★

While thinking about this, I found a difference between Scheme and CL about quasiquoting that I was not aware before.

In Scheme, `a, ,a, ,@a are merely a reader-level abbreviation of (quasiquote a), (unquote a) and (unquote-splicing a), respectively. The real semantics of quasiquoting is defined in terms of the latter S-expressions.

In CL, quasiquoting (bakcquotes) are defined in reader level. That is, comma-datum doesn't need to be an abbreviation of something like (unquote datum) or (comma datum). It doesn't need to have list equivalents at all. The comma and comma-atmark can be directly parsed by the reader, and transformed to whatever piece of code that produces the desired result.

It is not obvious in a simple case, but see the followings:

In Allegro:

cl-user> `'`(,,(cons 1 (cons 2 nil)))
'(excl::bq-list (1 2))

In CLisp:

[6]> `'`(,,(cons 1 (cons 2 nil)))
'(LIST (1 2))

This means the inner backquote form are interpreted rather liberally.

On the other hand, I believe in Scheme this is the only possible result (modulo printing `a or (quasiquote a) etc.)

gosh> `'`(,,(cons 1 (cons 2 '())))
'`(,(1 2))

A practical outcome of this difference is that, in CL, you cannot write an S-expression with a comma without corresponding backquote.

In Scheme, you can write a macro that implicitly inserts quasiquote:

gosh> (define-syntax implicit-qq
        (syntax-rules ()
          [(_ x) (quasiquote x)]))
#<undef>
gosh> (implicit-qq ,(+ 1 2))
3

IIUC, you can't do this in standard CL, since the expression (implicit-qq ,(+ 1 2)) raises error at the reader, before the form gets passed to the macro expander.

Tags: quasiquote, r5rs, r6rs, CommonLisp

Comments (6)

2011/07/09

#permalink

Not like the old-school Lisp atoms

Ah it's been long since the last entry. It's not that Gauche development is staggered; the thing is, I run several development threads concurrently, using local topic branches, and it is hard to pick a topic that can make a complete entry with conclusion.

Nah, too much excuses. From now on, I dump here whatever thoughts and experiments I did, slow but constantly, even without a nice conclusion.

Ok, so let me pick up where I left... the last entry was March 16. The next thing I did was to commit the support of atoms.

No, not that traditional Lisp atoms which mean something that's not a cons. Having atomp was once important to explain how Lisp worked, but these days, with so many different datatypes, I don't think having a term for non-cons objects is important. That usage should become a history.

These days, more important atomic property is that something be atomic with regard to concurrent execution. To confess, I steal the term from Clojure. Though Gauche doesn't abstract concurrent threads nor isolate mutable operations so much as Clojure does, I do find some recurring patterns I use while writing concurrent code in Gauche, and this is one of them.

An atom here is a wrapper to bundle one or more objects that needs to be updated atomically. It is available in gauche.threads module. The following creates an atom with two vectors.

  (use gauche.threads)
  (define a (atom (vector 1) (vector 2)))

  a => #<<atom> 0x17467e0>

To access those vectors safely, you pass a procedure to atomic.

  (atomic a (lambda (v0 v1) ...do whatever with v0 and v1...))

No other thread can access/modify the atom a until atomic returns.

Gauche's builtin hashtable is not thread-safe. I was always wondering whether I should provide a separate thread-safe hashtable type, or provide a generic support to allow exclusive access to any objects, like Java's synchronized mechanism. The former would explode the number of primitive datatypes (how about thread-safe treemap? thread-safe sparse hash? &c.) The latter is cleaner, but difficult to implement efficiently (especially, think about allow synchronized on any pair!)

Now atoms can be used to provide mutable objects exclusive access. Just say (atom (make-hashtable 'eq?)), and then wrap any access to the table by atomic. I expect that most cases of trivial use of <mutex> can be replaced by atoms.

Not only wrapping mutable objects, but atom itself can be a mutable objects with synchronization. You can replace its wrapped value by atomic-update!.

  (define b (atom 0))

  (atomic-update! b (cut + 1 <>))

This makes an atom b works as a counter with mutex. If the atom wraps N values, the update procedure should be N-in N-out procedure. To extract the value, a convenient procedure atom-ref can be used.

  (atom-ref b)

You may notice that if you retrieve a mutable object wrapped by atom by atom-ref, you can modify it outside of protection. Atoms are not intended to prohibit you from doing dangerous things; they just make it easier to do things safely rather than error-prone ways.

Finally but importantly, those atom primitive consistently takes optional timeout and timeout-val arguments.

  (atomic a (^(x y) ... #t) 10.0 'timeout)

This returns #t if we can access the atom successfully, but if we cannot obtain a lock for 10.0 seconds, it gives up and returns timeout. I find it indispensable to write a robust code. The semantics of timeout and timeout-val is the same as other thread primitives.

Tags: atom, gauche.threads, Concurrency

Comments (0)

More entries ...

Gauche Devlog

About

Author

Recent Entries

Recent Comments

Recent Releases

2011/11/20

Two-argument reverse

2011/07/18

Unicode Case-mapping Support and Character Set Independence

The issues

The compromise

CSI---is it still worth?

2011/07/11

Generic file locking---is it really about files?

2011/07/10

Quasiquoting

2011/07/09

Not like the old-school Lisp atoms