2011/11/20
Two-argument reverse
Again I kept silent for over four months. I released 0.9.2 in August. Since then, quite a few changes have been made to the development HEAD, and I'd like to start explaining them as I approach the release of 0.9.3.
Some of the changes are small in code but can have a great impact---meaning they may change how you write Gauche programs. However, they take a lot of words to explain; I need some more time.
For today, I'll pick a rather small topic to get up to speed. It may even seem too trivial to mention, but somehow it matters to me, maybe because it relates to how I approach a program.
★ ★ ★
A common pattern to transform a list is to tail-recurse over the input, consing up the results, and at the end reverse the accumulated list to return it:
(let loop ([xs xs] [rs '()])
  (if (null? xs)
      (reverse rs)
      (loop (cdr xs) (cons (do-something (car xs)) rs))))
It is not necessarily a good style. Olin Shivers commented in his srfi-1 reference implementation that this style merely trades execution stack for cons cells, and that on a stack-based system you should rather use a natural (non-tail) recursion:
(let rec ([xs xs])
  (if (null? xs)
      '()
      (cons (do-something (car xs)) (rec (cdr xs)))))
In Gauche, the latter is just as fast as the former for lists of up to about a thousand elements. However, once the input exceeds that limit and the recursion gets too deep, Gauche starts moving stack frames to the heap (a kind of stack GC), so it becomes worse than building an intermediate list and reversing it at the end. So I typically use the cons-then-reverse idiom in Gauche.
Now, it is also common to write a function that adds the results to an existing pile of data. With the non-tail-recursive version it is trivial; I just replace '() with the existing pile of data:
(let rec ([xs xs])
  (if (null? xs)
      existing-pile-of-data
      (cons (do-something (car xs)) (rec (cdr xs)))))
If we use the tail-recursive call and reverse, however, we need to reverse the accumulated results and then append the existing pile of data. Srfi-1 has a procedure specifically for this pattern, append-reverse:

(let loop ([xs xs] [rs '()])
  (if (null? xs)
      (append-reverse rs previous-results)
      (loop (cdr xs) (cons (do-something (car xs)) rs))))
There's nothing wrong with this code; the append-reverse procedure is meant to avoid unnecessary copying. Except that I've always felt something awkward about the name: it somewhat misses the point by viewing the operation as "reverse, then append".
Look at this:
(reverse xs)             == (fold cons '() xs)
(append-reverse xs init) == (fold cons init xs)
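You can check the equivalence at the REPL (fold here is srfi-1's, which Gauche has built in):

gosh> (fold cons '() '(1 2 3))
(3 2 1)
gosh> (fold cons '(a b c) '(1 2 3))
(3 2 1 a b c)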
It looks like reverse is really a special case of append-reverse with init being '(). Viewing the operation as "reverse, then append" is actually upside-down. It may be more natural to regard the operation represented by append-reverse as the fundamental one, with reverse as a shorthand for the special case.
Then, wouldn't it be reasonable to extend reverse this way?
(define (reverse xs :optional (tail '()))
  (fold cons tail xs))
If you stick to the idea that reverse is an operation that reverses the order of a given list, this extension seems contrived. But how about thinking of reverse as a list constructor, one that happens to take its initial contents from xs, in reverse order?
Actually, 0.9.2's reverse is already extended to take an optional tail argument, though it's not documented.
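So the following already works, and behaves exactly as the fold-based definition above predicts:

gosh> (reverse '(1 2 3))
(3 2 1)
gosh> (reverse '(3 2 1) '(4 5 6))
(1 2 3 4 5 6)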
Tradition has great inertia, so I don't expect this extension to be widely accepted; after all, you can use srfi-1's append-reverse if you want this functionality. But I did get a little peace of mind with this extension.
Tags: reverse, srfi-1, append-reverse
2011/07/18
Unicode Case-mapping Support and Character Set Independence
In my local branch, I've been working on enhancing case-mapping procedures such as char-upcase and string-downcase to cover the entire character range. I expect I can merge it into the master branch soon.
As of 0.9.1, Gauche's case-mapping procedures only handle the ASCII range and pass through any characters beyond it. But R6RS and R7RS (draft) both define those case-mapping procedures in terms of the Unicode definitions, so it's about time to modernize Gauche's character handling as well.
It isn't much trouble to generate tables from the Unicode character database, although there's room to optimize them for speed and/or space (I'll post another entry to explain the strategy I took).
The issues
The trouble is that Gauche is designed to be as independent as possible from the internal character encoding scheme. While the internal encoding needs to be chosen at compile time, the code that depends on a specific encoding is small and mostly confined to the gauche/char_*.h files.
Case-mapping adds a nontrivial amount of code; it's not just a simple table of character-to-character mappings. In string case-mapping (e.g. string-downcase), each character's mapping is context-dependent, and to map cases properly we need to find word boundaries. That means we have to implement text segmentation, which is defined in UAX #29. That, in turn, depends on the character property database.
(The complication of this issue may be illustrated by this:

gosh> (string-downcase "ΧΑΟΣΧΑΟΣ.ΧΑΟΣ. Σ.")
"χαοσχαοσ.χαος. σ."

According to UAX #29 and R6RS, only the third sigma should take the final form.)
If I want to keep encoding-dependent code separate, I need to prepare those tables for each supported encoding and choose one at compile time according to the configuration option for the internal encoding. The tables can be generated mechanically, but having more code that is selectively used depending on compiler options means more testing and maintenance headaches. I don't like it.
Furthermore, full support of Unicode-defined algorithms such as text segmentation and normalization is useful even if Gauche's internal encoding is not utf-8. For example, if Gauche's internal encoding is euc-jp, then only a subset of Unicode characters is available for Scheme characters and strings, but such a program can still apply the Unicode algorithms to a sequence of Unicode codepoints (e.g. a vector of integers). If I implement the algorithms anyway, there's not much point in limiting access to them just because the characters aren't encoded in Unicode.
So, there are two issues regarding Gauche's character set independence and Unicode algorithms:
- How much of the code should be separated out as the 'encoding-dependent' part? If I make it completely separate, it's architecturally clean, but it may become a maintenance burden.
- Should I make the algorithms themselves available regardless of Gauche's internal encoding? There's no reason not to, but doing so means I need to implement the algorithms independently of Gauche's internal character representation.
The compromise
I haven't found a clear-cut solution. So far my implementation has become a hybrid; I lost the clear separation of the code that depends on the internal character set. I had to add #ifdefs here and there to handle special cases for specific encodings, mostly because of optimization: the efficient table configuration differs depending on the encoding, so the lookup code needs to change as well.
I also chose to generate some tables only for Unicode codepoints. If the internal encoding differs from Unicode, I convert the internal character code to a Unicode codepoint to look up the tables. It penalizes the EUC-JP and Shift_JIS encodings, but it has the merit of having one table and (mostly the same) lookup code. I can also use the table for procedures that work on Unicode codepoints rather than on Scheme characters, so that the algorithms can be used regardless of Gauche's internal encoding.
So far, the character general category table is the only table I generate for each internal encoding. The parser (read) needs to look it up, since the symbol constituent characters are defined in terms of character general categories (at least in R6RS). I don't want the symbol reader to convert each character just to determine the boundary.
For the rest of the lookup tables, including case mapping, I only generate tables to be looked up by Unicode codepoints. If the internal encoding is EUC-JP or Shift_JIS, the case-mapping functions do the conversion. The case conversion table has a structure that depends heavily on the Unicode character distribution, so it would take some work to regenerate it for different encodings. If performance becomes an issue, I'll consider using separate tables.
(NB: The ASCII range is the same in all the supported encodings, so I shortcut if the given character is in that range. The encoding conversion only occurs when the given character is outside of it. I might further improve performance by excluding from the table lookup those ranges of characters known to have no case, since case conversion is the identity for such characters.)
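In Scheme terms, the dispatch looks roughly like this. It's only a sketch of the strategy (the real code is in C, and lookup-upcase stands for a hypothetical codepoint-indexed case table); char->ucs and ucs->char are Gauche's conversion procedures:

;; Sketch: ASCII is handled inline; everything else goes through a
;; Unicode codepoint before the table lookup.
(define (generic-upcase c)
  (if (< (char->integer c) 128)       ; ASCII is shared by all encodings
      (if (char<=? #\a c #\z)
          (integer->char (- (char->integer c) 32))
          c)
      ;; convert to codepoint, look up, convert back
      (ucs->char (lookup-upcase (char->ucs c)))))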
CSI---is it still worth it?
It would have been a lot simpler if I had fixed the internal encoding to utf-8 (or another Unicode encoding): a single set of tables, and no separate APIs for characters and codepoints.
Unicode is not just a coded character set (CCS). It comes with a set of definitions of higher-level operations such as case conversion, text segmentation, normalization, etc., defined on top of the CCS. More and more applications assume these high-level operations are available.
Originally I decided to support a choice of internal encodings because I wanted to keep my options open; that is, I didn't want to be bound to one specific internal encoding. It was just a hunch that there might be some areas where alternative encodings would be beneficial. In fact it was useful, back in 2001, when almost all the files I had, and my Emacs shell buffer, were in EUC-JP, and I was having headaches whenever one of the tools in the toolchain didn't handle Unicode well and wreaked havoc.
These days most tools just work with Unicode. That imaginary benefit margin of alternative encodings gets thinner, and if I penalize them with slower higher-level character and string operations, whatever margin remains would go away---it'd be just simpler and faster to convert everything to Unicode internally.
Well, many pro-Unicode people would say "I told you so." Yeah, I admit it. If I keep supporting the alternative encodings, at some point I'll need to reorganize the code so that it allows me to implement the various Unicode algorithms in a codeset-independent way, probably by translating the Unicode character database fully into the target encoding, or something like that. For the moment I'll go with the current half-baked solution, since it satisfies the immediate needs. Let's see how it works out in actual applications.
So far almost all language implementations seem to have adopted Unicode. A notable exception is CRuby, which allows string objects with different encodings to coexist at runtime. I wonder how the CRuby developers will support, say, Unicode text segmentation, normalization, bidi, etc. on their multi-character-set strings.
Tags: text.unicode, Unicode, r6rs, r7rs, encoding
2011/07/11
Generic file locking---is it really about files?
Gauche provides POSIX file locking through the gauche.fcntl module, but the feature isn't available on all platforms. Providing a locking API that's always available has been on my TODO list for years. And it's finally in. The file.util module now has with-lock-file. It can be used like this:
(with-lock-file lock-file-name
  (lambda () ... do something exclusively ...))
It uses a classic inter-process locking scheme that relies on the atomicity of exclusive file creation. Optionally, you can rely on the atomicity of mkdir instead.
The function makes sure that if you throw an exception in the thunk, it unlocks the file; but if the application dies unexpectedly, it leaves the lock file (or directory) behind. If it finds a lock file that's very old, you may allow it to steal the lock, on the assumption that the creator of the lock file died without cleaning it up. It takes quite a few parameters to configure how to handle various situations.
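The underlying trick is the usual one: exclusive file creation either succeeds or fails atomically. Here is a minimal sketch of the idea---not the actual file.util implementation; it has no staleness check or retry limit, and it relies on open-output-file returning #f (rather than clobbering the file) when :if-exists is #f:

;; Naive lock-file based mutual exclusion.
(define (naive-with-lock-file path thunk)
  (let loop ()
    (cond [(open-output-file path :if-exists #f)
           => (lambda (port)
                (close-output-port port)
                (unwind-protect (thunk)
                  (sys-unlink path)))]   ; always release the lock
          [else (sys-sleep 1) (loop)]))) ; lock held; wait and retry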
★ ★ ★
Originally I planned to provide a lock API that abstracts the underlying lock mechanism---either POSIX fcntl file locking or using lock files (or directories/symlinks/...). In fact, I attempted to write one several times. However, the more I thought about it, the less appealing the idea became. POSIX fcntl locking and lockfile-based locking differ so much that attempts to hide the difference always led nowhere.
The biggest obstacle was that hiding POSIX file locking under the hood turned out to be extremely difficult, if not impossible. That is, if a locking library uses POSIX file locks down below, the user of the library must be very aware of it.
That's because, to use POSIX locks robustly, you have to meet at least two conditions: (1) making sure the fcntl lock is actually supported on the particular file, and (2) making sure that the file to lock won't be accessed by other libraries/threads in the same process. (The latter condition comes from the fact that a POSIX fcntl lock is per process, not per file descriptor. See, for example, A tale of two standards and On the Brokenness of File Locking for detailed discussions.)
These conditions aren't necessarily a show-stopper for applications---at the application level, you may be able to choose a file to lock that's likely to satisfy them. At the library level, however, it is not trivial to check whether a given file really lives on a filesystem that supports POSIX locks, and it is probably impossible to make sure the file won't be accessed by other parts of the program.
There is also a conceptual difference between POSIX fcntl locks and lock-file based locking---the former locks a file, while the latter is not really about a file; it's a way to mutex multiple processes. Of course you can use a shared file lock as an inter-process mutex and vice versa, but conflating the two makes me uneasy.
So, with-lock-file ended up being based on a lock file/directory. It's not very fast (it involves multiple syscalls and file access), but I expect it to cover most of the trivial cases of avoiding races between processes.
Tags: file.util, gauche.fcntl, FileSystem, InterProcessCommunication
2011/07/10
Quasiquoting
As of 0.9.1, Gauche doesn't handle quasiquote patterns like this:
gosh> ``(,,@(list 1 2))
`(,,@(list 1 2))
It is arguably a bug, or unspecified behavior in R5RS. The outer-level quasiquote should expand the unquote-splicing, and that should produce (unquote 1 2); however, the quasiquote syntax in R5RS does not allow such a pattern, so the resulting form wouldn't be a valid quasiquotation. That doesn't explain the above result, but if it's an error in R5RS terms, anything can happen. (In fact, the result wasn't intended at all; it was just the outcome of failing to recognize the pattern.)
Still, since (unquote 1 2) is allowed as a pure datum, it may be more reasonable for ``(,,@(list 1 2)) to expand into `((unquote 1 2)).
R6RS resolved this ambiguity by allowing unquote to take multiple arguments. Since that is upward compatible with R5RS, I decided to implement the R6RS behavior in Gauche. Now you get this:
gosh> ``(,,@(list 1 2))
`((unquote 1 2))
And this:

gosh> `((unquote 1 2))
(1 2)
★ ★ ★
While thinking about this, I found a difference between Scheme and CL regarding quasiquoting that I was not aware of before.
In Scheme, `a, ,a and ,@a are merely reader-level abbreviations of (quasiquote a), (unquote a) and (unquote-splicing a), respectively. The real semantics of quasiquoting is defined in terms of the latter S-expressions.
In CL, quasiquoting (backquotes) is defined at the reader level. That is, a comma-datum doesn't need to be an abbreviation of something like (unquote datum) or (comma datum); it doesn't need to have a list equivalent at all. The comma and comma-atmark can be parsed directly by the reader and transformed into whatever piece of code produces the desired result.
It is not obvious in a simple case, but see the following.

In Allegro:

cl-user> `'`(,,(cons 1 (cons 2 nil)))
'(excl::bq-list (1 2))

In CLisp:

[6]> `'`(,,(cons 1 (cons 2 nil)))
'(LIST (1 2))
This means the inner backquote form is interpreted rather liberally. On the other hand, I believe that in Scheme this is the only possible result (modulo printing `a vs (quasiquote a), etc.):

gosh> `'`(,,(cons 1 (cons 2 '())))
'`(,(1 2))
A practical outcome of this difference is that, in CL, you cannot write an S-expression containing a comma without a corresponding backquote.
In Scheme, you can write a macro that implicitly inserts a quasiquote:

gosh> (define-syntax implicit-qq
        (syntax-rules () [(_ x) (quasiquote x)]))
#<undef>
gosh> (implicit-qq ,(+ 1 2))
3
IIUC, you can't do this in standard CL, since the expression (implicit-qq ,(+ 1 2)) raises an error in the reader, before the form gets passed to the macro expander.
Tags: quasiquote, r5rs, r6rs, CommonLisp
2011/07/09
Not like the old-school Lisp atoms
Ah, it's been a long time since the last entry. It's not that Gauche development has stalled; the thing is, I run several development threads concurrently, using local topic branches, and it is hard to pick a topic that makes a complete entry with a conclusion.
Nah, too many excuses. From now on, I'll dump here whatever thoughts and experiments I have, slowly but constantly, even without a nice conclusion.
OK, so let me pick up where I left off... the last entry was March 16. The next thing I did was to commit support for atoms.
No, not the traditional Lisp atom, which means something that's not a cons. Having atomp was once important to explain how Lisp worked, but these days, with so many different datatypes, I don't think having a term for non-cons objects matters. That usage should become history.
These days, the more important atomic property is that something be atomic with regard to concurrent execution. To confess, I stole the term from Clojure. Though Gauche doesn't abstract concurrent threads or isolate mutable operations as much as Clojure does, I do find some recurring patterns while writing concurrent code in Gauche, and this is one of them.
An atom here is a wrapper that bundles one or more objects that need to be updated atomically. It is available in the gauche.threads module. The following creates an atom with two vectors:

(use gauche.threads)
(define a (atom (vector 1) (vector 2)))

a => #<<atom> 0x17467e0>
To access those vectors safely, you pass a procedure to atomic:

(atomic a
  (lambda (v0 v1)
    ...do whatever with v0 and v1...))
No other thread can access/modify the atom a until atomic returns.
Gauche's builtin hashtable is not thread-safe. I was always wondering whether I should provide a separate thread-safe hashtable type, or generic support for exclusive access to arbitrary objects, like Java's synchronized mechanism. The former would explode the number of primitive datatypes (how about a thread-safe treemap? a thread-safe sparse hash? &c.). The latter is cleaner, but difficult to implement efficiently (especially, think about allowing synchronized on any pair!).
Now atoms can be used to give mutable objects exclusive access. Just say (atom (make-hash-table 'eq?)), and then wrap any access to the table with atomic. I expect most trivial uses of <mutex> can be replaced by atoms.
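For example, a shared counter table could look like this (a sketch; hash-table-update! is the ordinary update procedure, and the atom guarantees only one thread runs it at a time):

(use gauche.threads)

(define tab (atom (make-hash-table 'eq?)))

;; Increment the count for key under the atom's protection.
(define (count-up! key)
  (atomic tab
    (lambda (ht) (hash-table-update! ht key (cut + 1 <>) 0))))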
Not only can an atom wrap mutable objects; the atom itself can be a mutable object with synchronization. You can replace its wrapped value with atomic-update!:
(define b (atom 0))
(atomic-update! b (cut + 1 <>))
This makes the atom b work as a counter with a mutex. If the atom wraps N values, the update procedure should be an N-in, N-out procedure. To extract the value, the convenience procedure atom-ref can be used:

(atom-ref b)
You may notice that if you retrieve a mutable object wrapped by an atom via atom-ref, you can modify it outside of the protection. Atoms are not intended to prohibit you from doing dangerous things; they just make it easier to do things safely rather than in error-prone ways.
Finally, but importantly, these atom primitives consistently take optional timeout and timeout-val arguments:
(atomic a (^(x y) ... #t) 10.0 'timeout)
This returns #t if we access the atom successfully, but if we cannot obtain the lock for 10.0 seconds, it gives up and returns timeout. I find this indispensable for writing robust code. The semantics of timeout and timeout-val are the same as in other thread primitives.
Tags: atom, gauche.threads, Concurrency