|
Allegro CL version 6.2 New since 6.2 release. |
This document contains the following sections:
1.0 Regular Expression handling in Allegro CL: two implementations
A new regular expression handler in Allegro CL 6.2 was made available with a patch released around June 7, 2004. Use sys:update-allegro to download the patch. The new module is named :regexp2 (you have the patch if there is a regexp2.fasl in the code/ subdirectory of the Allegro directory).
The new regexp2 module has roughly the same API as the older regexp
module (which came with the original Allegro CL 6.2 release), except the functions have
slightly different names, to avoid name conflicts. The functions are named by symbols in
the excl package, and are named compile-re,
match-re, split-re, and replace-re. (The older module has
functions compile-regexp, match-regexp, split-regexp, and replace-regexp.) Both modules
can be loaded and used at the same time in a running Lisp.
The new regexp2 module is described in section Section 2.0 The new regexp2 module. The older regexp module is described in Section 3.0 The older regexp API. Much information is common to both, but we have not integrated the two descriptions because we wished to release the new module as quickly as possible.
The new regexp2 module provides a fast, mostly-Perl-compatible regular expression matcher. It handles Unicode character set (UCS-2). The module was made available by a patch on approximately June 7, 2004. Download this patch (along with other patches) in the usual fashion with sys:update-allegro. Symbols in the regexp2 module are in the excl package.
You load the regexp2 module with the form
(require :regexp2)
All the functionality of the regexp2 module is described in this document. There are no separate documentation pages for operators etc. in the module at this time. Note that there are documentation pages for operators etc. in the older regexp module (described in section Section 3.0 The older regexp API below in this document).
Like Perl, there are four types of 'mode switches' that affect the meaning of regular expressions. The switch can be specified by the keyword arguments passed to regexp APIs, or by embedding 'flags' in the regular expression itself. The following table lists the valid mode switches.
| Flag | Keyword argument | Descrption (when 'on') |
| i | :case-fold |
Case-insensitive match. Currently, case folding is only effective for ASCII characters. |
| m | :multiple-lines |
Treats the input string as multiple lines. If turned on, "^" and "$" matches the start and end of any line instead of the beginning and end of the whole string. |
| s | :single-line |
Treats the input string as a single line. If turned on, "." matches any character, even a newline. Normally "." matches any character except a newline. |
| x | :ignore-whitespace |
Extend syntax. Whitespace in the regular expression is ignored, and comments can be inserted, to increase legibility of the regular expression. |
Within a regular expression, a mode switch can be turned on/off locally by (?:...)
construct. For example, (?i:foo) makes foo match case
insensitively. (?-i:foo) makes foo match case sensitively. You
can combine multiple flags, like (?im-sx:foo). A construct like (?i)
changes the mode switch until the end of the current grouping.
| Construct | Matches |
| x | The character x |
| \\ | A backslash character |
| \0 | #\null |
| \0n | (<= 0 n 7) A character x where (= (char-code x) #on) |
| \0nn | (<= 0 n 7) A character x where (= (char-code x) #onn) |
| \nn | (<= 0 n 7) A character x where (= (char-code x) #onn), if this cannot be a back-reference |
| \mnn | (<= 0 m 3) (<= 0 n 7) A character x where (= (char-code x) #omnn) |
| \xhh | A character x where (= (char-code x) #xhh) |
| \xh | A character x where (= (char-code x) #xh) if at the end of regexp or followed by non-hexdigit char. |
| \x{h...} | A character x where (= (char-code x) #xh...). |
| \a | #\bell |
| \e | #\esc |
| \f | #\form |
| \t | #\tab |
| \n | #\linefeed |
| \r | #\return |
| \cx | control character corresponding to x (e.g. \cG for #\bell). |
| Construct | Matches |
| . | any character except newline (if 's' flag is on, matches any character including newline). |
| \d | A digit character [0-9] |
| \D | A non-digit character [^0-9] |
| \s | A whitespace character [\n\r\t\f] |
| \S | A non-whitespace character [^ \n\r\t\f] |
| \w | A word-constituent character. Currently, alphanumeric characters, #\_, and any character whose code is larger than 128 is considered as a word constituent character. The treatment of large character set may be improved in the later versions. |
| \W | A non-word-constituent character |
| Construct | Matches |
| [abc] | a, b, or c |
| [^abc] | any character except a, b, or c |
| [a-zA-Z] | character range (a through z or A through Z) |
| \n, \nn, \mnn, \x, \xh, \xhh, \x{h..} \\, \a, \e, \f, \t, \n, \r, \cx | Octal and hexadecimal character notation, and the above single character escape, are valid in character class as well. |
| other | \d, \D, \s, \S, \w, and \W can also be used within character class notation, e.g. [abc\d] == [abc0-9]. |
| Construct | Matches |
| ^ | Beginning of a string. (if 'm' flag is on, matches at the beginning of a string or just after a linefeed). |
| $ | End of a string. (if 'm' flag is on, matches at the end of a string or just before a linefeed). |
| \b | A word boundary. |
| \B | A non word boundary. |
| \A | Beginning of a string. |
| \Z | End of a string, or before linefeed at the end. |
| \z | End of a string. |
| Construct | Matches |
| X? | Zero or one occurrence of X |
| X* | Zero or more occurrence of X |
| X+ | One or more occurrence of X |
| X{n} | n times of X |
| X{n,} | n times or more repetition of X |
| X{n,m} | between n and m times of repetition of X |
| Construct | Matches |
| X?? | Zero or one occurrence of X |
| X*? | Zero or more occurrences of X |
| X+? | One or more occurrences of X |
| X{n}? | n times of X |
| X{n,}? | n times or more repetition of X |
| X{n,m}? | between n and m times of repetition of X |
| Construct | Matches |
| XY | X followed by Y |
| X|Y | X or Y |
See Section 2.3 Capturing and back reference below for the operation of capturing and reference.
| Construct | Matches |
| (X) | Groups X, and captures submatch. |
| (?<name>X) | Groups X, and captures submatch, assigning name as the name of the submatch. Name must consist of a sequence of word-constituent characters. |
| (?:X) | Groups X, but does no captures. |
| (?imsx-imsx:X) | Groups X, with turning on/off the given flags. |
| (?imsx-imsx) | Successfully matches nothing, and changes the given flags within the current group. |
| (?=X) | Zero-width positive look-ahead |
| (?!X) | Zero-width negative look-ahead |
| (?<=X) | Zero-width positive look-behind |
| (?<!X) | Zero-width negative look-behind |
| (?>X) | Matches X, and never backtrack. (also known as "standalone" group) |
| (?(n)Y) | If submatch N has a value, match Y. |
| (?(n)Y|Z) | If submatch X has a value, match Y; otherwise match Z. |
| (?(?=X)Y|Z) (?(?!X)Y|Z) (?(?<=X)Y|Z) (?(?<!X)Y|Z) |
Try the look-ahead or look-behind assertion of X, and if succeeds, match Y; otherwise match Z. '|Z' part can be omitted. |
| Construct | Matches |
| \n | Matches the n-th submatch (n>=1). If n-th submatch doesn't have a value, it doesn't match anything (match just fails). If n is more than one digit, it becomes a back reference only if there are that many capturing groups in the regular expressions; otherwise, it is interpreted as an octal character notation if possible. |
| \k<name> | Matches the submatch with name. If there are more than one submatches that has the name, this tries to match the one that has the biggest number first, and if it fails, the smaller numbered submatches are tried. |
Capturing submatches (X) and (?<name>X) are
numbered in the order of its opening parenthesis from left to right. Named submatch is
counted the same as unnamed submatch, and can be back-referenced by both name and number.
When the input string matches X, the portion of the input string is saved. It can be referenced within a regular expression, by the back reference construct, or can be returned to the user program as a submatch.
If the capturing construct matches more than once, it saves the last match.
There are five functions (listed first) and three macros in the API:
Arguments: string &key case-fold ignore-whitespace multiple-lines single-line (return :unknown) back-end
Given a string that describes a regular expression, creates a compiled regular expression object.
The keyword argument case-fold controls whether the result matcher becomes case-sensitive or not by default. (The 'i' flag of Perl RE).
If the keyword argument ignore-whitespace is given and true, whitespace characters within string are ignored as regexp spec. (The 'x' flag of Perl RE).
If the keyword argument multiple-lines is given and true, the created matcher recognizes multiple lines, i.e. ^ and $ matches beginning and end of individual lines. Without this argument, ^ and $ matches only beginning and end of the input string. (The 'm' flag of Perl RE).
If the keyword argument single-line is given and true, '.' will match any character including newline. Without single-line, '.' matches any character except newline. (The 's' flag of Perl RE).
The keyword argument back-end can be either regexp:native
or regexp:vm. If it is regexp:native, the compiled regexp code
is a native code; which runs fast, but it takes more time to compile. If it is regexp:vm,
the compiled regexp code is an instruction sequence of a virtual machine to match the
input string.
The keyword argument return specifies how the created matcher returns
the result. If it is nil, the matcher will return only a boolean value which
indicates whether the input string matches the regexp or not. If it is :string,
the matcher also returns the substrings of the matched region and all submatches of the
input string. If it is :index, the matcher returns pairs of start and end
indices of the matched region and all submatches. If the keyword argument is not given at
compile time, the matcher is created so that it will accept return keyword argument
at the matching time; however, compile-re
will generate more efficient code if it knows the return type at the compile time.
Arguments: regexp input &key return case-fold single-line multiple-lines ignore-whitespace start end back-end
regexp must be a string specifying a regular expression, or a compiled
regular expression (an object returned by compile-re).
This function tries to match the string input with the regexp. If they do
not match, nil is returned as the single returned value. If they match, t
is returned as a first value, and the match result of the whole, submatch 1, submatch 2,
... are returned as the other values.
The way of returning match result can be specified by the return keyword
argument. If it is :string (the default), substrings of input are
returned. If it is :index, a cons of start and end index within input
is returned for each match.
The keyword argument start and end limits the region of input.
The keyword argument case-fold and back-end are passed to the regexp:compile-re function, when a string is given as the value of regexp. They are ignored when regexp is a compiled regular expression.
A compiler-macro is defined for this function. If regexp is a literal string, compile-re is called at compilation time, so you do not need to pay the cost of regexp compilation at run-time.
Arguments: regexp string &key count (start 0) end case-fold single-line multiple-lines ignore-whitespace
This function scans string for a delimiter given by regexp and returns a list of substrings. If count is given, then split into no more than count substrings, in which case the last substring will contain the rest of the string.
regexp should match a non-zero length string, or an error is signaled.
Arguments: string regexp substitution &key count (start 0) end case-fold single-line multiple-lines ignore-whitespace
This function replaces substring(s) in string that matches regexp for substitution, and returns the replaced strings.
regexp can be a string that specifies a regular expression, or an already-compiled (by compile-re) regular expression. regexp should match a non-zero length string, or an error is signaled.
substitution can be a string, or a function that takes one argument, a list of match substrings returned by the regexp matcher. The function must return a string, which is then used as a substitution string.
The keyword argument count limits the maximum number of substitutions.
If it is nil, all occurrences of regexp in string are replaced.
The keyword arguments start and end limit the region in string where matching occurs.
Other keyword arguments are passed to compile-re to compile regexp.
Arguments: regexp string indexes selector &key (type :string)
This is a convenience function to extract a specified submatch information from the value returned by match-re.
regexp should be a compiled regexp, and indexes should be a list
of (start . end) index pairs returned by match-re. selector should be an
integer that specifies a captured submatch (or 0 for the whole match), or a string or a
symbol that names a named submatch.
Alternatively, you can pass the opaque 'match' object returned by match-re with :return :match
argument, as regexp. The match object knows the information about the match
results, so you don't need string and indexes; just pass nil to
them.
The type argument may be :string, :after, :before
or :index. It specifies the return value. If it is :string (the
default), the substring of the specified submatch is returned. If it is :after
or :before, the substring after or before the specified submatch is returned,
respectively. If it is :index, a pair of (start . end) indexes
are returned.
If the specified submatch isn't included in indexes, nil is
returned.
A typical usage pattern of this function is like this:
(let ((r (multiple-value-list (match-re regexp string :return :index)))) (re-submatch regexp string (cdr r) 3)))
Or, if you use the match object, like this:
(let ((match (match-re regexp string :return :match))) (re-submatch match nil nil 3))
The following macros were added by a patch around July 30, 2004. Examples of their use follows their definitions.
Arguments: regexp bindings &body body
regexp must be a string that specifies a regular expression, or a list of such a string followed by options, or an expression that evaluates to a compiled regular expression.
re-lambda returns a
closure which takes the arguments (string &key if-does-not-match).
When the closure is called, it scans the given string by regexp, and if it matches, then evaluates body with the environment where the variables in bindings are bound to the substrings of the match(es). The closure returns the result(s) of the last expression of body.
bindings is like the binding form of let. Each element of bindings must be either one of the following:
(var integer) ;; var binds to integer-th submatch (0 for whole). (var string/symbol) ;; var binds to the submatch named by
string/symbol. var ;; var binds to the submatch named by var. If the specified match doesn't have a value, var is bound to nil.
You can also bind a substring before or after the specified submatch, by giving modifier
like (var integer :before) or (var integer :after).
If string doesn't match regexp, body is not evaluated, and the value given to the keyword argument if-does-not-match is returned.
If regexp is the form of (string-re options ...), then
options are passed to compile-re
to create a compiled regular expression. It takes place at macro-expansion time, so
options must only contain constant literals.
Arguments: regexp string bindings &body body
regexp and bindings are the same as of re-lambda. string must be
an expression that evaluates to a string. This macro evaluates string, matches it
with regexp, and if it matches, bind variables with (sub)matches according to bindings,
then evaluates body. If string doesn't match regexp, body is
not evaluated and nil is returned.
Arguments: string &rest clauses
string must be an expression that evaluates to a string. The clauses are then examined one by one, checking whether it matches string or not. clauses must be one of the following forms:
regexp, (sub)matches are
bound to the variables according to bindings, then expr ...
are evaluated. See the documentation of re-lambda
for information on the specifiction of regexp and bindings.
(:test proc expr ...): proc must be an
expression that evaluates to a procedure that takes single argument. proc
is called with string, and if it returns a true value, evaluates expr ....
(:testlet proc var expr ...): like :test
above, but binds the result of proc to var while
evaluating expr.... (t expr ...): always match and evaluates expr ....
This can be used to describe the fallback case. If no clause match, nil is returned.
cl-user(4): (setq f
(re-lambda "([^ ]+) ([^ ]+) ([^ ]+)"
((foo 1) (bar 2) (baz 3))
(list foo bar baz)))
#<Interpreted Function (unnamed) @ #x71ed7892>
cl-user(5): (funcall f "foo the bar")
("foo" "the" "bar")
cl-user(6): (re-let "([^ ]+) ([^ ]+) ([^ ]+)"
"foo the bar"
((foo 1) (bar 2) (baz 3))
(list foo bar baz))
("foo" "the" "bar")
cl-user(7):
cl-user(9): (re-case "foo the barmy"
("foo a (.*)" ((it 1)) (list it))
("foo the (.*)" ((it 1)) (list it))
(t :no-match))
("barmy")
cl-user(10): (re-case "foo a barmy"
("foo a (.*)" ((it 1)) (list it))
("foo the (.*)" ((it 1)) (list it))
(t :no-match))
("barmy")
cl-user(11): (re-case "foo xx barmy"
("foo a (.*)" ((it 1)) (list it))
("foo the (.*)" ((it 1)) (list it))
(t :no-match))
:no-match
cl-user(12):
\N{name} named char
\lx lowercase x
\ux uppercase x
\Lx..\E lowercase x..
\Ux..\E uppercase x..
\Qx..\E quote non-alphanumeric chars in x.
Theoretically ACL's regexp library uses the same mechanism that Perl and CL-PPCRE are using: Nondeterministic finite automaton (NFA). When there are multiple possibilies of matching, it "remembers" the current state and tries each possibility one at a time. If the trial fails, it backs up the last saved state and tries the next possibility; it is called "backtrack". It is possible to compose a very short regular expression that does a huge number of backtracks; if you have nested repetitions and alternations, the number of required backtracks grows exponentially.
The regexp optimizer tries to reduce the number of backtrack, but it is not always possible. Here are some tips to improve the performance of the matcher.
(?>X) in the inner loop.
It effectively turns the nested repetition into unnested repetition from the viewpoint of
NFA. Lots of optimizations can be done for unnested repetition, but not much can be done
for a nested one. Note that an unnested greedy repetition followed by a character or a
character set that are exclusive to the beginning of the repetition becomes a
non-backtracking regexp; for example, the regular expression (\w+)\s+ runs
without backtracking.
A regular expression is a pattern that matches one or more strings. Technically, the patterns we support go beyond regular expressions, however we'll continue to call them regular expressions. Allegro CL provides regular expression (regexp) functionality as described in this document.
In a regular expression string, characters are either normal or special. A normal character stands for itself, a special character has a meaning described below.
There are two contexts in a regular expression: characters within a [...] expression and those elsewhere.
Special characters outside a [...] are: * . [ ] \ + ^ $
(asterisk, period, left bracket, right bracket, backslash, plus sign, circumflex accent,
and dollar sign). These characters can be made normal characters by preceding them with a
backslash. Some of these characters are only special in certain places in the string (see
below for the details).
Special characters within a [...] are: ^ ]. These characters can
only be made non-special by placing them in a certain position in the [...] expression
(more details below). Note that backslash is not a special character in this context.
There are certain characters that when preceded by a backslash outside of a
[...] expression turn into special characters. Those characters are: ( ) | w W b B 0
1 2 3 4 5 6 7 8 9.
A regular expression is defined as:
| a | Where a is any non-special character, matches itself. |
| xy | Where x and y are regular expressions, matches the concatenation of the regular expressions. |
| m* | A single character regular expression followed by * matches zero or more occurrences of m. If there is a choice, it always matches the longest sequence of m's. |
| m+ | A single character expression followed by + matches one or more occurrences of m. If there is a choice, it always matches the longest sequence of m's. |
| . | A period matches any character (except newline--see notes on the match-regexp function). |
| ^ | If this is the first character of the regular expression string it forces the match to start at the beginning of the to-be-matched string. If this character appears after the beginning of the string it stands for itself. |
| $ | If this is the last character of the regular expression string it forces the match to end at the end of the to-be-matched string. If this character appears before the end of the regular expression string, it stands for itself. |
| [..] | This matches exactly one character from the set of characters denoted by the pattern
inside the brackets. This pattern has a different form than elsewhere in the regular
expression: [abcs-y3-8] matches either a, b or c,
or s through y, or 3 through 8. You
can invert the set using the caret as the first character. [^a-z] matches any
character not in the range a through z. In order to include the
right bracket in the set it has to be listed first (or after the caret): []ab]
matches a, b or the right bracket. [^]ab] matches
any character except a, b or the right bracket. In order to
match a hyphen it has to be first or last: [b-] matches b or a
hyphen. In order to match a caret it has to be somewhere other than the first character. [b^]
matches b or a caret. |
| \(x\) | This grouping syntax matches whatever x matches, and at the same time remembers what x matches. There can be up to 9 groups defined in a regular expression string. Each group is given a number from 1 to 9 based on the order in which they appear in the pattern string. |
| x\|y | This tries to match x, and if that fails it tries to match y. To
control what constitutes x and y you can use the \( \)
grouping. For example, abc\|def means abc or def
whereas a\(bc\|de\)f means a followed by bc or de
followed by f. |
| \n | If n is 1 through 9 then this stands for the string matched by group 1 through 9. If there is no string assigned to group n then this is match failure. There is no group 0 so the form \0 is illegal and an error is signaled when the regular expression is parsed. |
| \w | Matches a word character. It is equivalent to [a-zA-Z0-9]. |
| \W | Matches anything but a word character. It is equivalent to [^a-zA-Z0-9]. |
| \b | Matches a blank character (one of space, form feed, tab and newline). |
| \B | matches anything but a blank character (one of space, form feed, tab and newline). |
| \a | For any character a not mentioned above stands for a itself. It is unwise to put in extra backslashes since while \x may stand for just x today, in the future it may have a different meaning. |
When typing a regular expression in Lisp source code keep in mind that in order to represent a backslash in a string constant you need two backslashes. The Lisp reader reads "foo\+" as "foo+", when what you probably wanted was "foo\\+" (where you are putting a backslash in front of the + to remove its special meaning so you could match the string foo+.)
The + and * characters must follow a single character regular expression. They
cannot follow a group expression, even if that group matches just one character. In other
words \(a\)* is not legal. [a]* is legal since the [..]
expression always matches one character.
Various operating systems and other utilities provide regular expression parsers. Here are some notes on some of them:
x\{m,n\} meaning between m
and n occurrences of x. It also supports \< and \>
as matching word beginnings and word endings, where a word is a C identifier. The UNIX
version does not support the + special character (possibly since you can get
it with \{1,\}). x? meaning 0 or 1 occurrence of x. We
can get that with \(x\|\). Several functions are supported in this facility. Each is documented on its own page. The functions are
We repeat most of the information for compile-regexp and match-regexp (the two more important functions) here for reading convenience.
Arguments: regexp
Compiles the string regexp into a regular expression object and returns that object. If there are syntax errors in the string, an error will be signaled.
Arguments: regexp match-string &key (newlines-special t) case-fold shortest (return :string) (start 0) (end (length match-string))
The regexp argument is a regular expression object (the
result of regexp-compile) or it is a string (in which case it will be
compiled into a regular expression object). The match-string is a string to
match against the regular expression. The function will attempt to match the regular
expression against the match-string starting at the first character of the match-string,
and if that fails it will look inside the match-string for a match (unless
the regular expression begins with a caret).
The keyword arguments are:
:newlines-special |
If true then a newline will not match the . [i.e. the dot] regular expression. This is useful to prevent multiline matches. |
:case-fold |
If true then the match-string is effectively mapped to lower case before
doing the match. Thus lower case characters in the regular expression match either case
and upper case characters match only upper case characters. |
:return |
A failed match returns If the value of return is If the value of return is |
:start |
The first character in the match-string to match against. |
:end |
One past the last character in the match-string to match against. |
| shortest | This makes match-regexp
return the shortest rather than the longest match. One motivation for this is parsing
html. Suppose you want to search for the next item in italics, which in html looks like <i>foo</i>
. Suppose your string is "<i>foo</i> and
<i>bar</i>". The following example shows the difference:
user(10): (match-regexp "<i>.*</i>" string)
t
"<i>foo</i> and <i>bar</i>
user(11): (match-regexp "<i>.*</i>" string
:shortest t)
t
"<i>foo</i>"
|
Compilation note: there is a compiler macro defined for match-regexp
that will handle in a special way match-regexp calls where the first argument
is a constant string. That is, this form (match-regexp "foo" x)
will compile to code that will arrange to call compile-regexp on the string
when the code is fasled in. Since the cost of compile-regexp is high, this
saves a lot of time.
Copyright (c) 1998-2004, Franz Inc. Oakland, CA., USA. All rights
reserved.
Documentation for Allegro CL version 6.2.
Created 2004.7.25.
|
Allegro CL version 6.2 New since 6.2 release. |