9.1 Writing Regexp Patterns

A string or byte string can be used directly as a regexp pattern, or it can be prefixed with #rx to form a literal regexp value. For example, #rx"abc" is a string-based regexp value, and #rx#"abc" is a byte string-based regexp value. Alternately, a string or byte string can be prefixed with #px, as in #px"abc", for a slightly extended syntax of patterns within the string.

Most of the characters in a regexp pattern are meant to match occurrences of themselves in the text string. Thus, the pattern #rx"abc" matches a string that contains the characters a, b, and c in succession. Other characters act as metacharacters, and some character sequences act as metasequences. That is, they specify something other than their literal selves. For example, in the pattern #rx"a.c", the characters a and c stand for themselves, but the metacharacter . can match any character. Therefore, the pattern #rx"a.c" matches an a, any character, and c in succession.

When we want a literal \ inside a Racket string or regexp literal, we must escape it so that it shows up in the string at all. Racket strings use \ as the escape character, so we end up with two \s: one Racket-string \ to escape the regexp \, which then escapes the .. Another character that would need escaping inside a Racket string is ".

If we needed to match the character . itself, we can escape it by preceding it with a \. The character sequence \. is thus a metasequence, since it doesn’t match itself but rather just .. So, to match a, ., and c in succession, we use the regexp pattern #rx"a\\.c"; the double \ is an artifact of Racket strings, not the regexp pattern itself.

The regexp function takes a string or byte string and produces a regexp value. Use regexp when you construct a pattern to be matched against multiple strings, since a pattern is compiled to a regexp value before it can be used in a match. The pregexp function is like regexp, but using the extended syntax. Regexp values as literals with #rx or #px are compiled once and for all when they are read.

The regexp-quote function takes an arbitrary string and returns a string for a pattern that matches exactly the original string. In particular, characters in the input string that could serve as regexp metacharacters are escaped with a backslash, so that they safely match only themselves.

> (regexp-quote "cons")

"cons"

> (regexp-quote "list?")

"list\\?"

The regexp-quote function is useful when building a composite regexp from a mix of regexp strings and verbatim strings.