XML: Parsing and Writing
Paul Graunke and Jay McCarthy
The xml library provides functions for parsing and
generating XML. XML can be represented as an instance of the
document structure type, or as a kind of S-expression that is
called an X-expression.
The xml library does not provide Document Type
Declaration (DTD) processing, including preservation of DTDs in read documents, or validation.
It also does not expand user-defined entities or read user-defined entities in attributes.
It does not interpret namespaces either.
1 Datatypes
Represents a location in an input stream.
Represents a source location. Other structure types extend source.
When XML is generated from an input stream by read-xml,
locations are represented by location instances. When XML
structures are generated by xexpr->xml, then locations are symbols.
Represents an externally defined DTD.
Represents a document type.
Represents a comment.
Represents a processing instruction.
Represents a document prolog.
Represents a document.
Represents an element.
Represents an attribute within an element.
Returns true if x is an exact-nonnegative-integer whose character interpretation under UTF-8 is from the set (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]), in accordance with section 2.2 of the XML 1.1 spec.
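As a small illustration of the ranges listed above, the following sketch applies valid-char? to a few code points. The variable names are made up for this example; only valid-char? comes from the library.

```racket
#lang racket/base
(require xml)

;; Code points inside the allowed ranges.
(define tab-ok?    (valid-char? #x9))      ; tab
(define space-ok?  (valid-char? #x20))     ; space
(define astral-ok? (valid-char? #x10000))  ; first code point above #xFFFD range

;; Code points outside the allowed ranges.
(define nul-ok?       (valid-char? #x0))    ; NUL is not in the set
(define surrogate-ok? (valid-char? #xD800)) ; falls between #xD7FF and #xE000

;; A value that is not an exact nonnegative integer is rejected as well.
(define string-ok? (valid-char? "a"))
```

The first three definitions should be true and the last three false, following the set given in the description.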
Represents a symbolic or numerical entity.
Represents PCDATA content.
Represents CDATA content.
The string field is assumed to be of the form
<![CDATA[‹content›]]> with proper quoting
of ‹content›. Otherwise, write-xml generates
incorrect output.
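(struct exn:invalid-xexpr exn:fail (code)
    #:extra-constructor-name make-exn:invalid-xexpr)
  code : any/c
Raised by validate-xexpr when passed an invalid X-expression. The code
field contains the invalid part of the input.
A separate exception, exn:xml, is raised by read-xml
when an error in the XML input is found.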
The following grammar describes expressions that create X-expressions:
A string is literal data. When converted to an XML stream,
the characters of the data will be escaped as necessary.
A pair represents an element, optionally with attributes. Each
attribute’s name is represented by a symbol, and its value is
represented by a string.
A symbol represents a symbolic entity. For example,
'nbsp represents &nbsp;.
A valid-char? value represents a numeric entity. For example,
#x20 represents &#x20;.
A cdata is an instance of the cdata structure type,
and a misc is an instance of the comment or
p-i structure types.
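The cases described above can be checked with xexpr?. This is a minimal sketch: the sample values and the names examples, all-valid?, and bad-attribute? are invented for illustration.

```racket
#lang racket/base
(require xml)

;; One sample value for each kind of X-expression described above.
(define examples
  (list "plain text"                                    ; string: literal data
        '(p "element without an attribute list")        ; element
        '(a ((href "index.html")) "element with attributes")
        'nbsp                                           ; symbolic entity
        #x20                                            ; numeric entity
        (make-cdata #f #f "<![CDATA[raw <text>]]>")     ; cdata instance
        (make-comment "a comment")))                    ; misc: comment

(define all-valid? (andmap xexpr? examples))

;; An attribute value must be a string, so this is not an X-expression.
(define bad-attribute? (xexpr? '(a ((href 42)) "link")))
```

Here the two #f arguments to make-cdata stand in for the start and stop source locations.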
A contract that is like xexpr?, except that it produces a better error
message when the value is not an X-expression.
2 Reading and Writing XML
Reads in an XML document from the given or current input port. XML
documents contain exactly one element; xml-read:error is raised
if the input stream has zero elements or more than one element.
Malformed XML is reported with source locations in the form
‹l›.‹c›/‹o›, where ‹l›, ‹c›, and ‹o› are the line number, column
number, and next port position, respectively, as returned by
port-next-location.
Any non-characters other than eof read from the input-port
appear in the document content. Such special values may appear only
where XML content may. See make-input-port for information
about creating ports that return non-character values.
'(doc () (bold () "hi") " there!")
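The input that led to the result above is not shown here, so the following is a sketch that assumes an input string chosen to produce the same X-expression. It reads a document from a string port with read-xml and converts the root element with xml->xexpr; the names doc and result are illustrative.

```racket
#lang racket/base
(require xml)

;; Read a whole document from a string port (assumed sample input).
(define doc
  (read-xml (open-input-string "<doc><bold>hi</bold> there!</doc>")))

;; Convert the document's single root element to an X-expression.
(define result (xml->xexpr (document-element doc)))
```

With the default parameter settings, result should match the X-expression shown above.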
Like read-xml, except that the reader stops after the single element, rather than attempting to read "miscellaneous" XML content after the element. The document returned by read-xml/document
always has an empty document-misc.
Reads a single XML element from the port. The next non-whitespace
character read must start an XML element, but the input port can
contain other data after the element.
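A minimal sketch of reading one element from a port that holds leading whitespace and trailing data. The input string and the names in, item, and item-xexpr are assumptions made for this example.

```racket
#lang racket/base
(require xml)

;; The port holds leading whitespace, one element, and further data.
(define in (open-input-string "  <item>one</item> trailing text"))

;; Only the element is parsed; the data after it is not required to be XML.
(define item (read-xml/element in))
(define item-xexpr (xml->xexpr item))
```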
Writes a document to the given output port, currently ignoring
everything except the document’s root element.
Writes document content to the given output port.
Like write-xml, but newlines and indentation make the output
more readable, though less technically correct when whitespace is
significant.
Writes an X-expression to the given output port, without using an intermediate
XML document.
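As a sketch of writing an X-expression directly, the following captures the output of write-xexpr in a string. It relies on with-output-to-string from racket/port to redirect the current output port; the name written is illustrative.

```racket
#lang racket/base
(require xml racket/port)

;; write-xexpr writes to the current output port by default,
;; which with-output-to-string redirects into a string.
(define written
  (with-output-to-string
    (lambda ()
      (write-xexpr '(p ((class "note")) "a & b")))))
```

The ampersand in the string content is escaped on output, so written holds `<p class="note">a &amp; b</p>`.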
3 XML and X-expression Conversions
If this is set to non-false, then xml->xexpr will allow
non-XML objects, such as other structs, in the content of the converted XML
and leave them in place in the resulting “X-expression”.
Converts XML represented with a string into an X-expression.
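A short sketch of converting between strings and X-expressions with string->xexpr and xexpr->string. The sample input and the names parsed and printed are illustrative.

```racket
#lang racket/base
(require xml)

;; Parse XML text into an X-expression.
(define parsed (string->xexpr "<doc><bold>hi</bold> there!</doc>"))

;; Convert the X-expression back into XML text.
(define printed (xexpr->string parsed))
```

With the default parameter settings, every element in parsed carries an attribute list, even when it is empty.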
Some elements should not contain any text, only other tags, except
they often contain whitespace for formatting purposes. Given a list of
tag names as tags and the identity function as choose,
eliminate-whitespace produces a function
that filters out PCDATA consisting solely of whitespace from those
elements, and it raises an error if any non-whitespace text appears.
Passing in not as choose filters all elements which
are not named in the tags list. Using (lambda (x) #t) as choose
filters all elements regardless of the tags list.
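The following sketch applies eliminate-whitespace with the identity function to a ul element that contains only formatting whitespace around its child. The sample X-expression and the names strip-ul, messy, and cleaned are assumptions for illustration.

```racket
#lang racket/base
(require xml)

;; Filter whitespace-only PCDATA from ul elements only.
(define strip-ul (eliminate-whitespace '(ul) (lambda (x) x)))

;; An element whose content includes whitespace used purely for layout.
(define messy (xexpr->xml '(ul "\n  " (li "a") "\n")))

;; Apply the filter and convert back to an X-expression for inspection.
(define cleaned (xml->xexpr (strip-ul messy)))
```

The text inside li is kept, because li is not named in the tag list.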
If v is an X-expression, the result is #t. Otherwise, exn:invalid-xexpr
is raised, with a message of the form “Expected ‹something›/”. The code
field of the exception is the part of v that caused the exception.
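A minimal sketch of validating X-expressions and catching the exception for an invalid one. The sample values and the names ok and failure are illustrative; the handler simply records the exception's code field.

```racket
#lang racket/base
(require xml)

;; A well-formed X-expression validates successfully.
(define ok (validate-xexpr '(p ((class "note")) "fine")))

;; The attribute value 42 is not a string, so validation fails.
;; The handler pairs a marker symbol with the offending part of the input.
(define failure
  (with-handlers ([exn:invalid-xexpr?
                   (lambda (e)
                     (list 'invalid (exn:invalid-xexpr-code e)))])
    (validate-xexpr '(p ((class 42)) "broken"))))
```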
Like validate-xexpr, except that success-k is called
on each valid leaf, and fail-k is called on invalid leaves;
fail-k may return a value instead of raising an exception
or otherwise escaping. Results from the leaves are combined with
and to arrive at the final result.
4 Parameters
A parameter that determines whether output functions should use the
<‹tag›/> tag notation instead of <‹tag›></‹tag›>
for elements that have no content.
When the parameter is set to 'always, the abbreviated
notation is always used. When set to 'never, the abbreviated
notation is never generated. When a list of symbols is
provided, tags with names in the list are abbreviated. The default is
The abbreviated form is the preferred XML notation. However, most
browsers designed for HTML will only properly render XHTML if the
document uses a mixture of the two formats. The
html-empty-tags constant contains the W3 consortium’s
recommended list of XHTML tags that should use the shorthand.
<html><body bgcolor="red">Hi!<br />Bye!</body></html>
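The expression that produced the output above is not shown here. The sketch below assumes a page X-expression chosen to match that output and renders it under two settings of empty-tag-shorthand. The helper render-with and the names page, mixed, and never are hypothetical.

```racket
#lang racket/base
(require xml racket/port)

;; Assumed sample page; br has no content, body has content.
(define page '(html (body ((bgcolor "red")) "Hi!" (br) "Bye!")))

;; Render the page to a string under a given shorthand setting.
(define (render-with shorthand)
  (parameterize ([empty-tag-shorthand shorthand])
    (with-output-to-string
      (lambda ()
        (write-xml/content (xexpr->xml page))))))

;; Abbreviate only the tags listed in html-empty-tags.
(define mixed (render-with html-empty-tags))

;; Never abbreviate, so br is written with separate open and close tags.
(define never (render-with 'never))
```

Because the parameter is set with parameterize, the setting applies only while the page is being written.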
A parameter that controls whether consecutive whitespace is replaced
by a single space. CDATA sections are not affected. The default is
A parameter that determines whether comments are preserved or
discarded when reading XML. The default is #f, which
means that comments are discarded.
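As a sketch of the effect of read-comments, the following parses the same string with the default setting and with the parameter set to #t. The input string and the names src, without-comments, and with-comments are illustrative.

```racket
#lang racket/base
(require xml)

(define src "<a><!-- c -->x</a>")

;; Default setting: the comment does not appear in the result.
(define without-comments (string->xexpr src))

;; With read-comments enabled, the comment is kept as a comment structure.
(define with-comments
  (parameterize ([read-comments #t])
    (string->xexpr src)))
```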
Controls whether xml->xexpr drops or preserves attribute
sections for an element that has no attributes. The default is
#f, which means that all generated X-expression
elements have an attributes list (even if it’s empty).
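A short sketch contrasting the two settings of xexpr-drop-empty-attributes on the same input. The input string and the names src, kept, and dropped are illustrative.

```racket
#lang racket/base
(require xml)

(define src "<a><b>x</b></a>")

;; Default (#f): every element keeps an attribute list, even when empty.
(define kept (string->xexpr src))

;; Set to #t: empty attribute lists are omitted from the result.
(define dropped
  (parameterize ([xexpr-drop-empty-attributes #t])
    (string->xexpr src)))
```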
5 PList Library
The xml/plist library provides the ability to read and
write XML documents that conform to the plist DTD, which is
used to store dictionaries of string–value associations. This format
is used by Mac OS X (both the operating system and its applications)
to store all kinds of data.
A plist dictionary is a value that could be created by an
expression matching the following dict-expr grammar:
  dict-expr  = (list 'dict assoc-pair ...)

  assoc-pair = (list 'assoc-pair string pl-value)

  pl-value   = string
             | (list 'true)
             | (list 'false)
             | (list 'integer integer)
             | (list 'real real)
             | dict-expr
             | (list 'array pl-value ...)
> (define my-dict
    `(dict (assoc-pair "first-key"
                       "just a string with some whitespace")
           (assoc-pair "second-key"
                       (false))  ; value lost in extraction; (false) is an assumed placeholder
           (assoc-pair "third-key"
                       (dict))   ; value lost in extraction; (dict) is an assumed placeholder
           (assoc-pair "fourth-key"
                       (dict (assoc-pair "inner-key"
                                         (real 3.432))))
           (assoc-pair "fifth-key"
                       (array (integer 14)
                              "another string"))  ; any further items were lost in extraction
           (assoc-pair "sixth-key"
                       (array))))  ; value lost in extraction; (array) is an assumed placeholder
> (define-values (in out) (make-pipe))
> (write-plist my-dict out)
> (close-output-port out)
> (define new-dict (read-plist in))
> (equal? my-dict new-dict)
The XML generated by write-plist in the above example
looks like the following, if re-formatted by:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist SYSTEM
<string>just a string with some whitespace</string>
6 Simple X-expression Path Queries
This library provides a simple path query library for X-expressions.
A sequence of symbols followed by an optional keyword.
The prefix of symbols specifies a path of tags from the leaves with an implicit any sequence to the root. The final, optional keyword specifies an attribute.
Returns a list of all values specified by the path p in the X-expression xe.
> (define some-page
    '(html (body (p ([class "awesome"]) "Hey") (p "Bar"))))
> (se-path*/list '(p) some-page)
> (se-path* '(p) some-page)
> (se-path* '(p #:class) some-page)
> (se-path*/list '(body) some-page)
'((p ((class "awesome")) "Hey") (p "Bar"))
> (se-path*/list '() some-page)
'((html (body (p ((class "awesome")) "Hey") (p "Bar")))
  (body (p ((class "awesome")) "Hey") (p "Bar"))
  (p ((class "awesome")) "Hey")
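As a further sketch of path queries, the following collects attribute values with a path that ends in a keyword. It assumes the functions are provided by xml/path; the page X-expression and the names page, hrefs, and first-link-text are invented for this example.

```racket
#lang racket/base
(require xml/path)

;; Two links, one directly under body and one nested inside a paragraph.
(define page
  '(html (body (a ((href "/one")) "One")
               (p (a ((href "/two")) "Two")))))

;; The symbol selects a elements at any depth;
;; the trailing keyword selects their href attribute.
(define hrefs (se-path*/list '(a #:href) page))

;; se-path* returns only the first match for the path.
(define first-link-text (se-path* '(a) page))
```

Because the path has an implicit any-sequence prefix, both links are expected to match regardless of nesting depth.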