Version: 5.2
HTML: Parsing Library
The html library provides functions to read HTML documents and
structures to represent them.
(read-xhtml port) → html?
  port : input-port?
(read-html port) → html?
  port : input-port?
Reads (X)HTML from a port, producing an html instance.
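As a minimal sketch of the reader described above, the following reads a small document from a string port. The names in and doc and the sample markup are illustrative only, and the h: prefix is an assumption made here to sidestep name clashes with racket/base.

```racket
#lang racket/base
;; Prefix the html bindings to avoid clashes with racket/base names.
(require (prefix-in h: html))

;; A small in-memory document; any input port would work here.
(define in
  (open-input-string
   "<html><head><title>Demo</title></head><body><p>Hi</p></body></html>"))

(define doc (h:read-html in))

;; The result is an html instance, which is also an html-full,
;; so its children are available through html-full-content.
(h:html? doc)
(h:html-full? doc)
(length (h:html-full-content doc))
```

The same call works with h:read-xhtml, which has the same signature.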
(read-html-as-xml port) → (listof content/c)
  port : input-port?
Reads HTML from a port, producing an X-expression compatible with the
xml library (which defines content/c).
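The sketch below shows one way the result might be handed to the xml library. It assumes each item in the returned list can be passed to xml->xexpr, which the xml library provides for content values; the name contents and the sample markup are illustrative.

```racket
#lang racket/base
(require (prefix-in h: html)
         (prefix-in x: xml))

(define contents
  (h:read-html-as-xml
   (open-input-string "<p>Hello <b>world</b></p>")))

;; Each item should satisfy the xml library's content/c contract;
;; xml->xexpr converts one item into a list-based X-expression.
(for ([c (in-list contents)])
  (printf "~s\n" (x:xml->xexpr c)))
```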
(read-html-comments) → boolean?
(read-html-comments v) → void?
  v : any/c
If v is not #f, then comments are read and returned. Defaults to #f.
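Because read-html-comments has the two-form signature of a parameter, it can be set for a single read with parameterize. The following is a sketch under that assumption; src, without-comments, and with-comments are illustrative names.

```racket
#lang racket/base
(require (prefix-in h: html))

(define src "<p>text<!-- a note --></p>")

;; Read with the parameter left at its current value.
(define without-comments
  (h:read-html-as-xml (open-input-string src)))

;; Inside parameterize the parameter is true only for this read;
;; its previous value is restored afterwards.
(define with-comments
  (parameterize ([h:read-html-comments #t])
    (h:read-html-as-xml (open-input-string src))))
```

Using parameterize rather than calling (read-html-comments #t) directly keeps the change local to one read.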
(use-html-spec) → boolean?
(use-html-spec v) → void?
  v : any/c
If v is not #f, then the HTML must respect the HTML specification
with regard to which elements may be children of other elements.
For example, the top-level "<html>" element may contain only a
"<head>" and a "<body>" element. Defaults to #f.
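A short sketch of enabling the check for one read, assuming use-html-spec behaves as an ordinary parameter. The input is chosen so that "<html>" contains only "<head>" and "<body>"; src and doc are illustrative names.

```racket
#lang racket/base
(require (prefix-in h: html))

(define src
  "<html><head><title>T</title></head><body><p>x</p></body></html>")

;; Request specification-conforming nesting for this read only.
(define doc
  (parameterize ([h:use-html-spec #t])
    (h:read-html (open-input-string src))))

(h:html? doc)
```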
1 Example
(module html-example racket

  ; Some of the symbols in html and xml conflict with
  ; each other and with racket/base language, so we prefix
  ; to avoid namespace conflict.
  (require (prefix-in h: html)
           (prefix-in x: xml))

  (define an-html
    (h:read-xhtml
     (open-input-string
      (string-append
       "<html><head><title>My title</title></head><body>"
       "<p>Hello world</p><p><b>Testing</b>!</p>"
       "</body></html>"))))

  ; extract-pcdata: html-content/c -> (listof string)
  ; Pulls out the pcdata strings from some-content.
  (define (extract-pcdata some-content)
    (cond [(x:pcdata? some-content)
           (list (x:pcdata-string some-content))]
          [(x:entity? some-content)
           (list)]
          [else
           (extract-pcdata-from-element some-content)]))

  ; extract-pcdata-from-element: html-element -> (listof string)
  ; Pulls out the pcdata strings from an-html-element.
  (define (extract-pcdata-from-element an-html-element)
    (match an-html-element
      [(struct h:html-full (attributes content))
       (apply append (map extract-pcdata content))]
      [(struct h:html-element (attributes))
       '()]))

  (printf "~s\n" (extract-pcdata an-html)))
> (require 'html-example)
("My title" "Hello world" "Testing" "!")
2 HTML Structures
pcdata, entity, and attribute are defined in the xml documentation.
An html-content/c is either an html-element, a pcdata, or an entity.
(struct html-element (attributes)
  #:extra-constructor-name make-html-element)
  attributes : (listof attribute)
Any of the structures below inherits from html-element.
(struct html-full html-element (content)
  #:extra-constructor-name make-html-full)
  content : (listof html-content/c)
Any HTML tag that may include content also inherits from
html-full without adding any additional fields.
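To illustrate the inheritance described above, this sketch builds one content-free element (br) and one element with content (strong) using the make- constructors named in the structure definitions. The #f source locations passed to make-pcdata and the names a-br and a-strong are assumptions for illustration.

```racket
#lang racket/base
(require (prefix-in h: html)
         (prefix-in x: xml))

;; br is a plain html-element, so it takes only an attribute list.
(define a-br (h:make-br '()))

;; strong inherits from html-full, so it takes attributes and content.
(define a-strong
  (h:make-strong '() (list (x:make-pcdata #f #f "Testing"))))

;; Both values are html-elements, but only a-strong is an html-full.
(h:html-element? a-br)
(h:html-full? a-br)
(h:html-full? a-strong)

;; Inherited accessors work on the subtype.
(map x:pcdata-string (h:html-full-content a-strong))
```

Code that walks a document can therefore match on html-full first and fall back to html-element, as the example module does.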
(struct mzscheme html-full () #:extra-constructor-name make-mzscheme)
A mzscheme is a special legacy value for the old documentation system.
A Contents-of-html is either
(struct center html-full () #:extra-constructor-name make-center)
(struct blockquote html-full () #:extra-constructor-name make-blockquote)
(struct iframe html-full () #:extra-constructor-name make-iframe)
(struct noframes html-full () #:extra-constructor-name make-noframes)
(struct noscript html-full () #:extra-constructor-name make-noscript)
(struct style html-full () #:extra-constructor-name make-style)
(struct script html-full () #:extra-constructor-name make-script)
(struct basefont html-element () #:extra-constructor-name make-basefont)
(struct br html-element () #:extra-constructor-name make-br)
(struct area html-element () #:extra-constructor-name make-area)
(struct alink html-element () #:extra-constructor-name make-alink)
(struct img html-element () #:extra-constructor-name make-img)
(struct param html-element () #:extra-constructor-name make-param)
(struct hr html-element () #:extra-constructor-name make-hr)
(struct input html-element () #:extra-constructor-name make-input)
(struct col html-element () #:extra-constructor-name make-col)
(struct isindex html-element () #:extra-constructor-name make-isindex)
(struct base html-element () #:extra-constructor-name make-base)
(struct meta html-element () #:extra-constructor-name make-meta)
(struct option html-full () #:extra-constructor-name make-option)
(struct textarea html-full () #:extra-constructor-name make-textarea)
(struct title html-full () #:extra-constructor-name make-title)
A Contents-of-head is either
A Contents-of-tr is either
(struct colgroup html-full () #:extra-constructor-name make-colgroup)
(struct thead html-full () #:extra-constructor-name make-thead)
(struct tfoot html-full () #:extra-constructor-name make-tfoot)
(struct tbody html-full () #:extra-constructor-name make-tbody)
(struct strike html-full () #:extra-constructor-name make-strike)
(struct small html-full () #:extra-constructor-name make-small)
(struct strong html-full () #:extra-constructor-name make-strong)
(struct acronym html-full () #:extra-constructor-name make-acronym)
(struct legend html-full () #:extra-constructor-name make-legend)
(struct caption html-full () #:extra-constructor-name make-caption)
(struct table html-full () #:extra-constructor-name make-table)
A Contents-of-table is either
(struct button html-full () #:extra-constructor-name make-button)
(struct fieldset html-full () #:extra-constructor-name make-fieldset)
A Contents-of-fieldset is either
G2
(struct optgroup html-full () #:extra-constructor-name make-optgroup)
(struct select html-full () #:extra-constructor-name make-select)
A Contents-of-select is either
(struct label html-full () #:extra-constructor-name make-label)
A Contents-of-dl is either
A Contents-of-pre is either
G9
G11
(struct object html-full () #:extra-constructor-name make-object)
(struct applet html-full () #:extra-constructor-name make-applet)
A Contents-of-object-applet is either
G2
A Contents-of-map is either
A Contents-of-a is either
G7
(struct address html-full () #:extra-constructor-name make-address)
A Contents-of-address is either
G5
A Contents-of-body is either
A G12 is either
A G11 is either
A G10 is either
A G9 is either
A G8 is either
A G7 is either
G8
G12
A G6 is either
G7
A G5 is either
G6
A G4 is either
G8
G10
A G3 is either
A G2 is either
G3