6.4
HTML: Parsing Library
The html library provides
functions to read HTML documents and structures to represent them.
procedure
(read-xhtml port) → html?
port : input-port?
procedure
(read-html port) → html?
port : input-port?
Reads (X)HTML from a port, producing an html instance.
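As a minimal sketch of the reading step, the snippet below parses a document held in a string port with read-xhtml. The name sample-markup and its contents are made up for illustration, and the h: prefix is only one way to keep the html bindings apart from racket/base.

```racket
#lang racket/base
; Prefix the html bindings, since some of them clash with racket/base.
(require (prefix-in h: html))

; sample-markup is a hypothetical document used only for illustration.
(define sample-markup
  "<html><head><title>Demo</title></head><body><p>Hi</p></body></html>")

; Any input port works; a string port keeps the sketch self-contained.
(define doc (h:read-xhtml (open-input-string sample-markup)))

; The result should satisfy the html? predicate named in the contract.
(h:html? doc)
```

A file can be read the same way by passing the port obtained from call-with-input-file.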
procedure
(read-html-as-xml port) → (listof content/c)
port : input-port?
Reads HTML from a port, producing a list of XML content, each of which could be
turned into an X-expression, if necessary, with xml->xexpr.
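The conversion mentioned above might look like the following sketch. The fragment string is an assumption chosen for illustration, and only-in is used here simply to sidestep the name clashes between html and xml.

```racket
#lang racket/base
; Import just the two functions needed, avoiding conflicting names.
(require (only-in html read-html-as-xml)
         (only-in xml xml->xexpr))

; fragment is a hypothetical piece of markup used only for illustration.
(define fragment "<p>Hello <b>world</b></p>")

; Read the markup as a list of XML content values.
(define contents (read-html-as-xml (open-input-string fragment)))

; Convert each content value to an X-expression.
(define xexprs (map xml->xexpr contents))

; Print each X-expression on its own line.
(for ([x (in-list xexprs)])
  (printf "~s\n" x))
```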
parameter
(read-html-comments v) → void?
v : any/c
If v is not #f, then comments are read and returned. Defaults to #f.
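One way to enable comment reading for a single call is parameterize, as in the sketch below. The markup string and the helper read-contents are hypothetical names introduced for illustration; the sketch reads the same markup twice so the two results can be inspected side by side.

```racket
#lang racket/base
(require (only-in html read-html-as-xml read-html-comments))

; markup is a hypothetical input containing one comment.
(define markup "<p>text</p><!-- a note -->")

; read-contents: boolean -> (listof content/c)
; Reads markup with read-html-comments set to keep-comments?
; for the duration of the call only.
(define (read-contents keep-comments?)
  (parameterize ([read-html-comments keep-comments?])
    (read-html-as-xml (open-input-string markup))))

(define without-comments (read-contents #f))
(define with-comments (read-contents #t))
```

Using parameterize rather than calling (read-html-comments #t) directly keeps the setting from leaking into later reads.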
parameter
(use-html-spec v) → void?
v : any/c
If v is not #f, then the HTML must respect the HTML specification
with regard to which elements are allowed to be the children of
other elements. For example, the top-level "<html>"
element may contain only a "<head>" and a "<body>"
element. Defaults to #t.
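The parameter can be set for one read with parameterize, as in this sketch. The markup string and the helper read-with-spec are hypothetical, and the sketch assumes the setting is consulted by read-html-as-xml; it only shows how to toggle the parameter and does not compare the two results.

```racket
#lang racket/base
(require (only-in html read-html-as-xml use-html-spec))

; markup is a hypothetical, conforming document used for illustration.
(define markup
  "<html><head><title>T</title></head><body><p>Hi</p></body></html>")

; read-with-spec: boolean -> (listof content/c)
; Reads markup with use-html-spec set to strict? for this call only.
(define (read-with-spec strict?)
  (parameterize ([use-html-spec strict?])
    (read-html-as-xml (open-input-string markup))))

(define strict-result (read-with-spec #t))
(define lenient-result (read-with-spec #f))
```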
1 Example
(module html-example racket

  ; Some of the symbols in html and xml conflict with
  ; each other and with racket/base language, so we prefix
  ; to avoid namespace conflict.
  (require (prefix-in h: html)
           (prefix-in x: xml))

  (define an-html
    (h:read-xhtml
     (open-input-string
      (string-append
       "<html><head><title>My title</title></head><body>"
       "<p>Hello world</p><p><b>Testing</b>!</p>"
       "</body></html>"))))

  ; extract-pcdata: html-content/c -> (listof string)
  ; Pulls out the pcdata strings from some-content.
  (define (extract-pcdata some-content)
    (cond [(x:pcdata? some-content)
           (list (x:pcdata-string some-content))]
          [(x:entity? some-content)
           (list)]
          [else
           (extract-pcdata-from-element some-content)]))

  ; extract-pcdata-from-element: html-element -> (listof string)
  ; Pulls out the pcdata strings from an-html-element.
  (define (extract-pcdata-from-element an-html-element)
    (match an-html-element
      [(struct h:html-full (attributes content))
       (apply append (map extract-pcdata content))]
      [(struct h:html-element (attributes))
       '()]))

  (printf "~s\n" (extract-pcdata an-html)))
> (require 'html-example)
("My title" "Hello world" "Testing" "!")
2 HTML Structures
pcdata, entity, and attribute are defined in the xml documentation.
value
A html-content/c is either
struct
(struct html-element (attributes) #:extra-constructor-name make-html-element)
attributes : (listof attribute)
Any of the structures below inherits from html-element.
struct
(struct html-full html-element (content) #:extra-constructor-name make-html-full)
content : (listof html-content/c)
Any HTML tag that may include content also inherits from
html-full without adding any additional fields.
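Because every tag structure is either an html-full or a plain html-element, a traversal can be written against these two types alone. The sketch below counts the elements in a parsed document; count-elements and the sample markup are hypothetical names introduced for illustration.

```racket
#lang racket/base
(require (prefix-in h: html))

; count-elements: html-content/c -> exact-nonnegative-integer
; Counts some-content itself plus every element nested inside it.
(define (count-elements some-content)
  (cond
    ; Elements with content: count this one, then recur on the children.
    [(h:html-full? some-content)
     (for/fold ([n 1])
               ([child (in-list (h:html-full-content some-content))])
       (+ n (count-elements child)))]
    ; Elements without content, such as br or img.
    [(h:html-element? some-content) 1]
    ; pcdata and entity values are not elements.
    [else 0]))

; A hypothetical document used only for illustration.
(define doc
  (h:read-xhtml
   (open-input-string
    "<html><head><title>T</title></head><body><p>Hi</p></body></html>")))

(count-elements doc)
```

The html-full test must come first, since an html-full value also satisfies html-element?.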
struct
(struct mzscheme html-full () #:extra-constructor-name make-mzscheme)
A mzscheme is a special legacy value for the old documentation system.
A Contents-of-html is either
struct
(struct center html-full () #:extra-constructor-name make-center)
struct
(struct blockquote html-full () #:extra-constructor-name make-blockquote)
struct
(struct iframe html-full () #:extra-constructor-name make-iframe)
struct
(struct noframes html-full () #:extra-constructor-name make-noframes)
struct
(struct noscript html-full () #:extra-constructor-name make-noscript)
struct
(struct style html-full () #:extra-constructor-name make-style)
struct
(struct script html-full () #:extra-constructor-name make-script)
struct
(struct basefont html-element () #:extra-constructor-name make-basefont)
struct
(struct br html-element () #:extra-constructor-name make-br)
struct
(struct area html-element () #:extra-constructor-name make-area)
struct
(struct alink html-element () #:extra-constructor-name make-alink)
struct
(struct img html-element () #:extra-constructor-name make-img)
struct
(struct param html-element () #:extra-constructor-name make-param)
struct
(struct hr html-element () #:extra-constructor-name make-hr)
struct
(struct input html-element () #:extra-constructor-name make-input)
struct
(struct col html-element () #:extra-constructor-name make-col)
struct
(struct isindex html-element () #:extra-constructor-name make-isindex)
struct
(struct base html-element () #:extra-constructor-name make-base)
struct
(struct meta html-element () #:extra-constructor-name make-meta)
struct
(struct option html-full () #:extra-constructor-name make-option)
struct
(struct textarea html-full () #:extra-constructor-name make-textarea)
struct
(struct title html-full () #:extra-constructor-name make-title)
A Contents-of-head is either
A Contents-of-tr is either
struct
(struct colgroup html-full () #:extra-constructor-name make-colgroup)
struct
(struct thead html-full () #:extra-constructor-name make-thead)
struct
(struct tfoot html-full () #:extra-constructor-name make-tfoot)
struct
(struct tbody html-full () #:extra-constructor-name make-tbody)
struct
(struct strike html-full () #:extra-constructor-name make-strike)
struct
(struct small html-full () #:extra-constructor-name make-small)
struct
(struct strong html-full () #:extra-constructor-name make-strong)
struct
(struct acronym html-full () #:extra-constructor-name make-acronym)
struct
(struct legend html-full () #:extra-constructor-name make-legend)
struct
(struct caption html-full () #:extra-constructor-name make-caption)
struct
(struct table html-full () #:extra-constructor-name make-table)
A Contents-of-table is either
struct
(struct button html-full () #:extra-constructor-name make-button)
struct
(struct fieldset html-full () #:extra-constructor-name make-fieldset)
A Contents-of-fieldset is either
G2
struct
(struct optgroup html-full () #:extra-constructor-name make-optgroup)
struct
(struct select html-full () #:extra-constructor-name make-select)
A Contents-of-select is either
struct
(struct label html-full () #:extra-constructor-name make-label)
A Contents-of-dl is either
A Contents-of-pre is either
G9
G11
struct
(struct object html-full () #:extra-constructor-name make-object)
struct
(struct applet html-full () #:extra-constructor-name make-applet)
A Contents-of-object-applet is either
G2
A Contents-of-map is either
A Contents-of-a is either
G7
struct
(struct address html-full () #:extra-constructor-name make-address)
A Contents-of-address is either
G5
A Contents-of-body is either
A G12 is either
A G11 is either
A G10 is either
A G9 is either
A G8 is either
A G7 is either
G8
G12
A G6 is either
G7
A G5 is either
G6
A G4 is either
G8
G10
A G3 is either
A G2 is either
G3