8.15
HTML: Parsing Library
The html library provides functions
to read conformant HTML4 documents and structures to represent
them. Since html assumes documents are conformant and
is restricted to the older HTML4 specification, it should be viewed as a legacy
library. We suggest using the html-parsing package for modern
Web scraping.
procedure
(read-xhtml port) → html?
port : input-port?
procedure
(read-html port) → html?
port : input-port?
Reads (X)HTML from a port, producing an html instance.
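As a minimal sketch of what the returned instance looks like, the following reads a small document from a string port and inspects the top-level structure. The sample markup and the name doc are illustrative assumptions; the accessors used come from the html-element and html-full structure definitions in this library.

```racket
#lang racket/base
;; Prefix the html bindings, since several of them (for example
;; html, title, and table) are common names that clash easily.
(require (prefix-in h: html))

;; doc is a hypothetical sample document, not part of the library.
(define doc
  (h:read-xhtml
   (open-input-string
    (string-append
     "<html><head><title>Sample</title></head>"
     "<body><p>Hello</p></body></html>"))))

;; The reader's contract promises an html instance.
(h:html? doc)

;; Every element carries a list of attributes ...
(h:html-element-attributes doc)

;; ... and elements that may have children carry a content list.
(length (h:html-full-content doc))
```

The printed values depend on how the reader normalizes the input, so they are not listed here.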
procedure
(read-html-as-xml port) → (listof content/c)
port : input-port?
Reads HTML from a port, producing a list of XML content, each of which could be
turned into an X-expression, if necessary, with xml->xexpr.
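The conversion described above can be sketched as follows. The sample fragment and the names contents and xexprs are assumptions made for illustration; xml->xexpr is provided by the xml library.

```racket
#lang racket/base
(require (prefix-in h: html)
         (prefix-in x: xml))

;; contents : (listof content/c), read from a hypothetical fragment.
(define contents
  (h:read-html-as-xml
   (open-input-string "<p>Hello <b>world</b></p>")))

;; Convert each piece of XML content to an X-expression.
(define xexprs (map x:xml->xexpr contents))

(for ([xe (in-list xexprs)])
  (printf "~s\n" xe))
```

Working with X-expressions is often more convenient than working with the structures directly, because they can be taken apart with ordinary list operations or match.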
parameter
(read-html-comments v) → void?
v : any/c
If v is not #f, then comments are read and returned. Defaults to #f.
parameter
(use-html-spec v) → void?
v : any/c
If v is not #f, then the HTML must respect the HTML specification
with regard to which elements are allowed to be the children of
other elements. For example, the top-level "<html>"
element may only contain a "<body>" and a "<head>"
element. Defaults to #t.
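Because read-html-comments and use-html-spec are parameters, a natural way to change them for a single read is parameterize, which restores the previous value afterwards. The sketch below assumes that read-html-as-xml consults both parameters and that comments, when kept, appear as comment structures from the xml library; the helpers parse and count-comments and the sample markup are hypothetical.

```racket
#lang racket/base
(require (prefix-in h: html)
         (prefix-in x: xml))

;; parse : string -> (listof content/c)
(define (parse str)
  (h:read-html-as-xml (open-input-string str)))

;; count-comments : content/c -> exact-nonnegative-integer
;; Counts comment nodes in one piece of XML content.
(define (count-comments content)
  (cond [(x:comment? content) 1]
        [(x:element? content)
         (apply + (map count-comments (x:element-content content)))]
        [else 0]))

(define markup "<p>text<!-- a note --></p>")

;; Default settings: comments are not read.
(define without-comments (parse markup))

;; Keep comments for this read only.
(define with-comments
  (parameterize ([h:read-html-comments #t])
    (parse markup)))

;; Do not enforce the parent/child rules of the specification
;; for this read only.
(define lenient
  (parameterize ([h:use-html-spec #f])
    (parse markup)))

(apply + (map count-comments without-comments))
(apply + (map count-comments with-comments))
```

Literal booleans are used here rather than arbitrary values, which stays within the documented any/c signature while also working if the implementation is stricter.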
1 Example
(module html-example racket

  ; Some of the symbols in html and xml conflict with
  ; each other and with racket/base language, so we prefix
  ; to avoid namespace conflict.
  (require (prefix-in h: html)
           (prefix-in x: xml))

  (define an-html
    (h:read-xhtml
     (open-input-string
      (string-append
       "<html><head><title>My title</title></head><body>"
       "<p>Hello world</p><p><b>Testing</b>!</p>"
       "</body></html>"))))

  ; extract-pcdata: html-content/c -> (listof string)
  ; Pulls out the pcdata strings from some-content.
  (define (extract-pcdata some-content)
    (cond [(x:pcdata? some-content)
           (list (x:pcdata-string some-content))]
          [(x:entity? some-content)
           (list)]
          [else
           (extract-pcdata-from-element some-content)]))

  ; extract-pcdata-from-element: html-element -> (listof string)
  ; Pulls out the pcdata strings from an-html-element.
  (define (extract-pcdata-from-element an-html-element)
    (match an-html-element
      [(struct h:html-full (attributes content))
       (apply append (map extract-pcdata content))]
      [(struct h:html-element (attributes))
       '()]))

  (printf "~s\n" (extract-pcdata an-html)))
> (require 'html-example)
("My title" "Hello world" "Testing" "!")
2 HTML Structures
pcdata, entity, and attribute are defined in the xml documentation.
value
html-content/c : contract?
A html-content/c is either an html-element, a pcdata, or an entity.
struct
(struct html-element (attributes) #:extra-constructor-name make-html-element)
attributes : (listof attribute)
Any of the structures below inherits from html-element.
struct
(struct html-full html-element (content) #:extra-constructor-name make-html-full)
content : (listof html-content/c)
Any HTML tag that may include content also inherits from
html-full without adding any additional fields.
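Since every element structure inherits from html-element, and every element with children inherits from html-full, a generic traversal only needs those two predicates. The following is a sketch under that assumption; attribute-values, collect-attribute, and the sample document are hypothetical names introduced for illustration, while attribute-name and attribute-value come from the xml library.

```racket
#lang racket/base
(require (prefix-in h: html)
         (prefix-in x: xml))

;; attribute-values : symbol html-element -> list
;; Values of the attributes of elem whose name is `name`.
(define (attribute-values name elem)
  (for/list ([attr (in-list (h:html-element-attributes elem))]
             #:when (eq? (x:attribute-name attr) name))
    (x:attribute-value attr)))

;; collect-attribute : symbol html-content/c -> list
;; Gathers the values of `name` from content and all its descendants.
(define (collect-attribute name content)
  (cond
    ;; Elements with children: check this element, then recur.
    [(h:html-full? content)
     (append (attribute-values name content)
             (apply append
                    (map (lambda (c) (collect-attribute name c))
                         (h:html-full-content content))))]
    ;; Elements without children: only their own attributes.
    [(h:html-element? content) (attribute-values name content)]
    ;; pcdata and entity carry no attributes.
    [else '()]))

(define doc
  (h:read-xhtml
   (open-input-string
    (string-append
     "<html><head><title>Links</title></head><body>"
     "<p><a href=\"a.html\">A</a> and <a href=\"b.html\">B</a></p>"
     "</body></html>"))))

(collect-attribute 'href doc)
```

The html-full? test must come before the html-element? test, because every html-full instance also satisfies html-element?.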
struct
(struct mzscheme html-full () #:extra-constructor-name make-mzscheme)
A mzscheme is a special legacy value for the old documentation system.
A Contents-of-html is either
struct
(struct center html-full () #:extra-constructor-name make-center)
struct
(struct blockquote html-full () #:extra-constructor-name make-blockquote)
struct
(struct iframe html-full () #:extra-constructor-name make-iframe)
struct
(struct noframes html-full () #:extra-constructor-name make-noframes)
struct
(struct noscript html-full () #:extra-constructor-name make-noscript)
struct
(struct style html-full () #:extra-constructor-name make-style)
struct
(struct script html-full () #:extra-constructor-name make-script)
struct
(struct basefont html-element () #:extra-constructor-name make-basefont)
struct
(struct br html-element () #:extra-constructor-name make-br)
struct
(struct area html-element () #:extra-constructor-name make-area)
struct
(struct alink html-element () #:extra-constructor-name make-alink)
struct
(struct img html-element () #:extra-constructor-name make-img)
struct
(struct param html-element () #:extra-constructor-name make-param)
struct
(struct hr html-element () #:extra-constructor-name make-hr)
struct
(struct input html-element () #:extra-constructor-name make-input)
struct
(struct col html-element () #:extra-constructor-name make-col)
struct
(struct isindex html-element () #:extra-constructor-name make-isindex)
struct
(struct base html-element () #:extra-constructor-name make-base)
struct
(struct meta html-element () #:extra-constructor-name make-meta)
struct
(struct option html-full () #:extra-constructor-name make-option)
struct
(struct textarea html-full () #:extra-constructor-name make-textarea)
struct
(struct title html-full () #:extra-constructor-name make-title)
A Contents-of-head is either
A Contents-of-tr is either
struct
(struct colgroup html-full () #:extra-constructor-name make-colgroup)
struct
(struct thead html-full () #:extra-constructor-name make-thead)
struct
(struct tfoot html-full () #:extra-constructor-name make-tfoot)
struct
(struct tbody html-full () #:extra-constructor-name make-tbody)
struct
(struct strike html-full () #:extra-constructor-name make-strike)
struct
(struct small html-full () #:extra-constructor-name make-small)
struct
(struct strong html-full () #:extra-constructor-name make-strong)
struct
(struct acronym html-full () #:extra-constructor-name make-acronym)
struct
(struct legend html-full () #:extra-constructor-name make-legend)
struct
(struct caption html-full () #:extra-constructor-name make-caption)
struct
(struct table html-full () #:extra-constructor-name make-table)
A Contents-of-table is either
struct
(struct button html-full () #:extra-constructor-name make-button)
struct
(struct fieldset html-full () #:extra-constructor-name make-fieldset)
A Contents-of-fieldset is either
G2
struct
(struct optgroup html-full () #:extra-constructor-name make-optgroup)
struct
(struct select html-full () #:extra-constructor-name make-select)
A Contents-of-select is either
struct
(struct label html-full () #:extra-constructor-name make-label)
A Contents-of-dl is either
A Contents-of-pre is either
G9
G11
struct
(struct object html-full () #:extra-constructor-name make-object)
struct
(struct applet html-full () #:extra-constructor-name make-applet)
A Contents-of-object-applet is either
G2
A Contents-of-map is either
A Contents-of-a is either
G7
struct
(struct address html-full () #:extra-constructor-name make-address)
A Contents-of-address is either
G5
A Contents-of-body is either
A G12 is either
A G11 is either
A G10 is either
A G9 is either
A G8 is either
A G7 is either
G8
G12
A G6 is either
G7
A G5 is either
G6
A G4 is either
G8
G10
A G3 is either
A G2 is either
G3