HTML: Parsing Library

 (require html) package: html-lib
The html library provides functions to read conformant HTML4 documents and structures to represent them. Because html assumes that documents are conformant and is restricted to the older specification, it should be viewed as a legacy library. We suggest using the html-parsing package for modern Web scraping.

procedure

(read-xhtml port) → html?

  port : input-port?

procedure

(read-html port) → html?

  port : input-port?
Reads (X)HTML from a port, producing an html instance.
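
As a minimal sketch of the reader functions above, the following parses a small document from a string port with read-html and inspects the result. The h: prefix and the sample markup are choices made for this illustration and are not part of the library.

```racket
#lang racket/base
;; Prefix the html bindings so they cannot clash with racket/base names.
(require (prefix-in h: html))

;; Sample markup invented for this sketch.
(define sample
  "<html><head><title>T</title></head><body><p>Hi</p></body></html>")

;; read-html consumes an input port and produces an html structure.
(define doc (h:read-html (open-input-string sample)))

(h:html? doc)                       ; checks the documented result contract
(length (h:html-full-content doc))  ; number of top-level children
```

read-xhtml has the same signature, so it can be substituted for read-html in this sketch.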

procedure

(read-html-as-xml port) → (listof content/c)

  port : input-port?
Reads HTML from a port, producing a list of XML content, each of which could be turned into an X-expression, if necessary, with xml->xexpr.
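
The conversion described above might look like the following sketch, which reads a fragment with read-html-as-xml and maps xml->xexpr over the result. The h: and x: prefixes and the sample fragment are assumptions made for illustration.

```racket
#lang racket/base
(require (prefix-in h: html)
         (prefix-in x: xml))

;; Read a fragment as a list of XML content values.
(define contents
  (h:read-html-as-xml
   (open-input-string "<p>Hello <b>world</b></p>")))

;; Convert each piece of XML content to an X-expression.
(define xexprs (map x:xml->xexpr contents))

;; Print each X-expression on its own line.
(for ([xe (in-list xexprs)])
  (printf "~s\n" xe))
```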

parameter

(read-html-comments) → boolean?

(read-html-comments v) → void?
  v : any/c
If v is not #f, then comments are read and returned. Defaults to #f.
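
Because read-html-comments is a parameter, one way to enable it for a single read is parameterize. The sketch below assumes a sample fragment containing a comment and reads it once with the default setting and once with comments enabled; the names sample, without-comments, and with-comments are hypothetical.

```racket
#lang racket/base
(require (prefix-in h: html))

;; Sample fragment invented for this sketch.
(define sample "<p>Hi<!-- a note --></p>")

;; Read with the default setting (#f, per the description above).
(define without-comments
  (h:read-html-as-xml (open-input-string sample)))

;; parameterize limits the change to this one read.
(define with-comments
  (parameterize ([h:read-html-comments #t])
    (h:read-html-as-xml (open-input-string sample))))
```

Using parameterize rather than calling (read-html-comments #t) directly keeps the setting from leaking into later reads.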

parameter

(use-html-spec) → boolean?

(use-html-spec v) → void?
  v : any/c
If v is not #f, then the HTML must respect the HTML specification with regard to which elements are allowed to be the children of other elements. For example, the top-level "<html>" element may only contain a "<head>" and a "<body>" element. Defaults to #t.
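
To compare the two settings, a sketch along the following lines reads the same markup with and without the specification check. The sample string, which puts a "<p>" directly under "<html>", and the names strict and lenient are assumptions made for illustration; how the reader restructures such input is not specified here, so no result is shown.

```racket
#lang racket/base
(require (prefix-in h: html))

;; Markup that places a <p> directly under <html>, which the
;; rule described above does not allow.
(define sample "<html><p>stray paragraph</p></html>")

;; Read with the default setting (#t).
(define strict
  (h:read-html-as-xml (open-input-string sample)))

;; Read again with the specification check turned off.
(define lenient
  (parameterize ([h:use-html-spec #f])
    (h:read-html-as-xml (open-input-string sample))))
```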

1 Example

(module html-example racket
 
  ; Some of the symbols in html and xml conflict with
  ; each other and with the racket language, so we prefix
  ; them to avoid namespace conflicts.
  (require (prefix-in h: html)
           (prefix-in x: xml))
 
  (define an-html
    (h:read-xhtml
     (open-input-string
      (string-append
       "<html><head><title>My title</title></head><body>"
       "<p>Hello world</p><p><b>Testing</b>!</p>"
       "</body></html>"))))
 
  ; extract-pcdata: html-content/c -> (listof string)
  ; Pulls out the pcdata strings from some-content.
  (define (extract-pcdata some-content)
    (cond [(x:pcdata? some-content)
           (list (x:pcdata-string some-content))]
          [(x:entity? some-content)
           (list)]
          [else
           (extract-pcdata-from-element some-content)]))
 
  ; extract-pcdata-from-element: html-element -> (listof string)
  ; Pulls out the pcdata strings from an-html-element.
  (define (extract-pcdata-from-element an-html-element)
    (match an-html-element
      [(struct h:html-full (attributes content))
       (apply append (map extract-pcdata content))]
 
      [(struct h:html-element (attributes))
       '()]))
 
  (printf "~s\n" (extract-pcdata an-html)))

 

> (require 'html-example)

("My title" "Hello world" "Testing" "!")

2 HTML Structures

pcdata, entity, and attribute are defined in the xml documentation.

An html-content/c is either
  • html-element

  • pcdata

  • entity

struct

(struct html-element (attributes)
    #:extra-constructor-name make-html-element)
  attributes : (listof attribute)
Any of the structures below inherits from html-element.

struct

(struct html-full html-element (content)
    #:extra-constructor-name make-html-full)
  content : (listof html-content/c)
Any html tag that may include content also inherits from html-full without adding any additional fields.
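
The relationship between html-element and html-full can be illustrated by building two values by hand. This is a sketch only: the values a-br and a-p are hypothetical, and the pcdata is constructed with #f for its source locations.

```racket
#lang racket/base
(require (prefix-in h: html)
         (prefix-in x: xml))

;; A <br> carries only attributes; a <p> also carries content.
(define a-br (h:make-br '()))
(define a-p
  (h:make-p '() (list (x:make-pcdata #f #f "Hello") a-br)))

(h:html-element? a-br)              ; #t: br inherits from html-element
(h:html-full? a-br)                 ; #f: br has no content field
(h:html-full? a-p)                  ; #t: p inherits from html-full
(h:html-element-attributes a-p)     ; the attribute list passed in
(length (h:html-full-content a-p))  ; 2: the pcdata and the br
```

Code that walks a document can therefore test html-full? first and treat any remaining html-element as a leaf, as the example in section 1 does with match.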

struct

(struct mzscheme html-full ()
    #:extra-constructor-name make-mzscheme)
A mzscheme is a special legacy value for the old documentation system.

struct

(struct html html-full ()
    #:extra-constructor-name make-html)
An html is (make-html (listof attribute) (listof Contents-of-html))

A Contents-of-html is either

struct

(struct div html-full ()
    #:extra-constructor-name make-div)

struct

(struct center html-full ()
    #:extra-constructor-name make-center)

struct

(struct blockquote html-full ()
    #:extra-constructor-name make-blockquote)

struct

(struct ins html-full ()
    #:extra-constructor-name make-ins)
An ins is (make-ins (listof attribute) (listof G2))

struct

(struct del html-full ()
    #:extra-constructor-name make-del)

struct

(struct dd html-full ()
    #:extra-constructor-name make-dd)

struct

(struct li html-full ()
    #:extra-constructor-name make-li)

struct

(struct th html-full ()
    #:extra-constructor-name make-th)

struct

(struct td html-full ()
    #:extra-constructor-name make-td)

struct

(struct iframe html-full ()
    #:extra-constructor-name make-iframe)

struct

(struct noframes html-full ()
    #:extra-constructor-name make-noframes)

struct

(struct noscript html-full ()
    #:extra-constructor-name make-noscript)

struct

(struct style html-full ()
    #:extra-constructor-name make-style)

struct

(struct script html-full ()
    #:extra-constructor-name make-script)

struct

(struct basefont html-element ()
    #:extra-constructor-name make-basefont)

struct

(struct br html-element ()
    #:extra-constructor-name make-br)

struct

(struct area html-element ()
    #:extra-constructor-name make-area)

struct

(struct alink html-element ()
    #:extra-constructor-name make-alink)

struct

(struct img html-element ()
    #:extra-constructor-name make-img)

struct

(struct param html-element ()
    #:extra-constructor-name make-param)

struct

(struct hr html-element ()
    #:extra-constructor-name make-hr)

struct

(struct input html-element ()
    #:extra-constructor-name make-input)

struct

(struct col html-element ()
    #:extra-constructor-name make-col)

struct

(struct isindex html-element ()
    #:extra-constructor-name make-isindex)

struct

(struct base html-element ()
    #:extra-constructor-name make-base)

struct

(struct meta html-element ()
    #:extra-constructor-name make-meta)

struct

(struct option html-full ()
    #:extra-constructor-name make-option)

struct

(struct textarea html-full ()
    #:extra-constructor-name make-textarea)

struct

(struct title html-full ()
    #:extra-constructor-name make-title)

struct

(struct head html-full ()
    #:extra-constructor-name make-head)
A head is (make-head (listof attribute) (listof Contents-of-head))

A Contents-of-head is either

struct

(struct tr html-full ()
    #:extra-constructor-name make-tr)
A tr is (make-tr (listof attribute) (listof Contents-of-tr))

A Contents-of-tr is either

struct

(struct colgroup html-full ()
    #:extra-constructor-name make-colgroup)

struct

(struct thead html-full ()
    #:extra-constructor-name make-thead)

struct

(struct tfoot html-full ()
    #:extra-constructor-name make-tfoot)

struct

(struct tbody html-full ()
    #:extra-constructor-name make-tbody)

struct

(struct tt html-full ()
    #:extra-constructor-name make-tt)

struct

(struct i html-full ()
    #:extra-constructor-name make-i)
An i is (make-i (listof attribute) (listof G5))

struct

(struct b html-full ()
    #:extra-constructor-name make-b)

struct

(struct u html-full ()
    #:extra-constructor-name make-u)
A u is (make-u (listof attribute) (listof G5))

struct

(struct s html-full ()
    #:extra-constructor-name make-s)

struct

(struct strike html-full ()
    #:extra-constructor-name make-strike)

struct

(struct big html-full ()
    #:extra-constructor-name make-big)

struct

(struct small html-full ()
    #:extra-constructor-name make-small)

struct

(struct em html-full ()
    #:extra-constructor-name make-em)

struct

(struct strong html-full ()
    #:extra-constructor-name make-strong)

struct

(struct dfn html-full ()
    #:extra-constructor-name make-dfn)

struct

(struct code html-full ()
    #:extra-constructor-name make-code)

struct

(struct samp html-full ()
    #:extra-constructor-name make-samp)

struct

(struct kbd html-full ()
    #:extra-constructor-name make-kbd)

struct

(struct var html-full ()
    #:extra-constructor-name make-var)

struct

(struct cite html-full ()
    #:extra-constructor-name make-cite)

struct

(struct abbr html-full ()
    #:extra-constructor-name make-abbr)

struct

(struct acronym html-full ()
    #:extra-constructor-name make-acronym)

struct

(struct sub html-full ()
    #:extra-constructor-name make-sub)

struct

(struct sup html-full ()
    #:extra-constructor-name make-sup)

struct

(struct span html-full ()
    #:extra-constructor-name make-span)

struct

(struct bdo html-full ()
    #:extra-constructor-name make-bdo)

struct

(struct font html-full ()
    #:extra-constructor-name make-font)

struct

(struct p html-full ()
    #:extra-constructor-name make-p)

struct

(struct h1 html-full ()
    #:extra-constructor-name make-h1)

struct

(struct h2 html-full ()
    #:extra-constructor-name make-h2)

struct

(struct h3 html-full ()
    #:extra-constructor-name make-h3)

struct

(struct h4 html-full ()
    #:extra-constructor-name make-h4)

struct

(struct h5 html-full ()
    #:extra-constructor-name make-h5)

struct

(struct h6 html-full ()
    #:extra-constructor-name make-h6)

struct

(struct q html-full ()
    #:extra-constructor-name make-q)

struct

(struct dt html-full ()
    #:extra-constructor-name make-dt)

struct

(struct legend html-full ()
    #:extra-constructor-name make-legend)

struct

(struct caption html-full ()
    #:extra-constructor-name make-caption)

struct

(struct table html-full ()
    #:extra-constructor-name make-table)
A table is (make-table (listof attribute) (listof Contents-of-table))

A Contents-of-table is either

struct

(struct button html-full ()
    #:extra-constructor-name make-button)

struct

(struct fieldset html-full ()
    #:extra-constructor-name make-fieldset)
A fieldset is (make-fieldset (listof attribute) (listof Contents-of-fieldset))

A Contents-of-fieldset is either

struct

(struct optgroup html-full ()
    #:extra-constructor-name make-optgroup)

struct

(struct select html-full ()
    #:extra-constructor-name make-select)
A select is (make-select (listof attribute) (listof Contents-of-select))

A Contents-of-select is either

struct

(struct label html-full ()
    #:extra-constructor-name make-label)

struct

(struct form html-full ()
    #:extra-constructor-name make-form)

struct

(struct ol html-full ()
    #:extra-constructor-name make-ol)

struct

(struct ul html-full ()
    #:extra-constructor-name make-ul)

struct

(struct dir html-full ()
    #:extra-constructor-name make-dir)

struct

(struct menu html-full ()
    #:extra-constructor-name make-menu)

struct

(struct dl html-full ()
    #:extra-constructor-name make-dl)
A dl is (make-dl (listof attribute) (listof Contents-of-dl))

A Contents-of-dl is either

struct

(struct pre html-full ()
    #:extra-constructor-name make-pre)
A pre is (make-pre (listof attribute) (listof Contents-of-pre))

A Contents-of-pre is either
  • G9

  • G11

struct

(struct object html-full ()
    #:extra-constructor-name make-object)
An object is (make-object (listof attribute) (listof Contents-of-object-applet))

struct

(struct applet html-full ()
    #:extra-constructor-name make-applet)
An applet is (make-applet (listof attribute) (listof Contents-of-object-applet))

A Contents-of-object-applet is either

struct

(struct -map html-full ()
    #:extra-constructor-name make--map)
A -map is (make--map (listof attribute) (listof Contents-of-map))

A Contents-of-map is either

struct

(struct a html-full ()
    #:extra-constructor-name make-a)
An a is (make-a (listof attribute) (listof Contents-of-a))

A Contents-of-a is either

struct

(struct address html-full ()
    #:extra-constructor-name make-address)
An address is (make-address (listof attribute) (listof Contents-of-address))

A Contents-of-address is either

struct

(struct body html-full ()
    #:extra-constructor-name make-body)
A body is (make-body (listof attribute) (listof Contents-of-body))

A Contents-of-body is either

A G12 is either

A G11 is either

A G10 is either

A G9 is either

A G8 is either

A G7 is either
  • G8

  • G12

A G6 is either

A G5 is either

A G4 is either
  • G8

  • G10

A G3 is either

A G2 is either