1 URLs and HTTP
The net/url library provides utilities to parse and manipulate URIs,
as specified in RFC 2396 [RFC2396], and to use the HTTP protocol.
To access the text of a document from the web, first obtain its URL as
a string. Convert the address into a url structure using
string->url. Then, open the document using
get-pure-port or get-impure-port, depending on
whether or not you wish to examine its MIME headers. At this point,
you have a regular input port with which to process the document, as with
any other file.
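The steps above can be sketched as follows. The helper name fetch-text and the example address are assumptions made for illustration; the optional connect argument exists only so the sketch can be tried without a network connection.

```racket
#lang racket/base
(require net/url racket/port)

;; fetch-text is a hypothetical helper, not part of net/url.
;; connect defaults to get-pure-port; any procedure that maps a url
;; structure to an input port can stand in for it.
(define (fetch-text address [connect get-pure-port])
  (define u (string->url address))   ; string -> url structure
  (define in (connect u))            ; url structure -> input port
  (define text (port->string in))    ; read it like any other port
  (close-input-port in)
  text)

;; Example use (performs a network request):
;; (fetch-text "http://racket-lang.org/")
```

This sketch does not close the port if reading raises an exception.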
Currently the only supported protocols are "http",
"https", and sometimes "file".
1.1 URL Structure
The basic structure for all URLs is explained in RFC 3986 [RFC3986].
The following diagram illustrates the parts:
http://sky@www:801/cgi-bin/finger;xyz?name=shriram;host=nw#top
{-1}   {2} {3} {4}{---5---------} {6} {----7-------------} {8}

1 = scheme, 2 = user, 3 = host, 4 = port,
5 = path (two elements), 6 = param (of second path element),
7 = query, 8 = fragment
The strings inside the user, path, query,
and fragment fields are represented directly as Racket
strings, without URL-syntax-specific quoting. The procedures
string->url and url->string translate encodings such
as %20 into spaces and back again.
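As a small illustration, the URL from the diagram can be parsed and its fields inspected with the url structure accessors. The example.com address in the second half is a placeholder chosen for this sketch.

```racket
#lang racket/base
(require net/url)

(define u
  (string->url
   "http://sky@www:801/cgi-bin/finger;xyz?name=shriram;host=nw#top"))

(url-scheme u)                        ; "http"
(url-user u)                          ; "sky"
(url-host u)                          ; "www"
(url-port u)                          ; 801
(map path/param-path (url-path u))    ; '("cgi-bin" "finger")
(map path/param-param (url-path u))   ; '(() ("xyz"))
(url-query u)                         ; association list for the query
(url-fragment u)                      ; "top"

;; %20 is decoded when parsing and re-encoded when printing.
(define spaced (string->url "http://example.com/a%20b"))
(path/param-path (car (url-path spaced)))  ; "a b"
(url->string spaced)                       ; "http://example.com/a%20b"
```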
By default, query associations are parsed with either ; or
& as a separator, and they are generated with & as
a separator. The current-alist-separator-mode parameter
adjusts the behavior.
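A minimal sketch of the separator behavior, assuming that current-alist-separator-mode is obtained from net/uri-codec and that 'semi is one of its accepted modes. The example.com address is a placeholder.

```racket
#lang racket/base
(require net/url net/uri-codec)

(define u (string->url "http://example.com/q?a=1&b=2"))

(url-query u)   ; '((a . "1") (b . "2"))

;; Generate the query with ; instead of the default &.
(define with-semi
  (parameterize ([current-alist-separator-mode 'semi])
    (url->string u)))

with-semi       ; "http://example.com/q?a=1;b=2"
```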
An empty string at the end of the path list corresponds to a
URL that ends in a slash. For example, the result of
(string->url "http://racket-lang.org/a/") has a
path field with strings "a" and "", while
the result of (string->url "http://racket-lang.org/a") has a
path field with only the string "a".
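The trailing-slash rule can be checked directly. The helper path-strings is a hypothetical name introduced for this sketch.

```racket
#lang racket/base
(require net/url)

;; path-strings is a hypothetical helper: it lists the strings in
;; the path field of the parsed URL.
(define (path-strings str)
  (map path/param-path (url-path (string->url str))))

(path-strings "http://racket-lang.org/a/")  ; '("a" "")
(path-strings "http://racket-lang.org/a")   ; '("a")
```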
When a "file" URL is represented by a url structure,
the path field is mostly a list of path elements. For Unix
paths, the root directory is not included in path; its
presence or absence is implicit in the path-absolute? flag.
For Windows paths, the first element typically represents a drive, but
a UNC path is represented by a first element that is "" and
then successive elements complete the drive components that are
separated by / or \.
A path/param structure is a pair that joins a path segment with its
params in a URL.
1.2 URL Functions
An HTTP connection is created as a pure port or an
impure port. A pure port is one from which the MIME headers
have been removed, so that what remains is purely the first content
fragment. An impure port is one that still has its MIME headers.
If str starts with "file:", then the path is always
parsed as an absolute path, and the parsing details depend on
file-url-path-convention-type:
'unix : If "file:" is followed by
// and a non-/, then the first element
after the // is parsed as a host (and maybe port);
otherwise, the first element starts the path, and the host is
"".
'windows : If "file:" is followed by
//, then the // is stripped; the remainder is
parsed as a Windows path. The host is always "" and
the port is always #f.
Given a base URL and a relative path, combine-url/relative combines the
two according to the rules in RFC 3986 [RFC3986] and returns a new URL.
This function does not raise any exceptions.
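A brief sketch of combining a base URL with relative references, using combine-url/relative from net/url. The base address and file names are placeholders.

```racket
#lang racket/base
(require net/url)

(define base (string->url "http://racket-lang.org/docs/guide.html"))

;; A relative file name replaces the last path element of the base.
(url->string (combine-url/relative base "reference.html"))
;; "http://racket-lang.org/docs/reference.html"

;; A reference starting with / replaces the whole path.
(url->string (combine-url/relative base "/about"))
;; "http://racket-lang.org/about"
```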
Turns a string into a URL, applying (what appear to be) Netscape’s
conventions on automatically specifying the scheme: a string starting
with a slash gets the scheme "file", while all others get the
scheme "http".
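The scheme-guessing rule can be observed with netscape/string->url from net/url. The input strings are placeholders; the third case, where an explicit scheme is kept, reflects how the function is understood to behave rather than something stated above.

```racket
#lang racket/base
(require net/url)

;; A leading slash selects the "file" scheme.
(url-scheme (netscape/string->url "/tmp/notes.txt"))            ; "file"

;; Anything else without a scheme gets "http".
(url-scheme (netscape/string->url "racket-lang.org"))           ; "http"

;; A string that already names a scheme keeps it.
(url-scheme (netscape/string->url "https://racket-lang.org/"))  ; "https"
```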
Generates a string corresponding to the contents of a url struct. For a
"file:" URL, the URL must not be relative, the result always starts
with file://, and the interpretation of the path depends on the value
of file-url-path-convention-type:
'unix : Elements in URL are treated as path
elements. Empty strings in the path list are treated like
'same.
'windows : If the first element is "" then
the next two elements define the UNC root, and the rest of the
elements are treated as path elements. Empty strings in the
path list are treated like 'same.
The url->string procedure uses
alist->form-urlencoded when formatting the query, so it is
sensitive to the current-alist-separator-mode parameter for
determining the association separator. The default is to separate
associations with a &.
Converts a path to a url.
Converts URL, which is assumed to be a "file" URL,
to a path.
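A round trip between paths and "file" URLs might look like the following. This sketch assumes a Unix platform, so that the default path convention is 'unix; the path /tmp/data.txt is a placeholder and does not need to exist.

```racket
#lang racket/base
(require net/url)

;; Assumes Unix path conventions.
(define u (path->url "/tmp/data.txt"))

(url-scheme u)                       ; "file"
(url-path-absolute? u)               ; #t
(map path/param-path (url-path u))   ; '("tmp" "data.txt")
(url->string u)                      ; "file:///tmp/data.txt"

;; Converting back yields the original path.
(path->string (url->path u))         ; "/tmp/data.txt"
```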
Initiates a GET/HEAD/DELETE request for
URL and returns a
pure port corresponding to the body of the response. The
optional list of strings can be used to send header lines to the
server.
The GET method is used to retrieve whatever information is identified
by URL. If redirections is not 0, then
get-pure-port will follow redirections from the server,
up to the limit given by redirections.
The HEAD method is identical to GET, except the server must not return
a message body. The meta-information returned in a response to a HEAD
request should be identical to the information in a response to a GET
request.
The DELETE method is used to delete the entity identified by
URL.
Beware: By default, "https" scheme handling does not
verify a server’s certificate (i.e., it is the equivalent of clicking
through a browser’s warnings), so communication is encrypted, but the
identity of the server is not verified. To validate the server’s
certificate, set current-https-protocol to a context created
with ssl-make-client-context, and enable certificate validation
in the context with ssl-set-verify!.
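One way to follow this advice is sketched below. The helper names make-verifying-context and fetch-verified are hypothetical. The calls to ssl-load-default-verify-sources! and ssl-set-verify-hostname! from the openssl library go beyond what the text above names; they are included on the assumption that a context needs trusted roots and a hostname check for validation to be useful.

```racket
#lang racket/base
(require net/url openssl racket/port)

;; make-verifying-context is a hypothetical helper.
(define (make-verifying-context)
  (define ctx (ssl-make-client-context))
  ;; Assumption: load the platform's default trusted certificates.
  (ssl-load-default-verify-sources! ctx)
  ;; Enable certificate validation, as described above.
  (ssl-set-verify! ctx #t)
  ;; Assumption: also check that the certificate matches the host.
  (ssl-set-verify-hostname! ctx #t)
  ctx)

;; fetch-verified is a hypothetical helper; calling it performs a
;; network request with validation enabled.
(define (fetch-verified address)
  (parameterize ([current-https-protocol (make-verifying-context)])
    (define in (get-pure-port (string->url address)))
    (define text (port->string in))
    (close-input-port in)
    text))
```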
The "file" scheme for URLs is handled only by
get-pure-port, which uses open-input-file, does not
handle exceptions, and ignores the optional strings.
Like get-pure-port, etc., but the resulting impure port contains both
the returned headers and the body. The "file" URL scheme is not handled
by these functions.
Initiates a POST/PUT request for URL and sends the post byte string.
The result is a pure port, which contains the body of the response.
The optional list of strings can be used to send header lines to the
server.
Beware: See get-pure-port for warnings about
"https" certificate validation.
Writes the output of a pure port, which is useful for debugging purposes.
Purifies a port, returning the MIME headers, plus a leading line of
the form HTTP/‹vers› ‹code› ‹message›, where ‹vers› is
something like 1.0 or 1.1, ‹code› is an
exact integer for the response code, and ‹message› is
arbitrary text without a return or newline.
The net/head library provides procedures, such as
extract-field, for manipulating the header.
Since web servers sometimes return mis-formatted replies,
purify-port is liberal in what it accepts as a header. As a
result, the result string may be ill formed, but it will either be the
empty string, or it will be a string matching the following regexp:
#rx"^HTTP/.*?(\r\n\r\n|\n\n|\r\r)"
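To see purify-port at work without a network connection, the sketch below feeds it a string port holding a canned response. The response text is made up for illustration and stands in for an impure port.

```racket
#lang racket/base
(require net/url net/head racket/port)

;; A canned response standing in for an impure port.
(define in
  (open-input-string
   "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"))

;; Status line plus MIME headers, up to and including the blank line.
(define headers (purify-port in))

;; What is left on the port is the body.
(define body (port->string in))

(extract-field "Content-Type" headers)   ; "text/plain"
body                                     ; "hello"
```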
The get-pure-port/headers function does a GET request on url, follows
up to redirections redirections, and returns a port containing
the data as well as the headers for the final connection.
Given a URL and a connect procedure like get-pure-port to convert the
URL to an input port (either a pure port or impure port), calls the
handle procedure on the port and closes the port on return. The result
of the handle procedure is the result of call/input-url.
When a header argument is supplied, it is passed along to the
connect procedure.
The connection is made in such a way that the port is closed before
call/input-url returns, no matter how it returns. In
particular, it is closed if handle raises an exception, or if
the connection process is interrupted by an asynchronous break
exception.
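The closing behavior can be sketched offline by passing a stand-in connect procedure. The names read-all, canned-connect, and last-port are hypothetical; in real use, connect would be something like get-pure-port.

```racket
#lang racket/base
(require net/url racket/port)

;; read-all is a hypothetical helper built on call/input-url.
(define (read-all u connect)
  (call/input-url u connect port->string))

;; Real use (performs a network request):
;; (read-all (string->url "http://racket-lang.org/") get-pure-port)

;; Offline stand-in for a connect procedure; it remembers the port
;; it created so that the port can be inspected afterwards.
(define last-port #f)
(define (canned-connect u)
  (set! last-port (open-input-string "canned body"))
  last-port)

(define result
  (read-all (string->url "http://example.com/") canned-connect))

result                    ; "canned body"
(port-closed? last-port)  ; #t
```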
A parameter that determines a mapping of proxy servers used for
connections. Each mapping is a list of three elements:
the URL scheme, such as "http";
the proxy server address; and
the proxy server port number.
Currently, the only proxiable scheme is "http". The default
mapping is the empty list (i.e., no proxies).
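A proxy mapping could be installed for the duration of some requests as sketched below. The parameter is assumed to be current-proxy-servers from net/url; the host proxy.example.com and port 3128 are placeholder values.

```racket
#lang racket/base
(require net/url)

;; Each mapping is (list scheme proxy-host proxy-port).
(define example-proxies '(("http" "proxy.example.com" 3128)))

(define seen
  (parameterize ([current-proxy-servers example-proxies])
    ;; "http" connections opened in this dynamic extent would be
    ;; routed through the proxy; here we only read the mapping back.
    (current-proxy-servers)))

seen   ; '(("http" "proxy.example.com" 3128))
```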
A parameter that determines the connection mode for "https"
connections; the parameter value is passed as the third argument to
ssl-connect when creating an "https" connection.
Set this parameter to validate a server’s certificates, for example,
as described with get-pure-port.
1.3 URL Unit
url@, url^, and url+scheme^ are deprecated.
They exist for backward-compatibility and will likely be removed in
the future. New code should use the net/url module.
The url+scheme^ signature contains
current-connect-scheme, which url@ binds to a
parameter. The parameter is set to the scheme of a URL when
tcp-connect is called to create a connection. A
tcp-connect variant linked to url@ can check this
parameter to choose the connection mode; in particular, net/url
supplies a tcp-connect that actually uses ssl-connect
when (current-connect-scheme) produces "https".
Note that net/url does not provide the
current-connect-scheme parameter.
1.4 URL Signature
The url^ signature includes everything exported by the net/url module
except current-https-protocol. Note that the exports of
net/url and the url^ signature do not include
current-connect-scheme.
The url+scheme^ signature adds current-connect-scheme to url^.