2 URLs and HTTP
To access the text of a document from the web, first obtain its URL as a string. Convert the address into a url structure using string->url. Then, open the document using get-pure-port or get-impure-port, depending on whether or not you wish to examine its MIME headers. At this point, you have a regular input port with which to process the document, as with any other file.
Currently the only supported protocols are "http", "https", and sometimes "file".
The net/url library logs information and background-thread errors to a logger named 'net/url.
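The steps in the opening paragraph can be sketched as follows. This is a minimal illustration, not a definitive recipe: the helper name fetch-text and the address http://example.com/ are assumptions chosen for the example, and running the main submodule performs a real network request.

```racket
#lang racket/base
(require net/url
         racket/port)

;; fetch-text : string? -> string?
;; Hypothetical helper: parses the address, opens a pure port (MIME headers
;; already removed), and reads the whole document as a string.
(define (fetch-text address)
  (define u (string->url address))
  (define in (get-pure-port u))
  (dynamic-wind
    void
    (lambda () (port->string in))
    (lambda () (close-input-port in))))

(module+ main
  ;; Placeholder address; this contacts the network when run.
  (display (fetch-text "http://example.com/")))
```

Using get-impure-port instead of get-pure-port in the helper would leave the MIME headers at the front of the port.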
2.1 URL Structure
(require net/url-structs) | package: base |
struct
(struct url (scheme user host port path-absolute? path query fragment)
  #:extra-constructor-name make-url)
  scheme : (or/c false/c string?)
  user : (or/c false/c string?)
  host : (or/c false/c string?)
  port : (or/c false/c exact-nonnegative-integer?)
  path-absolute? : boolean?
  path : (listof path/param?)
  query : (listof (cons/c symbol? (or/c false/c string?)))
  fragment : (or/c false/c string?)
  http://sky@www:801/cgi-bin/finger;xyz?name=shriram;host=nw#top
  {-1}   {2} {3} {4}{---5---------} {6} {----7-------------} {8}

  1 = scheme, 2 = user, 3 = host, 4 = port,
  5 = path (two elements), 6 = param (of second path element),
  7 = query, 8 = fragment
The strings inside the user, path, query, and fragment fields are represented directly as Racket strings, without URL-syntax-specific quoting. The procedures string->url and url->string translate encodings such as %20 into spaces and back again.
By default, query associations are parsed with either ; or & as a separator, and they are generated with & as a separator. The current-alist-separator-mode parameter adjusts the behavior.
An empty string at the end of the path list corresponds to a URL that ends in a slash. For example, the result of (string->url "http://racket-lang.org/a/") has a path field with strings "a" and "", while the result of (string->url "http://racket-lang.org/a") has a path field with only the string "a".
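The field layout described above can be explored with a short sketch. It parses the example URL from the diagram and reads each field back; the helper path-strings is a hypothetical name introduced only for this illustration.

```racket
#lang racket/base
(require net/url)

(define u
  (string->url "http://sky@www:801/cgi-bin/finger;xyz?name=shriram;host=nw#top"))

(url-scheme u)                       ; "http"
(url-user u)
(url-host u)
(url-port u)                         ; 801
(url-path-absolute? u)
(map path/param-path (url-path u))   ; the two path elements
(map path/param-param (url-path u))  ; parameters attached to each element
(url-query u)                        ; association list keyed by symbols
(url-fragment u)

;; path-strings : string? -> (listof (or/c string? symbol?))
;; Hypothetical helper: the path field with parameters dropped.
(define (path-strings str)
  (map path/param-path (url-path (string->url str))))

;; A trailing slash shows up as an empty string at the end of the path.
(path-strings "http://racket-lang.org/a/")  ; '("a" "")
(path-strings "http://racket-lang.org/a")   ; '("a")
```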
When a "file" URL is represented by a url structure, the path field is mostly a list of path elements. For Unix paths, the root directory is not included in path; its presence or absence is implicit in the path-absolute? flag. For Windows paths, the first element typically represents a drive, but a UNC path is represented by a first element that is "" and then successive elements complete the drive components that are separated by / or \.
struct
(struct path/param (path param)
  #:extra-constructor-name make-path/param)
  path : (or/c string? (or/c 'up 'same))
  param : (listof string?)
2.2 URL Parsing Functions
(require net/url-string) | package: base |
value
url-regexp : regexp?
Added in version 6.4.0.7 of package base.
procedure
(string->url str) → url?
str : url-regexp
The contract on str insists that, if the url has a scheme, then the scheme begins with a letter and consists only of letters, numbers, +, -, and . characters.
If str starts with file: (case-insensitively) and the value of the file-url-path-convention-type parameter is 'windows, then special parsing rules apply to accommodate ill-formed but widely-recognized path encodings:
If file: is followed by //, a letter, and :, then the // is stripped and the remainder parsed as a Windows path.
If file: is followed by \\, then the \\ is stripped and the remainder parsed as a Windows path.
In both of these cases, the host is "", the port is #f, and path-element decoding (which extracts parameters or replaces %20 with a space, for example) is not applied to the path.
Changed in version 6.3.0.1 of package base: Changed handling of file: URLs when the value of file-url-path-convention-type is 'windows.
Changed in version 6.4.0.7: Use more specific regexp for input contract.
Changed in version 6.5.0.3: Support a host as an IPv6 literal address as written in [...].
procedure
(combine-url/relative base relative) → url?
base : url? relative : string?
This function does not raise any exceptions.
procedure
(netscape/string->url str) → url?
str : string?
procedure
(url->string URL) → string?
URL : url?
The url->string procedure uses alist->form-urlencoded when formatting the query, so it is sensitive to the current-alist-separator-mode parameter for determining the association separator. The default is to separate associations with a &.
The encoding of path segments and fragment is sensitive to the current-url-encode-mode parameter.
Changed in version 6.5.0.3 of package base: Support a host that is an IPv6 literal address by writing the address in [...].
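A small sketch of how url->string re-applies URL quoting and how the association separator can be adjusted. The host example.com and the field values are placeholders; current-alist-separator-mode is taken from net/uri-codec.

```racket
#lang racket/base
(require net/url
         net/uri-codec) ; provides current-alist-separator-mode

;; Build a url structure directly; the fields hold plain, unquoted strings.
(define u
  (url "http" #f "example.com" #f #t
       (list (path/param "a b" '()))
       '((x . "1") (y . "2"))
       #f))

;; The space in the path segment is quoted, and & separates associations.
(url->string u) ; "http://example.com/a%20b?x=1&y=2"

;; Parsing and printing round-trips the %20 quoting.
(url->string (string->url "http://example.com/a%20b")) ; "http://example.com/a%20b"

;; With 'semi mode, associations are generated with ; instead of &.
(parameterize ([current-alist-separator-mode 'semi])
  (url->string u))
```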
procedure
(path->url path) → url?
path : (or/c path-string? path-for-some-system?)
With the 'unix path convention, the host in the resulting URL is always "", and the path is absolute from the root.
With the 'windows path convention and a UNC path, the machine part of the UNC root is used as the URL’s host, and the drive part of the root is the first element of the URL’s path.
Changed in version 6.3.0.1 of package base: Changed 'windows encoding of UNC paths.
procedure
(url->path URL [kind]) → path-for-some-system?
URL : url? kind : (or/c 'unix 'windows) = (system-path-convention-type)
For the 'unix path convention, the URL’s host is ignored, and the URL’s path is formed relative to the root.
For the 'windows path convention:
A non-"" value for the URL’s host field creates a UNC path, where the host is the UNC root’s machine name, the URL’s path must be non-empty, and the first element of the URL’s path is used as the drive part of the UNC root.
For legacy reasons, if the URL’s host is "", the URL’s path contains at least three elements, and the first element of the URL’s path is also "", then a UNC path is created by using the second and third elements of the path as the UNC root’s machine and drive, respectively.
Otherwise, the URL’s path is converted to a Windows path. The result is an absolute path if the URL’s first path element corresponds to a drive, otherwise the result is a relative path (even though URLs are not intended to represent relative paths).
Changed in version 6.3.0.1 of package base: Changed 'windows treatment of a non-"" host.
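The 'unix case of the path conversions can be sketched as below. The file name /tmp/notes.txt is a placeholder, and the 'unix convention is passed explicitly so that the example does not depend on the platform where it runs.

```racket
#lang racket/base
(require net/url)

;; URL to path: the host is ignored and the path is formed from the root.
(define p (url->path (string->url "file:///tmp/notes.txt") 'unix))
(path->bytes p) ; #"/tmp/notes.txt"

;; Path to URL, starting from a path value that uses the 'unix convention.
(define u (path->url (bytes->path #"/tmp/notes.txt" 'unix)))
(url-host u)                        ; "" under the 'unix convention
(map path/param-path (url-path u))  ; path elements, root not included
(url-path-absolute? u)
```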
procedure
path : (and/c (or/c path-string? path-for-some-system?) relative-path?)
parameter
(file-url-path-convention-type) → (or/c 'unix 'windows)
(file-url-path-convention-type kind) → void? kind : (or/c 'unix 'windows)
parameter
(current-url-encode-mode) → (or/c 'recommended 'unreserved)
(current-url-encode-mode mode) → void? mode : (or/c 'recommended 'unreserved)
Internally, 'recommended mode uses uri-path-segment-encode and uri-encode, while 'unreserved mode uses uri-path-segment-unreserved-encode and uri-unreserved-encode.
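To see the effect of the two encoding modes, one can print the same url structure under each. This is only a sketch: the path segment below is an arbitrary placeholder containing punctuation, and the exact set of characters that differ between modes is determined by the net/uri-codec functions named above, so no expected output is claimed here.

```racket
#lang racket/base
(require net/url)

;; A url whose single path segment contains punctuation characters.
(define u
  (url "http" #f "example.com" #f #t
       (list (path/param "it's(here)" '()))
       '()
       #f))

(define recommended
  (parameterize ([current-url-encode-mode 'recommended])
    (url->string u)))

(define unreserved
  (parameterize ([current-url-encode-mode 'unreserved])
    (url->string u)))

;; Print both renderings for comparison.
(printf "recommended: ~a\nunreserved:  ~a\n" recommended unreserved)
```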
2.3 URL Functions
An HTTP connection is created as a pure port or an impure port. A pure port is one from which the MIME headers have been removed, so that what remains is purely the first content fragment. An impure port is one that still has its MIME headers.
procedure
(get-pure-port URL [header #:redirections redirections]) → input-port?
  URL : url?
  header : (listof string?) = null
  redirections : exact-nonnegative-integer? = 0
procedure
(head-pure-port URL [header]) → input-port?
URL : url? header : (listof string?) = null
procedure
(delete-pure-port URL [header]) → input-port?
URL : url? header : (listof string?) = null
procedure
(options-pure-port URL [header]) → input-port?
URL : url? header : (listof string?) = null
The GET method is used to retrieve whatever information is identified by URL. If redirections is not 0, then get-pure-port will follow redirections from the server, up to the limit given by redirections.
The HEAD method is identical to GET, except the server must not return a message body. The meta-information returned in a response to a HEAD request should be identical to the information in a response to a GET request.
The DELETE method is used to delete the entity identified by URL.
Beware: By default, "https" scheme handling does not verify a server’s certificate (i.e., it is the equivalent of clicking through a browser’s warnings), so communication is encrypted, but the identity of the server is not verified. To validate the server’s certificate, set current-https-protocol to 'secure or to a context created with ssl-secure-client-context.
The "file" scheme for URLs is handled only by get-pure-port, which uses open-input-file, does not handle exceptions, and ignores the optional strings.
Changed in version 6.1.1.8 of package base: Added options-pure-port.
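A sketch combining redirection following with certificate validation, as described above. The helper name fetch-verified, the redirection limit of 5, and the address are assumptions made for illustration; running the main submodule performs a real network request.

```racket
#lang racket/base
(require net/url
         net/url-connect ; provides current-https-protocol
         racket/port)

;; fetch-verified : string? -> string?
;; Hypothetical helper: GET with up to 5 redirections, with the server's
;; certificate validated because current-https-protocol is 'secure.
(define (fetch-verified address)
  (parameterize ([current-https-protocol 'secure])
    (define in (get-pure-port (string->url address) #:redirections 5))
    (dynamic-wind
      void
      (lambda () (port->string in))
      (lambda () (close-input-port in)))))

(module+ main
  ;; Placeholder address; this contacts the network when run.
  (display (fetch-verified "https://example.com/")))
```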
procedure
(get-impure-port URL [header]) → input-port?
URL : url? header : (listof string?) = null
procedure
(head-impure-port URL [header]) → input-port?
URL : url? header : (listof string?) = null
procedure
(delete-impure-port URL [header]) → input-port?
URL : url? header : (listof string?) = null
procedure
(options-impure-port URL [header]) → input-port?
URL : url? header : (listof string?) = null
Changed in version 6.1.1.8 of package base: Added options-impure-port.
procedure
(post-pure-port URL post [header]) → input-port?
URL : url? post : bytes? header : (listof string?) = null
procedure
(put-pure-port URL post [header]) → input-port?
URL : url? post : bytes? header : (listof string?) = null
Beware: See get-pure-port for warnings about "https" certificate validation.
procedure
(post-impure-port URL post [header]) → input-port?
URL : url? post : bytes? header : (listof string?) = null
procedure
(put-impure-port URL post [header]) → input-port?
URL : url? post : bytes? header : (listof string?) = null
procedure
(display-pure-port in) → void?
in : input-port?
procedure
(purify-port in) → string?
in : input-port?
The net/head library provides procedures, such as extract-field, for manipulating the header.
Since web servers sometimes return mis-formatted replies, purify-port is liberal in what it accepts as a header. As a result, the returned string may be ill-formed, but it is either the empty string or a string matching the following regexp:
#rx"^HTTP/.*?(\r\n\r\n|\n\n|\r\r)"
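The header handling can be tried without a network connection, since purify-port works on any input port. In the sketch below a string port with a canned response stands in for the result of get-impure-port; the response text is invented for the example.

```racket
#lang racket/base
(require net/url
         net/head
         racket/port)

;; Stand-in for a port returned by get-impure-port: a canned HTTP response.
(define in
  (open-input-string
   "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"))

;; purify-port consumes and returns the status line and header fields.
(define header (purify-port in))

;; What remains in the port is the content.
(define body (port->string in))

(extract-field "Content-Type" header) ; "text/plain"
body                                  ; "hello"
```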
procedure
(get-pure-port/headers url [headers #:redirections redirections #:status? status? #:connection connection]) → input-port? string?
  url : url?
  headers : (listof string?) = '()
  redirections : exact-nonnegative-integer? = 0
  status? : boolean? = #f
  connection : (or/c #f http-connection?) = #f
The get-pure-port/headers function performs a GET request on url, follows up to redirections redirections, and returns a port containing the data as well as the headers for the final connection. If status? is true, then the status line is included in the result string.
A given connection should be used for communication with a particular HTTP/1.1 server, unless connection is closed (via http-connection-close) between uses for different servers. If connection is provided, read all data from the result port before making a new request with the same connection. (Reusing a connection without reading all data may or may not work.)
procedure
(http-connection? v) → boolean?
v : any/c
procedure
(make-http-connection) → http-connection?
procedure
(http-connection-close connection) → void?
connection : http-connection?
The make-http-connection function creates a “connection” that is initially unconnected. Each call to get-pure-port/headers leaves a connection either connected or unconnected, depending on whether the server allows the connection to continue. The http-connection-close function disconnects, but it does not prevent further use of the connection value.
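Connection reuse might look like the following sketch. The helper fetch-both and the two addresses are hypothetical; both addresses are assumed to name the same HTTP/1.1 server, and each response is read completely before the connection is used again.

```racket
#lang racket/base
(require net/url
         racket/port)

;; fetch-both : string? string? -> (listof exact-nonnegative-integer?)
;; Hypothetical helper: fetches two documents from the same server over one
;; connection value and returns the lengths of the two bodies.
(define (fetch-both first-address second-address)
  (define conn (make-http-connection))
  (define (fetch address)
    (define-values (in headers)
      (get-pure-port/headers (string->url address)
                             #:status? #t
                             #:connection conn))
    ;; Read all data before making another request with the same connection.
    (define body (port->string in))
    (close-input-port in)
    (values headers body))
  (define-values (headers1 body1) (fetch first-address))
  (define-values (headers2 body2) (fetch second-address))
  (http-connection-close conn)
  (list (string-length body1) (string-length body2)))

(module+ main
  ;; Placeholder addresses on one server; this contacts the network when run.
  (fetch-both "http://example.com/" "http://example.com/index.html"))
```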
procedure
(call/input-url URL connect handle) → any
  URL : url?
  connect : (url? . -> . input-port?)
  handle : (input-port? . -> . any)
(call/input-url URL connect handle header) → any
  URL : url?
  connect : (url? (listof string?) . -> . input-port?)
  handle : (input-port? . -> . any)
  header : (listof string?)
When a header argument is supplied, it is passed along to the connect procedure.
The connection is made in such a way that the port is closed before call/input-url returns, no matter how it returns. In particular, it is closed if handle raises an exception, or if the connection process is interrupted by an asynchronous break exception.
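The closing behavior can be observed offline with a stand-in connect procedure. In this sketch fake-connect is a hypothetical procedure that ignores the URL and returns a string port; in real use, a procedure such as get-pure-port would be passed in its place.

```racket
#lang racket/base
(require net/url
         racket/port)

;; Remember the port that the stand-in connect procedure opens.
(define opened #f)

;; fake-connect : url? -> input-port?
;; Hypothetical stand-in for get-pure-port: returns canned content.
(define (fake-connect u)
  (set! opened (open-input-string "hello"))
  opened)

(define result
  (call/input-url (string->url "http://example.com/")
                  fake-connect
                  port->string))

result                ; "hello"
(port-closed? opened) ; #t, the port was closed before call/input-url returned
```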
parameter
(current-proxy-servers) → (listof (list/c string? string? (integer-in 0 65535)))
(current-proxy-servers mapping) → void?
  mapping : (listof (list/c string? string? (integer-in 0 65535)))
value
proxiable-url-schemes : (listof string?) = '("http" "https" "git")
the URL scheme, such as "http", where proxiable-url-schemes lists the URL schemes that can be proxied
the proxy server address; and
the proxy server port number.
The initial value of current-proxy-servers is configured on demand from environment variables. Proxies for each URL scheme are configured from the following variables:
plt_http_proxy, PLT_HTTP_PROXY, http_proxy, HTTP_PROXY, all_proxy, and ALL_PROXY configure the HTTP proxy, where earlier variables in the list take precedence over later ones. HTTP requests are proxied using an HTTP proxy server connection.
plt_https_proxy, PLT_HTTPS_PROXY, https_proxy, HTTPS_PROXY, all_proxy, and ALL_PROXY configure the HTTPS proxy, where earlier variables in the list take precedence over later ones. HTTPS connections are proxied using an HTTP “CONNECT” tunnel.
plt_git_proxy, PLT_GIT_PROXY, git_proxy, GIT_PROXY, all_proxy, and ALL_PROXY configure the GIT proxy, where earlier variables in the list take precedence over later ones. GIT connections are proxied using an HTTP “CONNECT” tunnel.
Each environment variable contains a single URL of the form http://‹hostname›:‹portno›. If any other components of the URL are provided, a warning will be logged to a net/url logger.
The default mapping is the empty list (i.e., no proxies).
parameter
(current-no-proxy-servers) → (listof (or/c string? regexp?))
(current-no-proxy-servers dest-hosts-list) → void? dest-hosts-list : (listof (or/c string? regexp?))
strings that match host names exactly; and
regexps that match host by pattern.
If a proxy server is defined for a URL scheme, then the destination host name is checked against current-no-proxy-servers. The proxy is used if and only if the host name does not match (by the definition above) any in the list.
This parsing is consistent with the no_proxy environment variable used by other software, albeit not consistent with the regexps stored in current-no-proxy-servers.
a string beginning with a . (period): converted to a regexp that performs a suffix match on a destination host name; e.g. .racket-lang.org matches destinations of doc.racket-lang.org, pkgs.racket-lang.org, but neither doc.bracket-lang.org nor pkgs.racket-lang.org.uk;
any other string: converted to a regexp that matches the string exactly.
procedure
(proxy-server-for url-schm [dest-host-name]) → (or/c (list/c string? string? (integer-in 0 65535)) #f)
  url-schm : string?
  dest-host-name : (or/c false/c string?) = #f
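The interaction between current-proxy-servers, current-no-proxy-servers, and proxy-server-for can be sketched as follows. The proxy address proxy.example.com, the port 3128, and the host names are placeholders; both parameters are set explicitly so that the result does not depend on environment variables.

```racket
#lang racket/base
(require net/url)

;; One mapping entry: URL scheme, proxy server address, proxy server port.
(define proxy '("http" "proxy.example.com" 3128))

(define-values (plain exact suffix other)
  (parameterize ([current-proxy-servers (list proxy)]
                 [current-no-proxy-servers
                  ;; A string matches a host name exactly; a regexp matches
                  ;; by pattern (here, a suffix match).
                  (list "localhost" #rx"[.]racket-lang[.]org$")])
    (values (proxy-server-for "http")
            (proxy-server-for "http" "localhost")
            (proxy-server-for "http" "doc.racket-lang.org")
            (proxy-server-for "http" "example.com"))))

plain  ; the proxy entry: no destination host was given
exact  ; #f: "localhost" matches exactly, so no proxy is used
suffix ; #f: the regexp matches the destination host
other  ; the proxy entry: nothing in the no-proxy list matches
```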
procedure
(url-exception? x) → boolean?
x : any/c
procedure
(http-sendrecv/url u [#:method method #:headers headers #:data data #:content-decode decodes]) → bytes? (listof bytes?) input-port?
  u : url?
  method : (or/c bytes? string? symbol?) = #"GET"
  headers : (listof (or/c bytes? string?)) = empty
  data : (or/c false/c bytes? string? data-procedure/c) = #f
  decodes : (listof symbol?) = '(gzip deflate)
This function does not support proxies.
Changed in version 7.6.0.9 of package base: Added support for 'deflate decoding.
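A sketch of calling http-sendrecv/url and consuming its three results. The helper name get-status+body, the Accept header, and the address are assumptions for illustration; running the main submodule performs a real network request.

```racket
#lang racket/base
(require net/url
         racket/port)

;; get-status+body : string? -> (values bytes? (listof bytes?) bytes?)
;; Hypothetical helper: sends a GET request and returns the status line,
;; the header lines, and the full response body.
(define (get-status+body address)
  (define-values (status headers in)
    (http-sendrecv/url (string->url address)
                       #:method #"GET"
                       #:headers '("Accept: text/html")))
  (define body (port->bytes in))
  (close-input-port in)
  (values status headers body))

(module+ main
  ;; Placeholder address; this contacts the network when run.
  (define-values (status headers body)
    (get-status+body "http://example.com/"))
  (printf "~a\n~a header lines, ~a body bytes\n"
          status (length headers) (bytes-length body)))
```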
procedure
(tcp-or-tunnel-connect scheme host port) → input-port? output-port?
  scheme : string?
  host : string?
  port : (between/c 1 65535)
Otherwise the call is equivalent to (tcp-connect host port).
2.4 URL HTTPS mode
(require net/url-connect) | package: base |
These bindings are provided by the net/url-connect library, and used by net/url.
parameter
(current-https-protocol) → (or/c ssl-client-context? symbol?)
(current-https-protocol protocol) → void? protocol : (or/c ssl-client-context? symbol?)
2.5 URL Unit
url@, url^, and url+scheme^ are deprecated. They exist for backward-compatibility and will likely be removed in the future. New code should use the net/url module.
(require net/url-unit) | package: compatibility-lib |
value
url@ : unit?
The url+scheme^ signature contains current-connect-scheme, which url@ binds to a parameter. The parameter is set to the scheme of a URL when tcp-connect is called to create a connection. A tcp-connect variant linked to url@ can check this parameter to choose the connection mode; in particular, net/url supplies a tcp-connect that actually uses ssl-connect when (current-connect-scheme) produces "https".
Note that net/url does not provide the current-connect-scheme parameter.
2.6 URL Signature
(require net/url-sig) | package: compatibility-lib |
signature
url^ : signature
Includes everything exported by the net/url module except current-https-protocol and current-url-encode-mode. Note that the exports of net/url and the url^ signature do not include current-connect-scheme.
signature
url+scheme^ : signature
Adds current-connect-scheme to url^.