13 String Encodings
The scheme_utf8_decode function decodes a char array as
UTF-8 into either a UCS-4 mzchar array or a UTF-16 short
array. The scheme_utf8_encode function encodes either a UCS-4
mzchar array or a UTF-16 short array into a UTF-8
char array.
These functions can be used to check or measure an encoding or
decoding without actually producing the resulting encoding or decoding,
and variants of the functions provide control over the handling of
decoding errors.
int scheme_utf8_decode(const unsigned char* s, int start, int end,
                       mzchar* us, int dstart, int dend,
                       intptr_t* ipos, char utf16, int permissive)
Decodes a byte array as UTF-8 to produce either Unicode code points
into us (when utf16 is zero) or UTF-16 code units into
us cast to short* (when utf16 is non-zero). No nul
terminator is added to us.
The result is non-negative when all of the given bytes are decoded,
and the result is the length of the decoding (in mzchars or
shorts). A -2 result indicates an invalid encoding
sequence in the given bytes (possibly because the range to decode
ended mid-encoding), and a -3 result indicates that decoding
stopped because not enough room was available in the result string.
The start and end arguments specify a range of s to
be decoded. If end is negative, strlen(s) is used
as the end.
If us is NULL, then decoded bytes are not produced, but
the result is valid as if decoded bytes were written. The
dstart and dend arguments specify a target range in
us (in mzchar or short units) for the decoding; a
negative value for dend indicates that any number of bytes can
be written to us, which is normally sensible only when us
is NULL for measuring the length of the decoding.
If ipos is non-NULL, it is filled with the first undecoded
index within s. If the function result is non-negative, then
*ipos is set to the ending index (which is end if
non-negative, strlen(s) otherwise). If the result is
-1 or -2, then *ipos effectively indicates
how many bytes were decoded before decoding stopped.
If permissive is non-zero, it is used as the decoding of any byte
that is not part of a valid UTF-8 encoding, including when the input
ends in the middle of an encoding. Thus, the function
result can be -1 or -2 only if permissive is 0.
On Windows, when utf16 is non-zero, decoding supports a natural
extension of UTF-8 that can produce unpaired UTF-16 surrogates in the
result.
This function does not allocate or trigger garbage collection.
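The result conventions described above (a non-negative count, -2 for an invalid sequence, -3 for lack of room, and measuring when us is NULL) can be illustrated with a small self-contained sketch. The function sketch_utf8_decode below is a hypothetical stand-in written for this example, not the Racket implementation: it assumes strict decoding only (as if permissive were 0), produces only UCS-4 output (as if utf16 were 0), uses uint32_t in place of mzchar, and omits ipos.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in, NOT the Racket implementation.
   Returns the number of decoded code points, -2 for an invalid
   sequence (including a range that ends mid-encoding), or -3 when
   the target range [dstart, dend) is too small. Nothing is written
   when us is NULL; a negative dend means "no limit". */
int sketch_utf8_decode(const unsigned char *s, int start, int end,
                       uint32_t *us, int dstart, int dend)
{
  int i, n = 0;
  if (end < 0) end = (int)strlen((const char *)s);
  assert(start >= 0 && start <= end);
  i = start;
  while (i < end) {
    unsigned char b = s[i];
    uint32_t cp, min;
    int extra, k;
    if (b < 0x80)                { cp = b;        extra = 0; min = 0; }
    else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; min = 0x80; }
    else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; min = 0x800; }
    else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; min = 0x10000; }
    else return -2;                    /* stray continuation or bad lead byte */
    if (i + extra >= end) return -2;   /* range ended mid-encoding */
    for (k = 1; k <= extra; k++) {
      if ((s[i + k] & 0xC0) != 0x80) return -2;
      cp = (cp << 6) | (uint32_t)(s[i + k] & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
      return -2;                       /* overlong, out of range, or surrogate */
    if (dend >= 0 && dstart + n >= dend) return -3;  /* no room left */
    if (us) us[dstart + n] = cp;
    n++;
    i += extra + 1;
  }
  return n;
}
```

A caller would typically make two passes: one with a NULL target and a negative dend to measure, then a second with a buffer of the measured size. With the real scheme_utf8_decode the same pattern applies, with the additional ipos, utf16, and permissive arguments.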
Like scheme_utf8_decode, but returns -1 if the input ends in the
middle of a UTF-8 encoding even if permissive is non-zero.
Added in version 6.0.1.13.
int scheme_utf8_decode_as_prefix(const unsigned char* s, int start, int end,
                                 mzchar* us, int dstart, int dend,
                                 intptr_t* ipos, char utf16, int permissive)
Like scheme_utf8_decode, but the result is always the number of
decoded mzchars or shorts. If a decoding error is encountered, the
result is still the size of the decoding up until the error.
Like scheme_utf8_decode, but with fewer arguments. The decoding
produces UCS-4 mzchars. If the buffer us is non-NULL, it is assumed
to be long enough to hold the decoding (which cannot be longer than
the length of the input, though it may be shorter). If len is
negative, strlen(s) is used as the input length.
Like scheme_utf8_decode, but with fewer arguments. The decoding
produces UCS-4 mzchars. The buffer us must be non-NULL, and it is
assumed to be long enough to hold the decoding (which cannot be
longer than the length of the input, though it may be shorter). If
len is negative, strlen(s) is used as the input length.
In addition to the result of scheme_utf8_decode, the result
can be -1 to indicate that the input ended with a partial
(valid) encoding. A -1 result is possible even when
permissive is non-zero.
Like scheme_utf8_decode_all with permissive as 0, but if buf is not
large enough (as indicated by blen) to hold the result, a new buffer
is allocated. Unlike other functions, this one adds a nul terminator
to the decoding result. The function result is either buf (if it was
big enough) or a buffer allocated with scheme_malloc_atomic.
int scheme_utf8_decode_count(const unsigned char* s, int start, int end,
                             int* state, int might_continue, int permissive)
Like scheme_utf8_decode, but without producing the decoded mzchars,
and always returning the number of decoded mzchars up until a
decoding error (if any). If might_continue is non-zero, then a
partial valid encoding at the end of the input is not decoded when
permissive is also non-zero.
If state is non-NULL, it holds information about partial
encodings; it should be set to zero for an initial call, and then
passed back to scheme_utf8_decode along with bytes that
extend the given input (i.e., without any unused partial
encodings). Typically, this mode makes sense only when
might_continue and permissive are non-zero.
int scheme_utf8_encode(const mzchar* us, int start, int end,
                       unsigned char* s, int dstart, char utf16)
Encodes the given UCS-4 array of mzchars (if utf16 is
zero) or UTF-16 array of shorts (if utf16 is non-zero)
into s. The end argument must be no less than
start.
The array s is assumed to be long enough to contain the
encoding, but no encoding is written if s is NULL. The
dstart argument indicates a starting place in s to hold
the encoding. No nul terminator is added to s.
The result is the number of bytes produced for the encoding (or that
would be produced if s was non-NULL). Encoding never
fails.
On Windows, when utf16 is non-zero, encoding supports unpaired
surrogates in the input UTF-16 code-unit sequence, in which case
encoding generates a natural extension of UTF-8 that encodes unpaired
surrogates.
This function does not allocate or trigger garbage collection.
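The measuring behavior (no bytes written when s is NULL, with the result still reporting the encoded length) can be sketched as follows. The function sketch_utf8_encode is a hypothetical stand-in for illustration, not the Racket implementation: it assumes UCS-4 input only (as if utf16 were 0), uses uint32_t in place of mzchar, and assumes every input value is at most 0x10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in, NOT the Racket implementation.
   Encodes us[start..end) as UTF-8 into s starting at s[dstart].
   Returns the number of bytes produced, or that would be produced
   when s is NULL. No nul terminator is written. */
int sketch_utf8_encode(const uint32_t *us, int start, int end,
                       unsigned char *s, int dstart)
{
  int i, n = 0;
  assert(end >= start);
  for (i = start; i < end; i++) {
    uint32_t c = us[i];
    if (c < 0x80) {
      if (s) s[dstart + n] = (unsigned char)c;
      n += 1;
    } else if (c < 0x800) {
      if (s) {
        s[dstart + n]     = (unsigned char)(0xC0 | (c >> 6));
        s[dstart + n + 1] = (unsigned char)(0x80 | (c & 0x3F));
      }
      n += 2;
    } else if (c < 0x10000) {
      if (s) {
        s[dstart + n]     = (unsigned char)(0xE0 | (c >> 12));
        s[dstart + n + 1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        s[dstart + n + 2] = (unsigned char)(0x80 | (c & 0x3F));
      }
      n += 3;
    } else {
      if (s) {
        s[dstart + n]     = (unsigned char)(0xF0 | (c >> 18));
        s[dstart + n + 1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
        s[dstart + n + 2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        s[dstart + n + 3] = (unsigned char)(0x80 | (c & 0x3F));
      }
      n += 4;
    }
  }
  return n;
}
```

Calling the sketch once with a NULL target yields the size to allocate (plus one byte if the caller wants to add its own nul terminator), and a second call fills the buffer.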
Like scheme_utf8_encode_all, but the length of buf is given, and if
it is not long enough to hold the encoding, a buffer is allocated. A
nul terminator is added to the encoded array. The result is either
buf or an array allocated with scheme_malloc_atomic.
Like scheme_utf8_encode_to_buffer, but the length of the resulting
encoding (not including a nul terminator) is reported in rlen if it
is non-NULL.
unsigned short* scheme_ucs4_to_utf16(const mzchar* text, int start, int end,
                                     unsigned short* buf, int bufsize,
                                     intptr_t* ulen, int term_size)
Converts a UCS-4 encoding (the indicated range of text) to a
UTF-16 encoding. The end argument must be no less than
start.
A result buffer is allocated if buf is not long enough (as
indicated by bufsize). If ulen is non-NULL, it is
filled with the length of the UTF-16 encoding. The term_size
argument indicates a number of shorts to reserve at the end of
the result buffer for a terminator (but no terminator is actually
written).
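For readers unfamiliar with the conversion itself, the following sketch shows the standard surrogate-pair arithmetic that any UCS-4 to UTF-16 conversion performs. The function sketch_ucs4_to_utf16 is a hypothetical helper for illustration only: unlike scheme_ucs4_to_utf16, it never allocates, it takes no bufsize or term_size, and it assumes every input value is at most 0x10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, NOT the Racket implementation.
   Converts text[start..end) to UTF-16 code units in out (when out is
   non-NULL) and returns the number of code units. */
int sketch_ucs4_to_utf16(const uint32_t *text, int start, int end,
                         uint16_t *out)
{
  int i, n = 0;
  assert(end >= start);
  for (i = start; i < end; i++) {
    uint32_t c = text[i];
    if (c >= 0x10000) {
      /* Outside the BMP: split the 20-bit offset into two surrogates. */
      c -= 0x10000;
      if (out) {
        out[n]     = (uint16_t)(0xD800 | (c >> 10));    /* high surrogate */
        out[n + 1] = (uint16_t)(0xDC00 | (c & 0x3FF));  /* low surrogate */
      }
      n += 2;
    } else {
      if (out) out[n] = (uint16_t)c;
      n += 1;
    }
  }
  return n;
}
```

The returned count corresponds to the value that the real function reports through ulen, which is why the UTF-16 length can exceed the number of input mzchars.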
mzchar* scheme_utf16_to_ucs4(const unsigned short* text, int start, int end,
                             mzchar* buf, int bufsize,
                             intptr_t* ulen, int term_size)
Converts a UTF-16 encoding (the indicated range of text) to a
UCS-4 encoding. The end argument must be no less than
start.
A result buffer is allocated if buf is not long enough (as
indicated by bufsize). If ulen is non-NULL, it is
filled with the length of the UCS-4 encoding. The term_size
argument indicates a number of mzchars to reserve at the end of
the result buffer for a terminator (but no terminator is actually
written).
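The reverse direction combines each surrogate pair into one code point. The sketch below is a hypothetical helper, not the Racket implementation: it never allocates, it takes no bufsize or term_size, and it passes unpaired surrogates through unchanged, which is an assumption of this example rather than documented behavior of scheme_utf16_to_ucs4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, NOT the Racket implementation.
   Converts text[start..end) from UTF-16 to UCS-4 in out (when out is
   non-NULL) and returns the number of code points. */
int sketch_utf16_to_ucs4(const uint16_t *text, int start, int end,
                         uint32_t *out)
{
  int i = start, n = 0;
  assert(end >= start);
  while (i < end) {
    uint32_t c = text[i++];
    /* A high surrogate followed by a low surrogate forms one code point. */
    if (c >= 0xD800 && c <= 0xDBFF && i < end
        && text[i] >= 0xDC00 && text[i] <= 0xDFFF) {
      c = 0x10000 + ((c - 0xD800) << 10) + (uint32_t)(text[i] - 0xDC00);
      i++;
    }
    if (out) out[n] = c;
    n++;
  }
  return n;
}
```

Because a pair collapses into a single value, the UCS-4 length is never greater than the UTF-16 length of the same text.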