13 String Encodings
The scheme_utf8_decode function decodes a char array as
UTF-8 into either a UCS-4 mzchar array or a UTF-16 short
array. The scheme_utf8_encode function encodes either a UCS-4
mzchar array or a UTF-16 short array into a UTF-8
char array.
These functions can also be used to check or measure an encoding or
decoding without actually producing the resulting decoding or encoding,
and variations of the functions provide control over the handling of
decoding errors.
int scheme_utf8_decode(const unsigned char* s, int start, int end,
                       mzchar* us, int dstart, int dend,
                       intptr_t* ipos, char utf16, int permissive)
Decodes a byte array as UTF-8 to produce either Unicode code points
 into us (when utf16 is zero) or UTF-16 code units into
 us cast to short* (when utf16 is non-zero). No nul
 terminator is added to us.
The result is non-negative when all of the given bytes are decoded,
 and the result is the length of the decoding (in mzchars or
 shorts). A -2 result indicates an invalid encoding
 sequence in the given bytes (possibly because the range to decode
 ended mid-encoding), and a -3 result indicates that decoding
 stopped because not enough room was available in the result string.
The start and end arguments specify a range of s to
 be decoded. If end is negative, strlen(s) is used
 as the end.
If us is NULL, then decoded bytes are not produced, but
 the result is valid as if decoded bytes were written. The
 dstart and dend arguments specify a target range in
 us (in mzchar or short units) for the decoding; a
 negative value for dend indicates that any number of bytes can
 be written to us, which is normally sensible only when us
 is NULL for measuring the length of the decoding.
If ipos is non-NULL, it is filled with the first undecoded
 index within s. If the function result is non-negative, then
 *ipos is set to the ending index (which is end if
 non-negative, strlen(s) otherwise). If the result is
 -2 or -3, then *ipos effectively indicates
 how many bytes were decoded before decoding stopped.
If permissive is non-zero, it is used as the decoding of bytes
 that are not part of a valid UTF-8 encoding, and of a partial
 encoding at the end of the input. Thus, the function
 result can be -2 only if permissive is 0.
On Windows, when utf16 is non-zero, decoding supports a natural
 extension of UTF-8 that can produce unpaired UTF-16 surrogates in the
 result.
This function does not allocate or trigger garbage collection.
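The result codes and the NULL-target measuring mode described above can be illustrated with a small standalone sketch. The function sketch_utf8_decode below is hypothetical: it is not Racket's implementation, it covers only the utf16 = 0 case, it uses uint32_t in place of mzchar, and its choice to substitute permissive for one offending byte at a time is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for scheme_utf8_decode with utf16 == 0.
   Returns the number of code points decoded, -2 on an invalid
   sequence (only when permissive is 0), or -3 when the target
   range [dstart, dend) is too small. */
static int sketch_utf8_decode(const unsigned char *s, int start, int end,
                              uint32_t *us, int dstart, int dend,
                              intptr_t *ipos, int permissive)
{
  assert(s != NULL);
  if (end < 0) end = (int)strlen((const char *)s);
  int i = start, j = dstart, result = 0;
  while (i < end) {
    unsigned char c = s[i];
    uint32_t v;
    int extra; /* number of continuation bytes expected */
    if (c < 0x80)                { v = c;        extra = 0; }
    else if ((c & 0xE0) == 0xC0) { v = c & 0x1F; extra = 1; }
    else if ((c & 0xF0) == 0xE0) { v = c & 0x0F; extra = 2; }
    else if ((c & 0xF8) == 0xF0) { v = c & 0x07; extra = 3; }
    else                         { v = 0;        extra = -1; }
    /* A sequence that runs past end counts as invalid here. */
    int ok = (extra >= 0) && (i + extra < end);
    for (int k = 1; ok && k <= extra; k++) {
      if ((s[i + k] & 0xC0) != 0x80) ok = 0;
      else v = (v << 6) | (s[i + k] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and out-of-range values. */
    static const uint32_t min_for[4] = { 0, 0x80, 0x800, 0x10000 };
    if (ok && (v < min_for[extra] || v > 0x10FFFF
               || (v >= 0xD800 && v <= 0xDFFF)))
      ok = 0;
    if (!ok) {
      if (!permissive) { result = -2; break; }
      v = (uint32_t)permissive; /* assumed: replace one bad byte */
      extra = 0;
    }
    if (dend >= 0 && j >= dend) { result = -3; break; }
    if (us) us[j] = v; /* us == NULL means measure only */
    j++;
    i += extra + 1;
  }
  if (ipos) *ipos = i; /* first undecoded index */
  return (result < 0) ? result : j - dstart;
}
```

A caller would typically make one pass with us as NULL and dend as -1 to learn the length, allocate that many elements, and then decode again into the new buffer.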
Like scheme_utf8_decode, but returns -1 if the input ends
 in the middle of a UTF-8 encoding even if permissive is
 non-zero.
Added in version 6.0.1.13.
int scheme_utf8_decode_as_prefix(const unsigned char* s, int start, int end,
                                 mzchar* us, int dstart, int dend,
                                 intptr_t* ipos, char utf16, int permissive)
Like scheme_utf8_decode, but the result is always the number
 of the decoded mzchars or shorts. If a decoding error is
 encountered, the result is still the size of the decoding up until
 the error.
Like scheme_utf8_decode, but with fewer arguments. The
 decoding produces UCS-4 mzchars. If the buffer us is
 non-NULL, it is assumed to be long enough to hold the decoding
 (which cannot be longer than the length of the input, though it may
 be shorter). If len is negative, strlen(s) is used
 as the input length.
Like scheme_utf8_decode, but with fewer arguments. The
 decoding produces UCS-4 mzchars. The buffer us
 must be non-NULL, and it is assumed to be long enough to hold the
 decoding (which cannot be longer than the length of the input, though
 it may be shorter). If len is negative, strlen(s)
 is used as the input length.
In addition to the result of scheme_utf8_decode, the result
 can be -1 to indicate that the input ended with a partial
 (valid) encoding. A -1 result is possible even when
 permissive is non-zero.
Like scheme_utf8_decode_all with permissive as 0,
 but if buf is not large enough (as indicated by blen) to
 hold the result, a new buffer is allocated. Unlike other functions,
 this one adds a nul terminator to the decoding result. The function
 result is either buf (if it was big enough) or a buffer
 allocated with scheme_malloc_atomic.
int scheme_utf8_decode_count(const unsigned char* s, int start, int end,
                             int* state, int might_continue, int permissive)
Like scheme_utf8_decode, but without producing the decoded
 mzchars, and always returning the number of decoded
 mzchars up until a decoding error (if any). If
 might_continue is non-zero, then a partial valid encoding at
 the end of the input is not decoded when permissive is also
 non-zero.
If state is non-NULL, it holds information about partial
 encodings; it should be set to zero for an initial call, and then
 passed back to scheme_utf8_decode along with bytes that
 extend the given input (i.e., without any unused partial
 encodings). Typically, this mode makes sense only when
 might_continue and permissive are non-zero.
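The idea of carrying a partial encoding across calls can be sketched in isolation. The function sketch_utf8_count below is hypothetical and much simpler than the real one: it assumes well-formed input, ignores might_continue and permissive, and uses *state only to remember how many continuation bytes are still expected. The actual layout of Racket's state value is not documented here and may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical chunked counter: counts the code points completed in
   s[start..end), carrying an unfinished sequence over in *state.
   Assumes well-formed UTF-8; no error handling is attempted. */
static int sketch_utf8_count(const unsigned char *s, int start, int end,
                             int *state)
{
  assert(s != NULL && end >= start);
  int pending = state ? *state : 0; /* continuation bytes still expected */
  int count = 0;
  for (int i = start; i < end; i++) {
    unsigned char c = s[i];
    if (pending > 0) {
      if (--pending == 0) count++;      /* sequence completed */
    } else if (c < 0x80) {
      count++;                          /* single-byte code point */
    } else if ((c & 0xE0) == 0xC0) {
      pending = 1;
    } else if ((c & 0xF0) == 0xE0) {
      pending = 2;
    } else {
      pending = 3;
    }
  }
  if (state) *state = pending;
  return count;
}
```

With state initialized to zero, feeding the bytes of a string in arbitrary chunks yields per-chunk counts that sum to the count for the whole string.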
int scheme_utf8_encode(const mzchar* us, int start, int end,
                       unsigned char* s, int dstart, char utf16)
Encodes the given UCS-4 array of mzchars (if utf16 is
 zero) or UTF-16 array of shorts (if utf16 is non-zero)
 into s. The end argument must be no less than
 start.
The array s is assumed to be long enough to contain the
 encoding, but no encoding is written if s is NULL. The
 dstart argument indicates a starting place in s to hold
 the encoding. No nul terminator is added to s.
The result is the number of bytes produced for the encoding (or that
 would be produced if s was non-NULL). Encoding never
 fails.
On Windows, when utf16 is non-zero, encoding supports unpaired
 surrogates in the input UTF-16 code-unit sequence, in which case
 encoding generates a natural extension of UTF-8 that encodes unpaired
 surrogates.
This function does not allocate or trigger garbage collection.
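The encoding contract (write nothing when s is NULL, always return the byte count) can be illustrated with a standalone sketch. The function sketch_utf8_encode below is hypothetical, handles only the utf16 = 0 case, uses uint32_t in place of mzchar, and assumes every input value is a valid code point no larger than 0x10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for scheme_utf8_encode with utf16 == 0.
   Writes nothing when s is NULL; always returns the byte count. */
static int sketch_utf8_encode(const uint32_t *us, int start, int end,
                              unsigned char *s, int dstart)
{
  assert(end >= start);
  int j = dstart;
  for (int i = start; i < end; i++) {
    uint32_t v = us[i];
    if (v < 0x80) {
      if (s) s[j] = (unsigned char)v;
      j += 1;
    } else if (v < 0x800) {
      if (s) {
        s[j]     = (unsigned char)(0xC0 | (v >> 6));
        s[j + 1] = (unsigned char)(0x80 | (v & 0x3F));
      }
      j += 2;
    } else if (v < 0x10000) {
      if (s) {
        s[j]     = (unsigned char)(0xE0 | (v >> 12));
        s[j + 1] = (unsigned char)(0x80 | ((v >> 6) & 0x3F));
        s[j + 2] = (unsigned char)(0x80 | (v & 0x3F));
      }
      j += 3;
    } else {
      if (s) {
        s[j]     = (unsigned char)(0xF0 | ((v >> 18) & 0x07));
        s[j + 1] = (unsigned char)(0x80 | ((v >> 12) & 0x3F));
        s[j + 2] = (unsigned char)(0x80 | ((v >> 6) & 0x3F));
        s[j + 3] = (unsigned char)(0x80 | (v & 0x3F));
      }
      j += 4;
    }
  }
  return j - dstart; /* bytes produced, or that would be produced */
}
```

Because no terminator is written, a caller that needs a C string would measure with s as NULL, allocate one extra byte, encode, and then store the nul itself.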
Like scheme_utf8_encode_all, but the length of buf is
 given, and if it is not long enough to hold the encoding, a buffer is
 allocated. A nul terminator is added to the encoded array. The result
 is either buf or an array allocated with
 scheme_malloc_atomic.
Like scheme_utf8_encode_to_buffer, but the length of the
 resulting encoding (not including a nul terminator) is reported in
 rlen if it is non-NULL.
unsigned short* scheme_ucs4_to_utf16(const mzchar* text, int start, int end,
                                     unsigned short* buf, int bufsize,
                                     intptr_t* ulen, int term_size)
Converts a UCS-4 encoding (the indicated range of text) to a
 UTF-16 encoding. The end argument must be no less than
 start.
A result buffer is allocated if buf is not long enough (as
 indicated by bufsize). If ulen is non-NULL, it is
 filled with the length of the UTF-16 encoding. The term_size
 argument indicates a number of shorts to reserve at the end of
 the result buffer for a terminator (but no terminator is actually
 written).
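The reuse-or-allocate behavior and the reserved terminator space can be sketched as follows. The function sketch_ucs4_to_utf16 is hypothetical: it uses malloc where Racket would use its own allocator, uint32_t in place of mzchar, and the exact threshold at which a new buffer is allocated is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for scheme_ucs4_to_utf16. Returns buf when it
   is large enough, otherwise a freshly allocated buffer (or NULL if
   allocation fails). No terminator is written. */
static unsigned short *sketch_ucs4_to_utf16(const uint32_t *text,
                                            int start, int end,
                                            unsigned short *buf, int bufsize,
                                            intptr_t *ulen, int term_size)
{
  assert(end >= start);
  /* First pass: code points above 0xFFFF need two code units. */
  int n = 0;
  for (int i = start; i < end; i++)
    n += (text[i] > 0xFFFF) ? 2 : 1;
  /* Assumed rule: allocate when the units plus reserved space do not fit. */
  if (n + term_size > bufsize) {
    buf = malloc((size_t)(n + term_size) * sizeof(unsigned short));
    if (!buf) return NULL;
  }
  int j = 0;
  for (int i = start; i < end; i++) {
    uint32_t v = text[i];
    if (v > 0xFFFF) {
      v -= 0x10000;
      buf[j++] = (unsigned short)(0xD800 | (v >> 10));   /* high surrogate */
      buf[j++] = (unsigned short)(0xDC00 | (v & 0x3FF)); /* low surrogate */
    } else {
      buf[j++] = (unsigned short)v;
    }
  }
  if (ulen) *ulen = n;
  return buf;
}
```

Comparing the returned pointer with the buffer that was passed in tells the caller whether a new buffer was allocated; with term_size as 1, the caller can then store a terminating zero at index *ulen.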
mzchar* scheme_utf16_to_ucs4(const unsigned short* text, int start, int end,
                             mzchar* buf, int bufsize,
                             intptr_t* ulen, int term_size)
Converts a UTF-16 encoding (the indicated range of text) to a
 UCS-4 encoding. The end argument must be no less than
 start.
A result buffer is allocated if buf is not long enough (as
 indicated by bufsize). If ulen is non-NULL, it is
 filled with the length of the UCS-4 encoding. The term_size
 argument indicates a number of mzchars to reserve at the end of
 the result buffer for a terminator (but no terminator is actually
 written).
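The reverse conversion can be sketched the same way. The function sketch_utf16_to_ucs4 is hypothetical: it uses malloc in place of Racket's allocator and uint32_t in place of mzchar, sizes the buffer by the input length rather than the exact output length, and passes unpaired surrogates through unchanged, which is an assumption about behavior the text above does not specify.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for scheme_utf16_to_ucs4. Returns buf when it
   is large enough, otherwise a freshly allocated buffer (or NULL if
   allocation fails). No terminator is written. */
static uint32_t *sketch_utf16_to_ucs4(const unsigned short *text,
                                      int start, int end,
                                      uint32_t *buf, int bufsize,
                                      intptr_t *ulen, int term_size)
{
  assert(end >= start);
  /* The output never has more elements than the input has code units. */
  int cap = (end - start) + term_size;
  if (cap > bufsize) {
    buf = malloc((size_t)cap * sizeof(uint32_t));
    if (!buf) return NULL;
  }
  int j = 0;
  for (int i = start; i < end; i++) {
    uint32_t v = text[i];
    /* Combine a high surrogate with the low surrogate that follows it. */
    if (v >= 0xD800 && v <= 0xDBFF && i + 1 < end
        && text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDFFF) {
      v = 0x10000 + ((v - 0xD800) << 10) + (uint32_t)(text[i + 1] - 0xDC00);
      i++;
    }
    buf[j++] = v; /* assumed: unpaired surrogates are copied as-is */
  }
  if (ulen) *ulen = j;
  return buf;
}
```

A surrogate pair collapses into a single element, so the reported length can be smaller than end - start.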