13.1.1 Encodings and Locales

When a port is provided to a character-based operation, such as read-char or read, the port’s bytes are read and interpreted as a UTF-8 encoding of characters. Thus, reading a single character may require reading multiple bytes, and a procedure like char-ready? may need to peek several bytes into the stream to determine whether a character is available. In the case of a byte stream that does not correspond to a valid UTF-8 encoding, functions such as read-char may need to peek one byte ahead in the stream to discover that the stream is not a valid encoding.

When an input port produces a sequence of bytes that is not a valid UTF-8 encoding in a character-reading context, then bytes that constitute an invalid sequence are converted to the character #\uFFFD. Specifically, bytes 255 and 254 are always converted to #\uFFFD, bytes in the range 192 to 253 produce #\uFFFD when they are not followed by bytes that form a valid UTF-8 encoding, and bytes in the range 128 to 191 are converted to #\uFFFD when they are not part of a valid encoding that was started by a preceding byte in the range 192 to 253. To put it another way, when reading a sequence of bytes as characters, a minimal set of bytes are changed to the encoding of #\uFFFD so that the entire sequence of bytes is a valid UTF-8 encoding.

See Byte Strings for procedures that facilitate conversions using UTF-8 or other encodings. See also reencode-input-port and reencode-output-port for obtaining a UTF-8-based port from one that uses a different encoding of characters.

A locale captures information about a user’s language-specific interpretation of character sequences. In particular, a locale determines how strings are “alphabetized,” how a lowercase character is converted to an uppercase character, and how strings are compared without regard to case. String operations such as string-ci=? are not sensitive to the current locale, but operations such as string-locale-ci=? (see Strings) produce results consistent with the current locale.

A locale also designates a particular encoding of code-point sequences into byte sequences. Racket generally ignores this aspect of the locale, with a few notable exceptions: command-line arguments passed to Racket as byte strings are converted to character strings using the locale’s encoding; command-line strings passed as byte strings to other processes (through subprocess) are converted to byte strings using the locale’s encoding; environment variables are converted to and from strings using the locale’s encoding; filesystem paths are converted to and from strings (for display purposes) using the locale’s encoding; and, finally, Racket provides functions such as string->bytes/locale to specifically invoke a locale-specific encoding.

A Unix user selects a locale by setting environment variables, such as LC_ALL. On Windows and Mac OS X, the operating system provides other mechanisms for setting the locale. Within Racket, the current locale can be changed by setting the current-locale parameter. The locale name within Racket is a string, and the available locale names depend on the platform and its configuration, but the "" locale means the current user’s default locale; on Windows and Mac OS X, the encoding for "" is always UTF-8, and locale-sensitive operations use the operating system’s native interface. (In particular, setting the LC_ALL and LC_CTYPE environment variables does not affect the locale "" on Mac OS X. Use getenv and current-locale to explicitly install the environment-specified locale, if desired.) Setting the current locale to #f makes locale-sensitive operations locale-insensitive, which means using the Unicode mapping for case operations and using UTF-8 for encoding.

parameter
(current-locale) → (or/c string? #f)
(current-locale locale) → void?
locale : (or/c string? #f)

A parameter that determines the current locale for procedures such as string-locale-ci=?.

When locale sensitivity is disabled by setting the parameter to #f, strings are compared, etc., in a fully portable manner, which is the same as the standard procedures. Otherwise, strings are interpreted according to a locale setting (in the sense of the C library’s setlocale). The "" locale is always an alias for the current machine’s default locale, and it is the default. The "C" locale is also always available; setting the locale to "C" is the same as disabling locale sensitivity with #f only when string operations are restricted to the first 128 characters. Other locale names are platform-specific.

String or character printing with write is not affected by the parameter, and neither are symbol case or regular expressions (see Regular Expressions).

top ← prev up next →

1	Language Model
2	Notation for Documentation
3	Syntactic Forms
4	Datatypes
5	Structures
6	Classes and Objects
7	Units
8	Contracts
9	Pattern Matching
10	Control Flow
11	Concurrency and Parallelism
12	Macros
13	Input and Output
14	Reflection and Security
15	Operating System
16	Memory Management
17	Unsafe Operations
18	Running Racket
	Bibliography
	Index

13.1	Ports
13.2	Byte and String Input
13.3	Byte and String Output
13.4	Reading
13.5	Writing
13.6	Pretty Printing
13.7	Reader Extension
13.8	Printer Extension
13.9	Serialization
13.10	Fast-Load Serialization

13.1.1	Encodings and Locales
13.1.2	Managing Ports
13.1.3	Port Buffers and Positions
13.1.4	Counting Positions, Lines, and Columns
13.1.5	File Ports
13.1.6	String Ports
13.1.7	Pipes
13.1.8	Structures as Ports
13.1.9	Custom Ports
13.1.10	More Port Constructors, Procedures, and Events