ref: 3a5ddf0610add4ea0663c4bf2d7f569ac723ad81
dir: /doc/api/libstd/unicode.txt/
{
        title:  Unicode
        description:  libstd: Unicode
}

Unicode
--------

    pkg std =
            const Badchar       : char
            const Maxcharlen    : size
            const Maxcharval    : char

            /* iterators */
            impl iterable chariter -> char

            const chariter      : (byte[:] -> chariter)

            /* utf8 information */
            const charlen       : (chr : char -> size)
            const encode        : (buf : byte[:], chr : char -> size)
            const decode        : (buf : byte[:] -> char)
            const strstep       : (str : byte[:] -> (char, byte[:]))

            /* character class predicates */
            const isalpha       : (c : char -> bool)
            const isdigit       : (c : char -> bool)
            const isxdigit      : (c : char -> bool)
            const isnum         : (c : char -> bool)
            const isalnum       : (c : char -> bool)
            const isspace       : (c : char -> bool)
            const isblank       : (c : char -> bool)
            const islower       : (c : char -> bool)
            const isupper       : (c : char -> bool)
            const istitle       : (c : char -> bool)

            /* character class conversions */
            const tolower       : (c : char -> char)
            const toupper       : (c : char -> char)
            const totitle       : (c : char -> char)
    ;;

Summary
-------

As a reminder, Myrddin characters hold a single Unicode codepoint, and all
strings are assumed to be encoded in UTF-8 by default. These functions are
designed to facilitate manipulating Unicode strings and codepoints.

The APIs are generally designed so that strings are streamed through one
character at a time, rather than encoded or decoded wholesale.

Constants
---------

    const Badchar : char

This is a character value that is not, and will never be, a valid Unicode
codepoint. This is generally returned when we encounter an error fr

    const Maxcharlen : size

This is a constant defining the maximum number of bytes that a character may
be encoded into. It's guaranteed that a buffer that is at least Maxcharlen
bytes long will be able to contain any encoded character.

    const Maxcharval : char

This is the maximum value that any valid Unicode codepoint, current or
future, may decode into. Any character that is greater than this is an
invalid character.

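As an illustration of the `Maxcharlen` guarantee, the sketch below encodes one
character into a stack buffer of exactly `Maxcharlen` bytes. It relies only on
the `encode` signature listed in the synopsis. The surrounding `main` and the
choice of character are illustrative assumptions, not part of the library.

```myrddin
use std

const main = {
	/* Maxcharlen bytes is enough room for any single encoded character */
	var buf : byte[std.Maxcharlen]
	var n

	/* encode returns the number of bytes written, or -1 on failure */
	n = std.encode(buf[:], 'a')
	if n == -1
		std.put("could not encode character\n")
	else
		std.put("encoded into {} byte(s)\n", n)
	;;
}
```

Sizing the buffer with `Maxcharlen` rather than a literal number means the
code does not need to know in advance how many bytes a given character takes.
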
Functions: Iterating over strings
---------------------------------

    impl iterable chariter -> char

    const chariter : (byte[:] -> chariter)

Chariter returns an iterator which steps through a string character by
character.

Functions: Encoding and Decoding
--------------------------------

    const charlen : (chr : char -> size)

Charlen returns the length in bytes that encoding the provided character as
UTF-8 would take. This can vary between 1 and Maxcharlen bytes.

    const encode : (buf : byte[:], chr : char -> size)

Encode takes a single character and encodes it to a UTF-8 string. The buffer
must be long enough to hold the encoded character.

Returns: The number of bytes written, or -1 if the character could not be
encoded.

    const decode : (buf : byte[:] -> char)

Decode converts the head of the buffer `buf` to a single Unicode codepoint,
returning the codepoint itself, or `Badchar` if the codepoint is invalid.

The tail of the buffer is not considered, allowing this function to be used
to peek at the contents of a string.

    const strstep : (str : byte[:] -> (char, byte[:]))

Strstep is a function for stepping through UTF-8 encoded strings. It returns
the tuple `(Badchar, str[1:])` if the value cannot be decoded, or
`(charval, str[std.charlen(charval):])` otherwise.

```{runmyr striter}
	s = "abcd"
	while s.len != 0
		/* c is the decoded character, s becomes the rest of the string */
		(c, s) = std.strstep(s)
		std.put("next char is {}\n", c)
	;;
```

Character Classes
-----------------

    const isalpha   : (c : char -> bool)
    const isdigit   : (c : char -> bool)
    const isxdigit  : (c : char -> bool)
    const isnum     : (c : char -> bool)
    const isalnum   : (c : char -> bool)
    const isspace   : (c : char -> bool)
    const isblank   : (c : char -> bool)
    const islower   : (c : char -> bool)
    const isupper   : (c : char -> bool)
    const istitle   : (c : char -> bool)
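
The predicates listed above each take a single decoded character, so they
pair naturally with `strstep`. The following is a minimal sketch that counts
the characters in a string for which `isdigit` returns true. The sample string,
the variable names, and the enclosing `main` are illustrative assumptions.
Exactly which codepoints each predicate accepts is determined by the library.

```myrddin
use std

const main = {
	var s
	var c
	var ndigit

	s = "a1b22"
	ndigit = 0
	while s.len != 0
		/* decode one character and advance past it */
		(c, s) = std.strstep(s)
		if std.isdigit(c)
			ndigit++
		;;
	;;
	std.put("{} digit character(s)\n", ndigit)
}
```

Because `strstep` always advances by at least one byte, even on invalid
input, the loop terminates for any byte string.
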