UTF8(7)          FreeBSD Miscellaneous Information Manual          UTF8(7)

NAME
     utf8 - UTF-8 text encoding

DESCRIPTION
     UTF-8 is a multibyte character encoding for Unicode text.  It is the
     preferred format for non-ASCII text.

     Unicode codepoints are encoded as follows:

     U+0000 - U+007F:
             One byte: 0....... (compatible with ASCII)
     U+0080 - U+07FF:
             Two bytes: 110..... 10......
     U+0800 - U+D7FF and U+E000 - U+FFFF:
             Three bytes: 1110.... 10...... 10......
     U+10000 - U+10FFFF:
             Four bytes: 11110... 10...... 10...... 10......

     The bits shown as dots contain the codepoint represented as a binary
     integer.

     Bytes starting with the bit pattern 11...... are called UTF-8 start
     bytes, and those starting with 10...... UTF-8 continuation bytes.  The
     number of leading 1 bits in a start byte indicates the total number of
     bytes used to encode the codepoint, including the start byte.

     Encodings using more bytes than required are invalid.  In particular,
     11000000 and 11000001 are not valid start bytes, the byte after
     11100000 must be at least 10100000, and the byte after 11110000 must be
     at least 10010000.

     The ranges U+D800 to U+DFFF and U+110000 to U+1FFFFF do not contain
     valid Unicode codepoints.  Consequently, the corresponding three- and
     four-byte UTF-8 sequences are invalid.  The highest valid byte after
     11101101 is 10011111, the highest valid byte of the form 1111.... is
     11110100, and the highest valid byte after 11110100 is 10001111.

     To summarize, the following is a complete list of bytes that are
     invalid in all contexts:

     c0-c1       two-byte sequence that has to be encoded as a single byte
     f5-f7       four-byte sequence beyond the Unicode range
     f8-ff       invalid sequence of five or more bytes

     The following is a complete list of invalid two-byte combinations of
     the form 11...... 10...... that consist of two valid bytes:

     e080-e09f   three-byte sequence that has to be encoded as two bytes
     eda0-edbf   start of a UTF-16 surrogate, which is not valid UTF-8
     f080-f08f   four-byte sequence that has to be encoded as three bytes
     f490-f4bf   four-byte sequence beyond the Unicode range

SEE ALSO
     locale(1), ascii(7)

STANDARDS
     F. Yergeau, UTF-8, a transformation format of ISO 10646, RFC 3629,
     November 2003.

     The Unicode Standard: https://www.unicode.org/versions/latest/

     The Unicode Character Database: https://www.unicode.org/reports/tr44/

FreeBSD 14.1-RELEASE-p8        February 18, 2022        FreeBSD 14.1-RELEASE-p8
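
The encoding table and the validity rules in DESCRIPTION can be sketched in a few lines of Python. This is an illustration only and not part of FreeBSD: the function names encode_codepoint and is_valid_utf8 are hypothetical, and the code favors readability over speed.

```python
def encode_codepoint(cp: int) -> bytes:
    """Encode one codepoint using the bit layouts from DESCRIPTION."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a valid Unicode codepoint: {cp:#x}")
    if cp <= 0x7F:      # 0.......
        return bytes([cp])
    if cp <= 0x7FF:     # 110..... 10......
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:    # 1110.... 10...... 10......
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110... 10...... 10...... 10......
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def is_valid_utf8(data: bytes) -> bool:
    """Check a byte string against the validity rules from DESCRIPTION."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:                     # single byte, ASCII
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:             # c0-c1 are never valid
            length = 2
        elif 0xE0 <= b <= 0xEF:
            length = 3
        elif 0xF0 <= b <= 0xF4:           # f5-ff are never valid
            length = 4
        else:                             # stray continuation or invalid byte
            return False
        if i + length > n:                # sequence cut off at end of input
            return False
        tail = data[i + 1:i + length]
        if any((t & 0xC0) != 0x80 for t in tail):
            return False                  # expected 10...... continuation
        second = tail[0]
        if b == 0xE0 and second < 0xA0:   # e080-e09f: overlong
            return False
        if b == 0xED and second > 0x9F:   # eda0-edbf: UTF-16 surrogate
            return False
        if b == 0xF0 and second < 0x90:   # f080-f08f: overlong
            return False
        if b == 0xF4 and second > 0x8F:   # f490-f4bf: beyond U+10FFFF
            return False
        i += length
    return True
```

For example, encode_codepoint(0x20AC) yields the three bytes e2 82 ac, while is_valid_utf8(b"\xc0\x80") is false because c0 is invalid in all contexts. The validator only inspects the start byte and the byte after it for range errors, which mirrors the two lists of invalid bytes and invalid two-byte combinations given above.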
