Characters

2022.06.17

This article is a translated summary based on version 1.2.2 of the official YAML specification.

https://yaml.org/spec/1.2.2/#chapter-5-character-productions

5.1. Character Set

To ensure readability, YAML streams use only the printable subset of the Unicode character set. The allowed character range explicitly excludes the C0 control block x00-x1F (except for TAB x09, LF x0A and CR x0D which are allowed), DEL x7F, the C1 control block x80-x9F (except for NEL x85 which is allowed), the surrogate block xD800-xDFFF, xFFFE and xFFFF.

On input, a YAML processor must accept all characters in this printable subset.

On output, a YAML processor must produce only characters in this printable subset. Characters outside this set must be presented using escape sequences. In addition, any allowed characters known to be non-printable should also be escaped.

Note: This isn’t mandatory since a full implementation would require extensive character property tables.
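The exclusions listed in this section can be sketched as a small membership test. This is a minimal illustration derived only from the ranges named in the prose above; the function names `is_yaml_printable` and `find_non_printable` are hypothetical and not part of any YAML library.

```python
def is_yaml_printable(ch: str) -> bool:
    """Return True if the single character ch is in the printable subset of section 5.1."""
    cp = ord(ch)
    if cp in (0x09, 0x0A, 0x0D, 0x85):  # TAB, LF, CR and NEL are explicitly allowed
        return True
    if cp <= 0x1F:                      # the rest of the C0 control block
        return False
    if 0x7F <= cp <= 0x9F:              # DEL and the rest of the C1 control block
        return False
    if 0xD800 <= cp <= 0xDFFF:          # surrogate block
        return False
    if cp in (0xFFFE, 0xFFFF):          # the two excluded code points
        return False
    return True


def find_non_printable(text: str) -> list[tuple[int, str]]:
    """List (index, code) for every character that would need escaping on output."""
    return [(i, f"x{ord(c):02X}") for i, c in enumerate(text) if not is_yaml_printable(c)]
```

A processor could run a check like `find_non_printable` before emitting a scalar to decide whether it must switch to an escaped, double-quoted form.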

To ensure JSON compatibility, YAML processors must allow all non-C0 characters inside quoted scalars. To ensure readability, non-printable characters should be escaped on output, even inside such scalars.

Note: JSON quoted scalars cannot span multiple lines or contain tabs, but YAML quoted scalars can.

Note: The production name nb-json means “non-break JSON compatible” here.

5.2. Character Encodings

All characters mentioned in this specification are Unicode code points. Each such code point is written as one or more bytes depending on the character encoding used. Note that in UTF-16, characters above xFFFF are written as four bytes, using a surrogate pair.
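To make the UTF-16 remark concrete, the sketch below splits a code point above xFFFF into its two surrogates and compares the result with Python's built-in encoder. The helper name `utf16_surrogate_pair` is an assumption made for this illustration.

```python
def utf16_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a code point above xFFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above xFFFF need a surrogate pair")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


high, low = utf16_surrogate_pair(0x1F600)
# Two 16-bit units, written big-endian, make up the four bytes.
manual_bytes = high.to_bytes(2, "big") + low.to_bytes(2, "big")
builtin_bytes = "\U0001F600".encode("utf-16-be")
```

Both `manual_bytes` and `builtin_bytes` are four bytes long, which is the "four bytes, using a surrogate pair" case described above.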

The character encoding is a presentation detail and must not be used to convey content information.

On input, a YAML processor must support the UTF-8 and UTF-16 character encodings. For JSON compatibility, the UTF-32 encodings must also be supported.

If a character stream begins with a byte order mark, the character encoding will be taken to be as indicated by the byte order mark. Otherwise, the stream must begin with an ASCII character. This allows the encoding to be deduced by the pattern of null (x00) characters.

Byte order marks may appear at the start of any document; however, all documents in the same stream must use the same character encoding.

To allow for JSON compatibility, byte order marks are also allowed inside quoted scalars. For readability, such content byte order marks should be escaped on output.

The encoding can therefore be deduced by matching the first few bytes of the stream with the following table rows (in order):

Case                   Byte0  Byte1  Byte2  Byte3  Encoding
Explicit BOM           x00    x00    xFE    xFF    UTF-32BE
ASCII first character  x00    x00    x00    any    UTF-32BE
Explicit BOM           xFF    xFE    x00    x00    UTF-32LE
ASCII first character  any    x00    x00    x00    UTF-32LE
Explicit BOM           xFE    xFF                  UTF-16BE
ASCII first character  x00    any                  UTF-16BE
Explicit BOM           xFF    xFE                  UTF-16LE
ASCII first character  any    x00                  UTF-16LE
Explicit BOM           xEF    xBB    xBF           UTF-8
Default                                            UTF-8
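The table can be read as an ordered series of prefix checks. The following is a minimal sketch under that reading; `detect_yaml_encoding` is a hypothetical name, and the handling of streams shorter than four bytes is an assumption, since the table does not address it.

```python
def detect_yaml_encoding(data: bytes) -> str:
    """Guess a YAML stream's encoding from its first bytes, testing the table rows in order."""
    b = data[:4]
    if b == b"\x00\x00\xfe\xff":                    # explicit BOM
        return "UTF-32BE"
    if len(b) == 4 and b[:3] == b"\x00\x00\x00":    # ASCII first character
        return "UTF-32BE"
    if b == b"\xff\xfe\x00\x00":                    # explicit BOM
        return "UTF-32LE"
    if len(b) == 4 and b[1:] == b"\x00\x00\x00":    # ASCII first character
        return "UTF-32LE"
    if b[:2] == b"\xfe\xff":                        # explicit BOM
        return "UTF-16BE"
    if len(b) >= 2 and b[0] == 0x00:                # ASCII first character
        return "UTF-16BE"
    if b[:2] == b"\xff\xfe":                        # explicit BOM
        return "UTF-16LE"
    if len(b) >= 2 and b[1] == 0x00:                # ASCII first character
        return "UTF-16LE"
    # An explicit UTF-8 BOM (xEF xBB xBF) and the default row give the same answer.
    return "UTF-8"
```

The order matters: the UTF-32LE BOM (xFF xFE x00 x00) begins with the UTF-16LE BOM (xFF xFE), so the four-byte rows have to be tried first.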

The recommended output encoding is UTF-8. If another encoding is used, it is recommended that an explicit byte order mark be used, even if the first stream character is ASCII.

For more information about the byte order mark and the Unicode character encoding schemes see the Unicode FAQ.

In the examples, byte order mark characters are displayed as “⇔”.

Example 5.1 Byte Order Mark

    ⇔# Comment only.

Result:

    # This stream contains no
    # documents, only comments.

Example 5.2 Invalid Byte Order Mark

    - Invalid use of BOM
    ⇔
    - Inside a document.

ERROR: A BOM must not appear inside a document.

5.3. Indicator Characters

Indicators are characters that have special semantics.

“-” (x2D, hyphen) denotes a block sequence entry.

“?” (x3F, question mark) denotes a mapping key.

“:” (x3A, colon) denotes a mapping value.

Example 5.3 Block Structure Indicators

    sequence:
    - one
    - two
    mapping:
      ? sky
      : blue
      sea : green

Result:

    { "sequence": [ "one", "two" ], "mapping": { "sky": "blue", "sea": "green" } }

“,” (x2C, comma) ends a flow collection entry.

“[” (x5B, left bracket) starts a flow sequence.

“]” (x5D, right bracket) ends a flow sequence.

“{” (x7B, left brace) starts a flow mapping.

“}” (x7D, right brace) ends a flow mapping.

Example 5.4 Flow Collection Indicators

    sequence: [ one, two, ]
    mapping: { sky: blue, sea: green }

Result:

    { "sequence": [ "one", "two" ], "mapping": { "sky": "blue", "sea": "green" } }

“#” (x23, octothorpe, hash, sharp, pound, number sign) denotes a comment.

Example 5.5 Comment Indicator

    # Comment only.

Result:

    # This stream contains no
    # documents, only comments.

“&” (x26, ampersand) denotes a node’s anchor property.

“*” (x2A, asterisk) denotes an alias node.

The “!” (x21, exclamation) is used for specifying node tags. It is used to denote tag handles used in tag directives and tag properties; to denote local tags; and as the non-specific tag for non-plain scalars.

Example 5.6 Node Property Indicators

    anchored: !local &anchor value
    alias: *anchor

Result:

    { "anchored": !local &A1 "value", "alias": *A1 }

“|” (x7C, vertical bar) denotes a literal block scalar.

“>” (x3E, greater than) denotes a folded block scalar.

Example 5.7 Block Scalar Indicators

    literal: |
      some
      text
    folded: >
      some
      text

Result:

    { "literal": "some\ntext\n", "folded": "some text\n" }

“'” (x27, apostrophe, single quote) surrounds a single-quoted flow scalar.

“"” (x22, double quote) surrounds a double-quoted flow scalar.

Example 5.8 Quoted Scalar Indicators

    single: 'text'
    double: "text"

Result:

    { "single": "text", "double": "text" }

“%” (x25, percent) denotes a directive line.

Example 5.9 Directive Indicator

    %YAML 1.2
    --- text

Result:

    "text"

The “@” (x40, at) and “`” (x60, grave accent) are reserved for future use.

Example 5.10 Invalid use of Reserved Indicators

    commercial-at: @text
    grave-accent: `text

ERROR: Reserved indicators can't start a plain scalar.

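Any indicator character is one of the characters introduced above: “-”, “?”, “:”, “,”, “[”, “]”, “{”, “}”, “#”, “&”, “*”, “!”, “|”, “>”, “'”, “"”, “%”, “@” and “`”.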

The “[”, “]”, “{”, “}” and “,” indicators denote structure in flow collections. They are therefore forbidden in some cases, to avoid ambiguity in several constructs. This is handled on a case-by-case basis by the relevant productions.

5.4. Line Break Characters

YAML recognizes the following ASCII line break characters: line feed (x0A) and carriage return (x0D).

All other characters, including the form feed (x0C), are considered to be non-break characters. Note that these include the non-ASCII line breaks: next line (x85), line separator (x2028) and paragraph separator (x2029).

YAML version 1.1 did support the above non-ASCII line break characters; however, JSON does not. Hence, to ensure JSON compatibility, YAML treats them as non-break characters as of version 1.2. YAML 1.2 processors parsing a version 1.1 document should therefore treat these line breaks as non-break characters, with an appropriate warning.

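Line breaks are interpreted differently by different systems and have multiple widely used formats: a carriage return followed by a line feed (DOS, Windows), a carriage return alone (classic Mac OS) or a line feed alone (UNIX, macOS).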

Line breaks inside scalar content must be normalized by the YAML processor. Each such line break must be parsed into a single line feed character. The original line break format is a presentation detail and must not be used to convey content information.
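The normalization rule can be sketched with a single regular expression. This is an illustration only, assuming the scalar content is already available as a string; `normalize_line_breaks` is a hypothetical name. Python's `str.splitlines` is deliberately avoided because it also splits on x85, x2028 and x2029, which YAML 1.2 treats as non-break characters.

```python
import re

# CR LF must be tried before a lone CR so that the pair becomes one line feed.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def normalize_line_breaks(scalar_text: str) -> str:
    """Replace every CR LF, CR or LF in scalar content with a single line feed."""
    return _LINE_BREAK.sub("\n", scalar_text)
```

After this step the original line break format is no longer recoverable, which matches the requirement that it must not convey content information.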

Outside scalar content, YAML allows any line break to be used to terminate lines.

On output, a YAML processor is free to emit line breaks using whatever convention is most appropriate.

In the examples, line breaks are sometimes displayed using the “↓” glyph for clarity.

Example 5.11 Line Break Characters

    |
      Line break (no glyph)
      Line break (glyphed)↓

Result:

    "Line break (no glyph)\nLine break (glyphed)\n"

5.5. White Space Characters

YAML recognizes two white space characters: space and tab.

The rest of the (printable) non-break characters are considered to be non-space characters.

In the examples, tab characters are displayed as the glyph “→”. Space characters are sometimes displayed as the glyph “·” for clarity.

Example 5.12 Tabs and Spaces

    # Tabs and spaces
    quoted:·"Quoted →"
    block:→|
    ··void main() {
    ··→printf("Hello, world!\n");
    ··}

Result:

    { "quoted": "Quoted \t", "block": "void main() {\n\tprintf(\"Hello, world!\\n\");\n}\n" }

5.6. Miscellaneous Characters

The YAML syntax productions make use of the following additional character classes:

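A decimal digit for numbers: “0” to “9” (x30-x39).

A hexadecimal digit for escape sequences: a decimal digit, “A” to “F” (x41-x46) or “a” to “f” (x61-x66).

ASCII letter (alphabetic) characters: “A” to “Z” (x41-x5A) and “a” to “z” (x61-x7A).

Word (alphanumeric) characters for identifiers: a decimal digit, an ASCII letter or “-” (x2D).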

URI characters for tags, as defined in the URI specification.

By convention, any URI characters other than the allowed printable ASCII characters are first encoded in UTF-8 and then each byte is escaped using the “%” character. The YAML processor must not expand such escaped characters. Tag characters must be preserved and compared exactly as presented in the YAML stream, without any processing.
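The convention of encoding in UTF-8 and then escaping each byte with “%” is what Python's standard `urllib.parse.quote` does. The sketch below assumes the input is raw, not yet escaped text; the function names and the choice of `safe` characters are assumptions for this example, not the full set of URI characters.

```python
from urllib.parse import quote


def escape_tag_uri_chars(text: str) -> str:
    """Percent-escape the UTF-8 bytes of characters that are not plain ASCII URI characters."""
    # The safe set is deliberately small; it is an illustrative choice.
    return quote(text, safe=":/,")


def tags_equal(tag_a: str, tag_b: str) -> bool:
    """Compare tags exactly as presented, without expanding any % escapes."""
    return tag_a == tag_b


escaped = escape_tag_uri_chars("tag:example.com,2000:café")
```

Because a processor must not expand the escapes, `tags_equal` treats the escaped and unescaped spellings of the same URI as different tags.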

The “!” character is used to indicate the end of a named tag handle; hence its use in tag shorthands is restricted. In addition, such shorthands must not contain the “[”, “]”, “{”, “}” and “,” characters. These characters would cause ambiguity with flow collection structures.

5.7. Escaped Characters

All non-printable characters must be escaped. YAML escape sequences use the “\” notation common to most modern computer languages. Each escape sequence must be parsed into the appropriate Unicode character. The original escape sequence is a presentation detail and must not be used to convey content information.

Note that escape sequences are only interpreted in double-quoted scalars. In all other scalar styles, the “\” character has no special meaning and non-printable characters are not available.

YAML escape sequences are a superset of C’s escape sequences:

“\0”: Escaped ASCII null (x00) character.

“\a”: Escaped ASCII bell (x07) character.

“\b”: Escaped ASCII backspace (x08) character.

“\t”: Escaped ASCII horizontal tab (x09) character. This is useful at the start or the end of a line to force a leading or trailing tab to become part of the content.

“\n”: Escaped ASCII line feed (x0A) character.

“\v”: Escaped ASCII vertical tab (x0B) character.

“\f”: Escaped ASCII form feed (x0C) character.

“\r”: Escaped ASCII carriage return (x0D) character.

“\e”: Escaped ASCII escape (x1B) character.

“\ ” (backslash followed by a space): Escaped ASCII space (x20) character. This is useful at the start or the end of a line to force a leading or trailing space to become part of the content.

“\"”: Escaped ASCII double quote (x22).

“\/”: Escaped ASCII slash (x2F), for JSON compatibility.

“\\”: Escaped ASCII back slash (x5C).

“\N”: Escaped Unicode next line (x85) character.

“\_”: Escaped Unicode non-breaking space (xA0) character.

“\L”: Escaped Unicode line separator (x2028) character.

“\P”: Escaped Unicode paragraph separator (x2029) character.

“\x” followed by two hexadecimal digits: Escaped 8-bit Unicode character.

“\u” followed by four hexadecimal digits: Escaped 16-bit Unicode character.

“\U” followed by eight hexadecimal digits: Escaped 32-bit Unicode character.

Any escaped character consists of “\” followed by one of the escape codes listed above.

Example 5.13 Escaped Characters

    - "Fun with \\"
    - "\" \a \b \e \f"
    - "\n \r \t \v \0"
    - "\  \_ \N \L \P \
      \x41 \u0041 \U00000041"

Result:

    [ "Fun with \\", "\" \u0007 \b \u001b \f", "\n \r \t \u000b \u0000", "\u0020 \u00a0 \u0085 \u2028 \u2029 A A A" ]

Example 5.14 Invalid Escaped Characters

    Bad escapes:
      "\c
      \xq-"

ERROR:
- c is an invalid escaped character.
- q and - are invalid hex digits.
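The escape rules of this section can be sketched as a small decoder for the body of a double-quoted scalar. This is a simplified illustration, not a complete YAML scanner: it ignores escaped line breaks and line folding, and the names `SIMPLE_ESCAPES`, `HEX_ESCAPES` and `unescape_double_quoted` are hypothetical.

```python
# Single-character escape codes and the characters they stand for.
SIMPLE_ESCAPES = {
    "0": "\x00", "a": "\x07", "b": "\x08", "t": "\x09", "n": "\x0A",
    "v": "\x0B", "f": "\x0C", "r": "\x0D", "e": "\x1B", " ": "\x20",
    '"': '"', "/": "/", "\\": "\\",
    "N": "\x85", "_": "\xA0", "L": "\u2028", "P": "\u2029",
}
# Escape codes that are followed by a fixed number of hexadecimal digits.
HEX_ESCAPES = {"x": 2, "u": 4, "U": 8}
HEX_DIGITS = set("0123456789abcdefABCDEF")


def unescape_double_quoted(body: str) -> str:
    """Decode backslash escapes in the body of a double-quoted scalar."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash at end of input")
        code = body[i + 1]
        if code in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[code])
            i += 2
        elif code in HEX_ESCAPES:
            width = HEX_ESCAPES[code]
            digits = body[i + 2 : i + 2 + width]
            if len(digits) != width or any(d not in HEX_DIGITS for d in digits):
                raise ValueError(f"invalid hex digits after \\{code}: {digits!r}")
            value = int(digits, 16)
            if value > 0x10FFFF:
                raise ValueError(f"code point out of range: {digits}")
            out.append(chr(value))
            i += 2 + width
        else:
            raise ValueError(f"invalid escaped character: {code!r}")
    return "".join(out)
```

Feeding it the two bad inputs from Example 5.14 raises `ValueError` in both cases: `\c` is not a known escape code, and `\xq-` does not supply two hexadecimal digits.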