It started with curiosity. We often see web pages that begin like this:

<html lang="zh-CN">
...

So what is this actually meant to express? Where does the attribute value come from? And how exactly is its format defined?

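With these questions in mind, my first instinct was to check Zhihu. With so many experts there, I figured someone must have posted a complete answer. What I found was nothing like what I expected and left me thoroughly confused. If you are curious, have a look at the Zhihu question "网页头部的声明应该是用 lang="zh" 还是 lang="zh-cn"?" (Should the declaration at the top of a web page use lang="zh" or lang="zh-cn"?).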

So I went and read the official HTML5 documentation myself, which has this description:

The lang global attribute participates in defining the language of the element, the language that its non-editable elements are written in or the language that the editable elements should be written in. The tag contains one single entry value in the format defined in the Tags for Identifying Languages (BCP47) IETF document. If the tag content is the empty string the language is set to unknown; if the tag content is not valid, regarding BCP47, it is set to invalid.

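As you can see, this attribute defines the language of an element's content, whether editable or not, and the format of its value is defined in BCP47. But it doesn't say what the allowed values actually are, so I still didn't have my answer and had to move on to BCP47 itself.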

BCP47 gives the following syntax definition, written in ABNF [RFC5234]:

 Language-Tag  = langtag              ; normal language tags
               / privateuse          ; private use tag
               / grandfathered       ; grandfathered tags

 langtag       = language
                 ["-" script]
                 ["-" region]
                 *("-" variant)
                 *("-" extension)
                 ["-" privateuse]

 language      = 2*3ALPHA            ; shortest ISO 639 code
                 ["-" extlang]       ; sometimes followed by
                                     ; extended language subtags
               / 4ALPHA              ; or reserved for future use
               / 5*8ALPHA            ; or registered language subtag

 extlang       = 3ALPHA              ; selected ISO 639 codes
                 *2("-" 3ALPHA)      ; permanently reserved

 script        = 4ALPHA              ; ISO 15924 code

 region        = 2ALPHA              ; ISO 3166-1 code
               / 3DIGIT              ; UN M.49 code

 variant       = 5*8alphanum         ; registered variants
               / (DIGIT 3alphanum)

 extension     = singleton 1*("-" (2*8alphanum))

                                     ; Single alphanumerics
                                     ; "x" reserved for private use
 singleton     = DIGIT               ; 0 - 9
               / %x41-57             ; A - W
               / %x59-5A             ; Y - Z
               / %x61-77             ; a - w
               / %x79-7A             ; y - z

 privateuse    = "x" 1*("-" (1*8alphanum))

 grandfathered = irregular           ; non-redundant tags registered
               / regular             ; during the RFC 3066 era

 irregular     = "en-GB-oed"         ; irregular tags do not match
               / "i-ami"             ; the 'langtag' production and
               / "i-bnn"             ; would not otherwise be
               / "i-default"         ; considered 'well-formed'
               / "i-enochian"        ; These tags are all valid,
               / "i-hak"             ; but most are deprecated
               / "i-klingon"         ; in favor of more modern
               / "i-lux"             ; subtags or subtag
               / "i-mingo"           ; combination
               / "i-navajo"
               / "i-pwn"
               / "i-tao"
               / "i-tay"
               / "i-tsu"
               / "sgn-BE-FR"
               / "sgn-BE-NL"
               / "sgn-CH-DE"

 regular       = "art-lojban"        ; these tags match the 'langtag'
               / "cel-gaulish"       ; production, but their subtags
               / "no-bok"            ; are not extended language
               / "no-nyn"            ; or variant subtags: their meaning
               / "zh-guoyu"          ; is defined by their registration
               / "zh-hakka"          ; and all of these are deprecated
               / "zh-min"            ; in favor of a more modern
               / "zh-min-nan"        ; subtag or sequence of subtags
               / "zh-xiang"

 alphanum      = (ALPHA / DIGIT)     ; letters and numbers

Let's walk through the grammar step by step:

Language-Tag
langtag               ; normal language tags
/ privateuse          ; private use tag
/ grandfathered       ; grandfathered tags

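This rule says a Language-Tag takes one of three forms. Of the latter two, privateuse is for private use, and grandfathered covers tags registered during the RFC 3066 era: as the grammar comments note, they are all still valid, but most or all of them are deprecated in favor of modern subtags, and the irregular ones do not even match the langtag production. Only langtag is the normal language tag, so langtag is the only definition we need to care about.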
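To make the three alternatives concrete, here is a minimal Python sketch that decides which branch of Language-Tag a string falls under. The function name `classify` is my own, and the sketch does not check that a langtag is well-formed; it only rules out the other two branches. Grandfathered tags have to be tested first, because the "regular" ones would otherwise also match the langtag production.

```python
import re

# Tags copied from the 'irregular' and 'regular' productions above, lowercased.
GRANDFATHERED = {
    "en-gb-oed", "i-ami", "i-bnn", "i-default", "i-enochian", "i-hak",
    "i-klingon", "i-lux", "i-mingo", "i-navajo", "i-pwn", "i-tao",
    "i-tay", "i-tsu", "sgn-be-fr", "sgn-be-nl", "sgn-ch-de",
    "art-lojban", "cel-gaulish", "no-bok", "no-nyn", "zh-guoyu",
    "zh-hakka", "zh-min", "zh-min-nan", "zh-xiang",
}

# privateuse = "x" 1*("-" (1*8alphanum))
PRIVATEUSE_RE = re.compile(r"x(-[a-z0-9]{1,8})+")


def classify(tag: str) -> str:
    """Return which Language-Tag alternative a tag belongs to (hypothetical helper)."""
    lowered = tag.lower()  # case is not significant in language tags
    if lowered in GRANDFATHERED:
        return "grandfathered"
    if PRIVATEUSE_RE.fullmatch(lowered):
        return "privateuse"
    # Everything else is a langtag candidate; well-formedness is NOT checked here.
    return "langtag"


print(classify("zh-CN"), classify("x-whatever"), classify("i-klingon"))
```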

langtag
language
["-" script]
["-" region]
*("-" variant)
*("-" extension)
["-" privateuse]
  • A langtag consists of six parts, of which the last five are optional (variant and extension may also repeat).
language
2*3ALPHA            ; shortest ISO 639 code
["-" extlang]       ; sometimes followed by
                    ; extended language subtags
/ 4ALPHA            ; or reserved for future use
/ 5*8ALPHA          ; or registered language subtag

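Here we finally see the first definition of actual content: the shortest ISO 639 code, 2 to 3 letters long, optionally followed by extended language subtags. Note the word "shortest": the standard picks the two-letter ISO 639-1 code when one exists, and falls back to a three-letter ISO 639-2 or ISO 639-3 code only when it does not.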
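The "shortest code" rule can be sketched in a few lines of Python. The table below is a tiny hand-picked sample for illustration, not the real ISO 639 data, and the names `ISO_639_SAMPLE` and `shortest_code` are hypothetical.

```python
# Hypothetical sample: language name -> (ISO 639-1 code or None, three-letter code).
ISO_639_SAMPLE = {
    "Chinese": ("zh", "zho"),
    "English": ("en", "eng"),
    "Cantonese": (None, "yue"),  # no two-letter code exists
    "Hawaiian": (None, "haw"),   # no two-letter code exists
}


def shortest_code(name: str) -> str:
    """Prefer the two-letter code; fall back to the three-letter one."""
    alpha2, alpha3 = ISO_639_SAMPLE[name]
    return alpha2 if alpha2 is not None else alpha3


for name in ISO_639_SAMPLE:
    print(name, "->", shortest_code(name))
```

This is why Chinese is written as zh rather than zho in a language tag, while a language without a two-letter code uses its three-letter code directly.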

extlang
3ALPHA              ; selected ISO 639 codes
*2("-" 3ALPHA)      ; permanently reserved

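An extended language subtag is a "selected ISO 639 code" and is exactly 3 letters long, which means it comes from the three-letter code sets, ISO 639-2 or ISO 639-3. The grammar marks the two additional 3ALPHA slots as permanently reserved, so in practice at most one extlang appears.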

script
 4ALPHA              ; ISO 15924 code

This one needs little explanation: it is a four-letter ISO 15924 script code, such as Hans or Cyrl.

region
2ALPHA              ; ISO 3166-1 code
/ 3DIGIT            ; UN M.49 code

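The region subtag is either a two-letter ISO 3166-1 code or a three-digit UN M.49 code. For the remaining three parts, variant, extension, and privateuse, the grammar defines only the format and does not point to a standard that lists their content (the comment on variant just says they are registered).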
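Putting the pieces together, the langtag production can be transcribed into a regular expression. This is only a sketch of a well-formedness check under my own reading of the grammar: the names `LANGTAG_RE` and `parse_langtag` are hypothetical, and nothing here verifies that the subtags are actually registered, so a tag can pass this check and still not be valid.

```python
import re

LANGTAG_RE = re.compile(
    r"""
    (?P<language>[a-z]{2,3}(?:-[a-z]{3}){0,3}   # 2-3 letters plus up to 3 extlang
                |[a-z]{4,8})                    # or 4 letters, or 5-8 letters
    (?:-(?P<script>[a-z]{4}))?                  # optional script
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?         # optional region
    (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)   # zero or more variants
    (?P<extensions>(?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*)     # zero or more extensions
    (?:-(?P<privateuse>x(?:-[a-z0-9]{1,8})+))?  # optional private use
    """,
    re.VERBOSE | re.IGNORECASE,
)


def parse_langtag(tag: str):
    """Split a well-formed langtag into its parts, or return None."""
    m = LANGTAG_RE.fullmatch(tag)
    if m is None:
        return None
    language, *extlang = m.group("language").split("-")
    return {
        "language": language,
        "extlang": extlang,
        "script": m.group("script"),
        "region": m.group("region"),
        "variants": [v for v in m.group("variants").split("-") if v],
        "extensions": m.group("extensions").lstrip("-") or None,
        "privateuse": m.group("privateuse"),
    }


for tag in ("zh", "zh-CN", "zh-Hans-CN", "zh_CN"):
    print(tag, "->", parse_langtag(tag))
```

The extensions are returned as one raw string to keep the sketch short; splitting them per singleton would take a little more code.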

A note on letter case in tags
   These conventions include:

   o  [ISO639-1] recommends that language codes be written in lowercase
      ('mn' Mongolian).

   o  [ISO15924] recommends that script codes use lowercase with the
      initial letter capitalized ('Cyrl' Cyrillic).

   o  [ISO3166-1] recommends that country codes be capitalized ('MN'
      Mongolia).

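This passage gives the recommended formatting: ISO 639-1 language codes in lowercase, ISO 15924 script codes with only the first letter capitalized, and ISO 3166-1 country codes in all uppercase. These are recommendations, not requirements: BCP47 treats tags as case-insensitive.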
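The conventions can be applied mechanically. The sketch below follows my reading of the formatting rule in BCP47: everything is lowercase, except that two-letter subtags become uppercase and four-letter subtags become titlecase when they are neither at the start of the tag nor after a singleton. The function name `format_case` is hypothetical.

```python
def format_case(tag: str) -> str:
    """Apply the recommended case conventions to a language tag."""
    out = []
    seen_singleton = False
    for i, sub in enumerate(tag.lower().split("-")):
        if len(sub) == 1:
            # Everything after a singleton (extension or private use) stays lowercase.
            seen_singleton = True
        elif i > 0 and not seen_singleton:
            if len(sub) == 2:
                sub = sub.upper()       # region, e.g. CN
            elif len(sub) == 4:
                sub = sub.capitalize()  # script, e.g. Hans
        out.append(sub)
    return "-".join(out)


for tag in ("ZH-hans-cn", "mn-cyrl-mn", "EN-ca-X-CA"):
    print(tag, "->", format_case(tag))
```

The position check matters: in en-CA-x-ca the final ca follows the singleton x, so it stays lowercase even though it is two letters long.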

So we can now draw the conclusion:

zh, zh-CN, and zh-Hans-CN all fully conform to the specification.

About ISO 639

ISO 639 is a set of standards by the International Organization for Standardization that is concerned with representation of names for languages and language groups.

ISO 639 is a set of standards; its parts are listed in the table below.

Current and historical parts of the standard
| Standard | Name (Codes for the representation of names of languages) |
| --- | --- |
| ISO 639-1 | Part 1: Alpha-2 code |
| ISO 639-2 | Part 2: Alpha-3 code |
| ISO 639-3 | Part 3: Alpha-3 code for comprehensive coverage of languages |
| ISO 639-4 | Part 4: Implementation guidelines and general principles for language coding |
| ISO 639-5 | Part 5: Alpha-3 code for language families and groups |
| ISO 639-6 | Part 6: Alpha-4 representation for comprehensive coverage of language variants |

Afterword

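BCP47 is a document of more than 80 pages, so naturally I won't explain all of it here. If you are interested, I still recommend reading the whole thing; it is written very clearly and carefully. It uses ABNF notation throughout, so if you don't understand ABNF you will struggle to follow the syntax rules. I can't understand the people arguing on Zhihu: the standard is right there, and anyone who reads it through carefully should find no ambiguity at all.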