66

I have googled (well, DuckDuckGo'ed, actually) till I'm blue in the face, but cannot find a list of language codes of the type en-GB or fr-CA anywhere.

There are excellent resources about the components, in particular the W3C I18n page, but I was hoping for a simple alphabetical listing, fairly canonical if possible (something like this one). Cannot find.

Can anyone point me in the right direction? Many thanks!

Dɑvïd
  • The link you provided *is* the official registry. What is missing from that document? – Jörg W Mittag Nov 07 '12 at 11:50
  • @jorg-w-mittag - it might be naive, but I was hoping for a fairly full listing of the common combinations of that type, not simply the isolated sub-tags. – Dɑvïd Nov 07 '12 at 21:35
  • 4
    The most accurate list that I found is [this](https://msdn.microsoft.com/en-us/library/ee825488%28v=cs.20%29.aspx?f=255&MSPPError=-2147217396) – Sergio Flores Jul 04 '16 at 16:59
  • 1
    @byoigres - Helpful list -- I've saved it in the [WaybackMachine](https://web.archive.org/web/20160705080418/https://msdn.microsoft.com/en-us/library/ee825488%28v=cs.20%29.aspx?f=255&MSPPError=-2147217396) for safe keeping. ;) – Dɑvïd Jul 05 '16 at 08:05
  • This keeps happening. Someday I want to bump into you _in person_. – Caleb Sep 27 '19 at 05:35
  • Ditto, @Caleb, ditto. ;) – Dɑvïd Sep 27 '19 at 11:22

8 Answers

62

There are several language code systems and several region code systems, as well as their combinations. As you refer to a W3C page, I presume that you are referring to the system defined in BCP 47. That system is orthogonal in the sense that codes like en-GB and fr-CA simply combine a language code and a region code. This means a very large number of possible combinations, most of which make little sense, like ab-AX, which means Abkhaz as spoken in Åland (I don’t think anyone, still less any community, speaks Abkhaz there, though it is theoretically possible of course).

So any list of language-region combinations would be just a pragmatic list of combinations that are important in some sense, or supported by some software in some special sense.
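
To see why an exhaustive list is impractical, here is a minimal sketch of how the combinations multiply. The two short lists are tiny illustrative subsets chosen for this example, not the real ISO 639 or ISO 3166 code lists.

```python
from itertools import product

# Tiny illustrative subsets (assumption: picked by hand for this sketch).
# The real ISO 639 and ISO 3166 lists are far longer.
languages = ["en", "fr", "ab"]
regions = ["GB", "CA", "AX"]

# Every language subtag can be paired with every region subtag,
# whether or not anyone actually speaks that language there.
tags = [f"{lang}-{region}" for lang, region in product(languages, regions)]

print(len(tags))  # 9 combinations from 3 x 3 subtags
print(tags)
```

Even with three codes on each side, the result already contains syntactically valid but implausible tags such as `ab-AX`.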

The specifications that you have found define the general principles and also the authoritative sources on different “subtags” (like primary language code and region code). For the most important parts, the official registration authority maintains the three- and two-letter ISO 639 codes for languages, and the ISO site contains the two-letter ISO 3166 codes for regions. The lists are quite readable, and I see no reason to consider using other than these primary resources, especially regarding possible changes.

BenMorel
Jukka K. Korpela
  • 3
    Thanks for the full explanation: you clearly understood my question, and explained why I (probably) won't find the answer I was hoping for! That itself is good to know. – Dɑvïd Nov 07 '12 at 21:37
  • It would be really great to have a canonical list of combinations that make sense for those of us who don't even know what Abkhaz and Åland are. Too bad this doesn't exist. – melinath Oct 21 '15 at 18:44
  • They may be readable, but they are also quite insufficient for many language tagging needs. Personally I'm hoping that the language listing in development at glottolog.org becomes a new standard… –  Dec 28 '15 at 20:36
  • 7
    Looks like the first link is broken (I can't connect to http://www.inter-locale.com) – cat Dec 05 '16 at 19:13
  • 1
    FWIW, the IANA registry that the OP mentioned is part of BCP 47 ([section 3](https://tools.ietf.org/html/bcp47#section-3)). Also, WRT "I see no reason to consider using other …," you may need to express writing scripts (ISO 15924) for traditional vs simplified Chinese (`cmn-Hant` vs `cmn-Hans`) or Latin vs Cyrillic in Serbian (`sr-Latn` vs `sr-Cyrl`), or you may want to refer to Spanish common to all of Latin America (`es-419`) which relies on UN M.49 codes. – Jon Wolski Feb 14 '19 at 18:05
  • Note: The three letter ISO639-2 codes (as maintained by loc.gov, second link above) are not used in BCP 47. BCP 47 uses ISO639-3 codes for its 3 letter codes; this registry is maintained at https://iso639-3.sil.org/. – Marc Durdin Apr 22 '19 at 08:36
15

There are two components in play here:

  1. The language subtag, which is generally defined by ISO 639-1 alpha-2
  2. The region subtag, which is generally defined by ISO 3166-1 alpha-2

You can mix and match languages and regions in whichever combination makes sense to you, so there is no list of all possibilities.

BTW, you're effectively using a BCP 47 tag; BCP 47 is the standard that defines each segment of the tag.
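
As a rough illustration of the two components, the sketch below splits a simple tag into its language and region parts. The function name `split_language_region` is hypothetical, and it deliberately handles only the plain `language-region` shape with a two-letter region; full BCP 47 tags can carry further subtags (scripts, numeric regions, variants) that this does not attempt to parse.

```python
def split_language_region(tag: str) -> tuple[str, str]:
    """Split a simple tag such as 'en-GB' into (language, region).

    Hypothetical helper for illustration; it is not a full BCP 47 parser.
    """
    parts = tag.split("-")
    if len(parts) != 2:
        raise ValueError(f"expected exactly one hyphen in {tag!r}")
    language, region = parts
    # Language subtag: two or three ASCII letters.
    if not (language.isascii() and language.isalpha() and len(language) in (2, 3)):
        raise ValueError(f"unexpected language subtag in {tag!r}")
    # Region subtag: two ASCII letters (numeric regions are out of scope here).
    if not (region.isascii() and region.isalpha() and len(region) == 2):
        raise ValueError(f"unexpected region subtag in {tag!r}")
    # Tags are case-insensitive; this is the conventional casing.
    return language.lower(), region.upper()


print(split_language_region("en-gb"))  # ('en', 'GB')
```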

tigrish
  • 1
    "...there is no list of all possibilities." More or less what I've worked out, and this is the "executive" summary of Jukka's fuller explanation, I suppose. Still seems to me a list of the *common* combinations might be a helpful thing to have available, but OTOH, it seems like I might be a bit isolated in feeling that way! :) – Dɑvïd Nov 07 '12 at 21:39
8

Unicode maintains such a list: http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/index.html Even better, it is available in XML format (ideal for parsing), along with the usual writing systems used by each language: http://unicode.org/repos/cldr/trunk/common/supplemental/supplementalData.xml (look in /LanguageData)
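
A minimal sketch of reading such XML with the Python standard library follows. The embedded fragment is hand-written for this example, and the element and attribute names (`languageData`, `language`, `type`, `scripts`, `territories`) are an assumption about the file's layout that should be checked against the actual file before relying on it.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment (assumption: the real file uses this shape).
SAMPLE_XML = """\
<supplementalData>
  <languageData>
    <language type="en" scripts="Latn" territories="GB US"/>
    <language type="fr" scripts="Latn" territories="CA FR"/>
  </languageData>
</supplementalData>
"""

root = ET.fromstring(SAMPLE_XML)

pairs = []                # language-region tags
scripts_by_language = {}  # usual writing systems per language

for element in root.iterfind("./languageData/language"):
    code = element.get("type")
    scripts_by_language[code] = element.get("scripts", "").split()
    # Attributes hold space-separated lists; a missing one yields no entries.
    for territory in element.get("territories", "").split():
        pairs.append(f"{code}-{territory}")

print(pairs)  # ['en-GB', 'en-US', 'fr-CA', 'fr-FR']
print(scripts_by_language)
```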

  • @s-f These [links are available](https://web.archive.org/web/20150730204944/http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/index.html) in the Wayback Machine (fortunately). – Dɑvïd Jul 21 '18 at 08:18
  • The [*Likely Subtags*](http://www.unicode.org/cldr/charts/latest/supplemental/likely_subtags.html) page may prove useful too. It provides the most likely language and script for a given region, and vice versa. – Fabien Snauwaert Mar 23 '19 at 22:33
3

One solution would be to parse this list; it would give you all of the keys needed to create the list you are looking for.

http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
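
A minimal sketch of such a parser, assuming the registry's record layout: records separated by lines containing `%%`, each made of `Field: value` lines, with continuation lines starting with whitespace. The sample text is a hand-written excerpt in that style (the `File-Date` value is a placeholder), and `parse_registry` is a hypothetical name.

```python
import re

# Hand-written excerpt in the registry's style; the date is a placeholder.
SAMPLE = """\
File-Date: 2000-01-01
%%
Type: language
Subtag: en
Description: English
%%
Type: language
Subtag: fr
Description: French
%%
Type: region
Subtag: GB
Description: United Kingdom
%%
Type: region
Subtag: CA
Description: Canada
"""


def parse_registry(text):
    """Return a list of records, each mapping a field name to a list of values."""
    records = []
    for block in re.split(r"^%%\s*$", text, flags=re.MULTILINE):
        record, key = {}, None
        for line in block.splitlines():
            if not line.strip():
                continue
            if line[0] in " \t" and key is not None:
                # Continuation line: append to the previous field's value.
                record[key][-1] += " " + line.strip()
                continue
            key, _, value = line.partition(":")
            key = key.strip()
            # A field such as Description may repeat, so keep a list.
            record.setdefault(key, []).append(value.strip())
        if record:
            records.append(record)
    return records


records = parse_registry(SAMPLE)
languages = [r["Subtag"][0] for r in records if r.get("Type") == ["language"]]
regions = [r["Subtag"][0] for r in records if r.get("Type") == ["region"]]
print(languages, regions)  # ['en', 'fr'] ['GB', 'CA']
```

The two resulting lists are the raw material only; deciding which language-region pairs are meaningful still needs another source.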

Adam
3

I think you can take it from here: http://www.unicode.org/cldr/charts/latest/supplemental/territory_language_information.html

s-f
3

List of primary language subtags, with common region subtags for each language (based on population of language speakers in each region):

https://www.unicode.org/cldr/charts/latest/supplemental/language_territory_information.html

For example, for English:

  • en-US (320,000,000)
  • en-IN (250,000,000)
  • en-NG (110,000,000)
  • en-PK (100,000,000)
  • en-PH (68,000,000)
  • en-GB (64,000,000)

(Jukka K. Korpela and tigrish give good explanations for why any combination of language + region code is valid, but it might be helpful to have a list of codes most likely to be in actual use. s-f's link has such useful information sorted by region, so it might also be helpful to have this information sorted by language.)

Chris Tollefson
  • Thanks for posting -- As this list is arranged in `{language} {country}` order, IMO this makes the most sense, as it is very easy and intuitive to convert this to BCP 47 – Martin Jul 17 '23 at 21:51
3

This can be found at Unicode's Common Locale Data Repository. Specifically, a JSON file of this information is available in their cldr-json repo.
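
A minimal sketch of pulling tags out of such a JSON file, assuming the display names sit under `main` → `en` → `localeDisplayNames` → `languages`. The embedded sample is a hand-written stand-in for a downloaded `languages.json`, so the nesting should be verified against the real file.

```python
import json

# Hand-written stand-in for a downloaded languages.json (assumed nesting).
SAMPLE_JSON = """
{"main": {"en": {"localeDisplayNames": {"languages": {
  "en": "English",
  "en-GB": "British English",
  "fr": "French",
  "fr-CA": "Canadian French",
  "es-419": "Latin American Spanish"
}}}}}
"""

data = json.loads(SAMPLE_JSON)
names = data["main"]["en"]["localeDisplayNames"]["languages"]

# Keep only the entries that carry a second subtag after the language code.
regional = {tag: name for tag, name in names.items() if "-" in tag}

for tag, name in sorted(regional.items()):
    print(tag, name)
```

The hyphen test is deliberately crude: on real data it would also keep keys whose second part is not a region, so a stricter filter may be needed.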

Brice
  • This should be the link: https://github.com/unicode-org/cldr-json/blob/main/cldr-json/cldr-localenames-full/main/en/languages.json or https://github.com/unicode-org/cldr-json/blob/main/cldr-json/cldr-localenames-modern/main/en/languages.json – Fawaz Ahmed Apr 25 '22 at 22:50
2

We have a working list that we use for language code/language name referencing for Localizejs. Hope that helps.

List of Language Codes in YAML or JSON?

johnnywu