13

I am looking for some statistical data on the usage of Unicode characters in textual documents (with any markup). Googling brought no results.

Background: I am currently developing a finite state machine-based text processing tool. Statistical data on characters might help in choosing the right transitions. For instance, Latin characters are probably the most used, so it might make sense to check for those first (see the sketch below).

Did anyone by chance gather or see such statistics?

(I'm not focused on specific languages or locales. Think general-purpose parser like an XML parser.)
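
A minimal sketch of the kind of ordering I have in mind, assuming typical web-text frequencies; the class, constants and ranges below are made up purely for illustration:

```java
// Hypothetical sketch: check the statistically most frequent ranges first and fall
// back to the general Unicode classification only for the rarer cases.
final class CharClass {
    static final int WHITESPACE = 0, LETTER = 1, DIGIT = 2, OTHER = 3;

    static int classify(int codePoint) {
        // Hot path: space and ASCII letters dominate typical web text.
        if (codePoint == ' ') return WHITESPACE;
        if (codePoint >= 'a' && codePoint <= 'z') return LETTER;
        if (codePoint >= 'A' && codePoint <= 'Z') return LETTER;
        if (codePoint >= '0' && codePoint <= '9') return DIGIT;
        // Cold path: everything else goes through the general (slower) checks.
        if (Character.isWhitespace(codePoint)) return WHITESPACE;
        if (Character.isLetter(codePoint)) return LETTER;
        if (Character.isDigit(codePoint)) return DIGIT;
        return OTHER;
    }
}
```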

lexicore
  • You need to state the domains or kinds of texts in which you’re searching. There are lots of different text corpora. The statistics will be wildly different when comparing law texts with maths papers. However, I don’t know a by-the-character analysis off the top of my head. – Boldewyn Mar 07 '14 at 09:13
  • Also you could try your luck on the Linguistics StackExchange, http://linguistics.stackexchange.com/. – Boldewyn Mar 07 '14 at 09:14
  • you mean, statistics of character usage over Unicode-encoded documents? Or are you using "Unicode character" in the popular sense of "strange-looking character"? – Walter Tross Mar 07 '14 at 12:39
  • @Boldewyn: I don't have a specific domain. Let's take arbitrary XML documents, for instance. Knowing how characters are distributed might help to develop a better parser. I have a similar task. – lexicore Mar 07 '14 at 17:39
  • @WalterTross: I mean characters in Unicode-encoded documents, not just "strange-looking" ones. Like, if one were to take all the HTML documents in Unicode from the whole Internet, throw away all the markup, and count character occurrences divided by the total number of characters, what would the rates of individual characters be? (A minimal counting sketch follows these comments.) – lexicore Mar 07 '14 at 17:44
  • a parser like yours should take advantage of the knowledge of the document's language, e.g., looking at the `lang` attribute of the `html` tag, and/or, it should take advantage of what text it already has seen. The [Unicode "block name" property](http://en.wikipedia.org/wiki/Unicode_block) of characters is probably useful. – Walter Tross Mar 07 '14 at 18:57
  • (If you'd do the "all the web sans HTML" thing naively, U+0020 followed by U+000A would be the most popular.) For a quick sample, you could use a Wikipedia dump, with all languages included. Chinese will be under-represented, emojis too (think chat protocols), but it should be a good start. – Boldewyn Mar 07 '14 at 21:39
  • @Boldewyn Analyzing the Wikipedia dump is not a bad idea. This would give very good results for my purpose. – lexicore Mar 08 '14 at 20:39
  • This may be of use: http://stackoverflow.com/questions/5567249/what-are-the-most-common-non-bmp-unicode-characters-in-actual-use – Arvindh Mani Mar 11 '14 at 09:12
  • The [CommonCrawl data](http://commoncrawl.org/) would probably be better suited than Wikipedia. – nwellnhof Mar 11 '14 at 10:44
  • I can't give you a statistic about which text characters are used the most. But, maybe the opposite. There is a lot of free space in the [Unicode](http://en.wikipedia.org/wiki/Unicode) value range. Meaning, there are a lot of numbers in the value range 0..0x10FFFF that have not yet been assigned to represent anything. You could find out which numbers are not text characters and therefore not used at all. – Sascha Wedler Mar 14 '14 at 04:51
  • Object to closure just because some people don't understand the problem/relevance: the help centre gives four alternatives (all of which are met here) as to the kind of question that is appropriate when specific source code is not: "We feel the best Stack Overflow questions have a bit of source code in them, but if your question generally covers… a specific programming problem, or a software algorithm, or software tools commonly used by programmers; and is a practical, answerable problem that is unique to software development … then you’re in the right place to ask your question!" – David M W Powers May 24 '16 at 00:26
  • viz. programming problem: parsing (marked up) text; software algorithm: statistical transition rules for finite state machine (_stack_ overflow gurus should know about FSAs and _PDAs_ and their probabilistic variants); software tools: "text processing tool"; practical answerable problem: "statistical data" representative of the corpus of marked up text documents (only obtainable using programs to analyse the corpus, with the supply or pointer to such a program and corpus being a reasonable answer); "unique to software development": the collection of Ngrams is fundamental to Comp.Linguistics. – David M W Powers May 24 '16 at 00:44
  • @DavidMWPowers Thank you for your support, but I won't worry much about closure of this question. The point is made and I got the [data](https://docs.google.com/spreadsheet/ccc?key=0AjHWiIkH6KdCdDd1TnppTnZub1k2MTNhV05xdk5yUXc&usp=sharing). – lexicore May 24 '16 at 21:48
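
To make the counting idea from the comments above concrete, here is a minimal sketch; the class and method names are made up for illustration, and it counts code points in a single in-memory string, whereas a real run would stream a whole corpus:

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of the counting idea discussed above: count every code
// point in a piece of text and divide by the total number of code points.
public class CodePointFrequencies {

    public static Map<Integer, Double> relativeFrequencies(String text) {
        Map<Integer, Long> counts = new TreeMap<>();
        text.codePoints().forEach(cp -> counts.merge(cp, 1L, Long::sum));

        long total = text.codePoints().count();
        Map<Integer, Double> rates = new TreeMap<>();
        counts.forEach((cp, n) -> rates.put(cp, n / (double) total));
        return rates;
    }
}
```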

2 Answers

5

To sum up current findings and ideas:

  • Tom Christiansen gathered such statistics for the PubMed Open Access Corpus (see this question). I have asked if he could share these statistics and am waiting for the answer.
  • As @Boldewyn and @nwellnhof suggested, I could run the analysis over the complete Wikipedia dump or over CommonCrawl data. I think these are good suggestions; I'll probably go with CommonCrawl.

So sorry, this is not an answer, but a good research direction.

UPDATE: I have written a small Hadoop job and run it on one of the CommonCrawl segments. I have posted my results in a spreadsheet here. Below are the first 50 characters:

0x000020    14627262     
0x000065    7492745 e
0x000061    5144406 a
0x000069    4791953 i
0x00006f    4717551 o
0x000074    4566615 t
0x00006e    4296796 n
0x000072    4293069 r
0x000073    4025542 s
0x00000a    3140215 
0x00006c    2841723 l
0x000064    2132449 d
0x000063    2026755 c
0x000075    1927266 u
0x000068    1793540 h
0x00006d    1628606 m
0x00fffd    1579150 
0x000067    1279990 g
0x000070    1277983 p
0x000066    997775  f
0x000079    949434  y
0x000062    851830  b
0x00002e    844102  .
0x000030    822410  0
0x0000a0    797309  
0x000053    718313  S
0x000076    691534  v
0x000077    682472  w
0x000031    648470  1
0x000041    624279  A
0x00006b    555419  k
0x000032    548220  2
0x00002c    513342  ,
0x00002d    510054  -
0x000043    498244  C
0x000054    495323  T
0x000045    455061  E
0x00004d    426545  M
0x000050    423790  P
0x000049    405276  I
0x000052    393218  R
0x000044    381975  D
0x00004c    365834  L
0x000042    353770  B
0x000033    334689  3
0x00004e    325299  N
0x000029    302497  )
0x000028    301057  (
0x000035    298087  5
0x000046    295148  F

To be honest, I have no idea whether these results are representative. As I said, I only analysed one segment. It looks quite plausible to me. One can also easily spot that the markup has already been stripped off - so the distribution is not directly suitable for my XML parser. But it gives valuable hints on which character ranges to check first.
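
A minimal sketch of what such a code-point counting Hadoop job could look like is shown below. This is an illustration only, not the exact job behind the spreadsheet: it assumes plain-text lines via the default TextInputFormat and leaves out the WARC parsing and charset handling a real CommonCrawl run needs.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative sketch: count Unicode code points with Hadoop MapReduce.
// Note that TextInputFormat strips line terminators, so 0x0a would not show up here.
public class CodePointCount {

    public static class CodePointMapper
            extends Mapper<LongWritable, Text, IntWritable, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final IntWritable codePoint = new IntWritable();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            // Emit (codePoint, 1) for every code point, stepping over surrogate pairs.
            for (int i = 0; i < line.length(); ) {
                int cp = line.codePointAt(i);
                codePoint.set(cp);
                context.write(codePoint, ONE);
                i += Character.charCount(cp);
            }
        }
    }

    public static class SumReducer
            extends Reducer<IntWritable, LongWritable, IntWritable, LongWritable> {
        @Override
        protected void reduce(IntWritable key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "code point count");
        job.setJarByClass(CodePointCount.class);
        job.setMapperClass(CodePointMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```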

lexicore
  • This is typical English character frequency, so not much different from ASCII or Latin-1. OP wasn't clear whether they were interested only in English or all usage. – hippietrail Mar 17 '14 at 02:24
  • @hippietrail: I think I was clear that "I'm not focused on specific languages or locales." True, topmost chars are like in English. Other alphabets come later. However, I'm not quite sure how "random" my segment/file in CommonCrawl was. It would make sense to analyse more segments. – lexicore Mar 17 '14 at 08:25
  • Oh I didn't just mean the Latin character set is higher than Chinese or Arabic, I mean the actual order of the letters is English too rather than any of the other many languages that use mostly the same alphabet. On closer look maybe it is a tiny bit different. Typical for English is `e` `t` `a` `o` `n` `r` `i` `s` `h`. But as for Unicode, there are no characters with funny accents or special symbols. The only things more Unicodeish than plain ASCII are the no-break space and 0xfffd. – hippietrail Mar 17 '14 at 08:59
1

I personally think the link to http://emojitracker.com/ in the near-duplicate question is the most promising resource for this. I have not examined the sources (I don't speak Ruby), but from a real-time Twitter feed of character frequencies I would expect quite a different result than from static web pages, and probably a radically different language distribution (I see lots more Arabic and Turkish on Twitter than in my otherwise ordinary life). It's probably not exactly what you are looking for, but if we just look at the title of your question (which is probably what most visitors will have followed to get here), then that is what I would suggest as the answer.

Of course, this raises the question of what kind of usage you are attempting to model. For static XML, which you seem to be after, maybe the CommonCrawl set is a better starting point after all. Text coming out of an editorial process (however informal) looks quite different from spontaneous text.

Out of the suggested options so far, Wikipedia (and/or Wiktionary) is probably the easiest, since it's small enough for local download, far better standardized than a random web dump (all UTF-8, all properly marked up, most of it tagged by language and proofread for markup errors, orthography, and occasionally facts), and yet large enough (and probably already overkill by an order of magnitude or more) to give you credible statistics. But again, if its domain is different from the domain you actually want to model, the statistics will probably be wrong nevertheless.

tripleee
  • I became quite interested in CommonCrawl; this will be a new experience with Hadoop and MapReduce etc. You are correct, I am developing a state-machine-based XML parser (yeah, in 2014) in particular, but I am interested in state-machine-based parsers in general. Twitter and emojis are not exactly what I'm looking for. Wikipedia and CommonCrawl seem to be a much better fit. – lexicore Mar 14 '14 at 09:21
  • Let me reiterate, in more words: If you have not tried to extract text from a large number of real-world web pages out there, you are in for a dauntingly complex task (or less than perfect results). It's not that the problem is complex (well, that too; there are undefined areas, ambiguities, and contradictions in the complex maze of specifications that is the full World Wide Web stack) but that clumsy humans and badly interfaced components produce encoding errors, markup errors, and other dirt which will dominate over many of the interesting phenomena you wanted to extract. – tripleee Mar 14 '14 at 10:19
  • By the way, the emojis are not why the Twitter feed is interesting. They are a fun demo, but tangential at best. – tripleee Mar 14 '14 at 10:20
  • I understand your concern, but I wouldn't overcomplicate it here. Since I'm after XML grammar in the first place, I don't actually have to extract the text from HTML. I can take it all with the markup as is. Just count ALL the characters, no matter whether they are markup or text content. I would even say these results would suit me even better than just the text content. The only thing is encoding. I'll have to detect the encoding somehow - otherwise I won't be able to read correct characters from bytes. However, I saw a number of existing approaches to this, so I think it is solvable. – lexicore Mar 14 '14 at 12:11
  • Here are a few approaches: http://stackoverflow.com/questions/9181530/auto-detect-character-encoding-in-java http://stackoverflow.com/questions/774075/character-encoding-detection-algorithm So I don't think I even have to invent something here. (A small detection sketch follows these comments.) – lexicore Mar 14 '14 at 12:12
  • Go for it, cowboy! Interested to see the results. – tripleee Mar 14 '14 at 13:35
  • Here you go: https://docs.google.com/spreadsheet/ccc?key=0AjHWiIkH6KdCdDd1TnppTnZub1k2MTNhV05xdk5yUXc&usp=sharing – lexicore Mar 15 '14 at 00:11
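
For the encoding detection mentioned in the last comments, here is a minimal sketch using ICU4J's CharsetDetector; the library choice is an assumption (artifact com.ibm.icu:icu4j), and rawBytes is a placeholder for the bytes of a fetched document:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

// Hypothetical sketch: guess the charset of raw document bytes and decode them.
public class EncodingSniffer {

    public static String decode(byte[] rawBytes) throws IOException {
        CharsetDetector detector = new CharsetDetector();
        detector.setText(rawBytes);

        CharsetMatch match = detector.detect();   // best guess; may be unusable for tiny inputs
        if (match == null) {
            // Fall back to UTF-8 when detection gives nothing.
            return new String(rawBytes, StandardCharsets.UTF_8);
        }
        // match.getName() is the detected charset name, match.getConfidence() a 0..100 score.
        return match.getString();                 // the text decoded with the detected charset
    }
}
```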