
I have a large dataset with over 2 million rows of textual data. Now I want to remove the accents from the strings.

In the link below, two different modules are described to remove the accents:

What is the best way to remove accents in a Python unicode string?

The modules described are unicode and unicodedata. It's not clear to me what the differences between the two are, and a comparison is hard because few of my rows contain accents and I don't know which accents would be replaced and which would not.

Therefore, I would like to know what the differences are between the two and which one is recommended to use.

Emil
  • 3
    The answers you found are well-meaning, but wrong. Removing accents is X in a classical [XY problem](http://enwp.org/XY_problem). If you would say what you really want to achieve by doing so, a Unicode expert could tell you how to solve the problem properly. – daxim May 08 '19 at 15:20

1 Answer


There is only one module: unicodedata, which contains the Unicode database, i.e. the names and properties of Unicode code points.

unicode was a built-in function in Python 2. It just converts strings to Unicode strings, so it only deals with the encoding and has no need to store all the character data. In Python 3 all strings are Unicode (with some particularities); only the encoding now has to be specified explicitly.
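To illustrate the Python 3 side, here is a minimal sketch (the variable names are only for illustration): the string is already Unicode, and bytes appear only when you encode or decode explicitly.

```python
# Python 3: str is already Unicode; the encoding is named explicitly
# only when converting to or from bytes.
text = "caf\u00e9"                  # "café" with a precomposed é (U+00E9)
encoded = text.encode("utf-8")      # str -> bytes
decoded = encoded.decode("utf-8")   # bytes -> str

print(type(text).__name__, len(text))        # str 4
print(type(encoded).__name__, len(encoded))  # bytes 5 (é takes two bytes in UTF-8)
print(decoded == text)                       # True
```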

In that answer you see only import unicodedata, so only one module is used. To remove accents you need not just the Unicode code points but also information about the type of each code point (whether it is a combining character), so you need unicodedata.
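A minimal sketch of that idea, assuming that "removing accents" means dropping combining marks after canonical decomposition (the function name `strip_accents` is mine, not part of any library):

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose (NFD), drop combining marks, then recompose (NFC)."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero combining class
    # for combining marks such as U+0301 (combining acute accent).
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)

print(strip_accents("Cr\u00e8me br\u00fbl\u00e9e"))  # Creme brulee
print(strip_accents("\u00d8resund"))                 # Øresund (unchanged)
```

Note that letters such as Ø or ł have no canonical decomposition, so this approach leaves them untouched. Whether that is what you want depends on your use case.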

Maybe you mean unidecode. This is a separate module, outside the standard library. It can be useful for some purposes. The module is simple and gives results only in the ASCII range. That can be fine in some cases, but it can cause problems outside the Latin writing system.

On the other hand, unicodedata does nothing for you by itself. You need to understand Unicode and apply the right filter function (and perhaps know how other languages work).
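Before picking a filter, it may help to see which characters in your data a given filter would actually change. The following is only a sketch under the assumption that the filter drops combining marks; the helper names `drop_marks` and `changed_characters` are hypothetical.

```python
import unicodedata
from collections import Counter

def drop_marks(text):
    """Candidate filter: remove combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def changed_characters(rows):
    """Count every character that the candidate filter would alter."""
    counts = Counter()
    for row in rows:
        # Normalize to NFC first so accented letters are single code points.
        for ch in unicodedata.normalize("NFC", row):
            if drop_marks(ch) != ch:
                counts[ch] += 1
    return counts

rows = ["caf\u00e9", "na\u00efve caf\u00e9", "\u00d8resund"]
for ch, n in changed_characters(rows).most_common():
    # unicodedata.name() gives the official character name for inspection.
    print(n, repr(ch), unicodedata.name(ch, "UNKNOWN"), "->", repr(drop_marks(ch)))
```

Running something like this over the 2 million rows shows which accents would be replaced and how often, which makes it easier to decide whether the filter fits the data.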

So it depends on the case, and maybe you just need some other slug function (to create strings that need no escaping). When working with languages, take care not to overdo things (you may build an offensive word).

Giacomo Catenazzi
  • Right. My eyes were also confused. – Giacomo Catenazzi May 08 '19 at 15:19
  • 1
    Strictly speaking, `unicode` is a *type* in Python 2, and like any type, can be called to produce a value of that type. – chepner May 08 '19 at 15:20
  • The `unidecode` module does quite a good job for non-Latin scripts – it produces a transliteration, eg. "αλφα" → "alpha". It's not the right tool if you want "ἄλφα" → "αλφα" though. – lenz May 08 '19 at 19:12