4

Is there a way to convert special letters in a text to English letters in R? For example:

Æ -> AE
Ø -> O
Å -> A

Edit: The reason I need this conversion is that R can't see that these two words are the same:

library(stringdist)
stringdist('oversættelse','oversaettelse')
[1] 2
grepl('oversættelse','oversaettelse')
[1] FALSE

Some people tend to write using only English characters and others do not. In order to compare texts, I need to have them in the same format.

Mpizos Dimitris
  • These aren't special letters, these are proper Latin characters available. You don't even need Unicode to process them, they are part of the Latin codepage (eg Windows 1252). Why do you want to map them to different characters? – Panagiotis Kanavos Nov 25 '15 at 07:51
  • This sounds like a case of [the XY Problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) - having a problem with X but asking about the assumed solution Y. *Why* do you want to replace these characters with others? What is the *actual* problem you've encountered? Have you used an R function that mangled the text? Is the source data Unicode or using a specific code page? – Panagiotis Kanavos Nov 25 '15 at 07:54
  • Wouldn't it be more appropriate to convert Å to AA, which is the classic convention and even the character combination that the letter Å originally replaced? – Tommy Andersen Nov 25 '15 at 07:56
  • @AnandaMahto *why* use ASCII when R supports Unicode? It's almost impossible to fix encoding problems by switching codepages, there is *always* one more missed conversion – Panagiotis Kanavos Nov 25 '15 at 07:57
  • @TommyA there is no classic convention. There's nothing wrong with *any* of these characters. In fact, there's no reason to use a specific codepage when Unicode is available. – Panagiotis Kanavos Nov 25 '15 at 07:58
  • @PanagiotisKanavos I think you misunderstand the problem here. The OP is trying to compare a string using the non-English letters with a string where the letters have been converted, since that is sometimes the input. The origin of the letter Å is Aa, which means that people will often write Å as Aa when they do not have access to a keyboard with the letter on it, simply out of convenience. – Tommy Andersen Nov 25 '15 at 08:43
  • @TommyA no, I don't misunderstand the issue. In fact, there is no *conversion* involved. This is a matter of collation. For example Chrome *will* match "Ο" when you search for "Ø", because it uses a collation algorithm that can handle diphthongs. If that isn't possible, the workaround is to perform a transformation as shown by bdecaf's answer. `grepl` doesn't have a collation parameter which means fiddling with the environment or ICU locale is necessary – Panagiotis Kanavos Nov 25 '15 at 09:26

2 Answers

7

I recently had a very similar problem and was pointed to the question Unicode normalization (form C) in R : convert all characters with accents into their one-unicode-character form?

Basically, the gist is that many of these special characters have more than one Unicode representation, which messes up text comparisons. The suggested solution is to use the stringi package function stri_trans_nfc. The package also has a function, stri_trans_general, that supports transliteration, which might be exactly what you need.
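As a rough sketch of both functions (assuming the stringi package is installed; the variable names and the `\u` escapes are only for illustration and portability), normalization makes two encodings of the same letter compare equal, and the "Latin-ASCII" transliterator folds the Danish letters to plain ASCII:

```r
library(stringi)

# Two encodings of the same visible letter "å":
composed   <- "\u00e5"   # single code point U+00E5
decomposed <- "a\u030a"  # "a" followed by a combining ring above (U+030A)

# The raw strings differ, so a plain comparison fails.
identical(composed, decomposed)

# After NFC normalization both use the composed form.
same_after_nfc <- stri_trans_nfc(decomposed) == composed

# Transliteration: fold "oversættelse" to ASCII before comparing.
folded <- stri_trans_general("overs\u00e6ttelse", "Latin-ASCII")
matches <- grepl(folded, "oversaettelse", fixed = TRUE)
```

Note that this transliterator maps Å to A, as in the question, not to AA as suggested in the comments.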

bdecaf
  • Specifically, something like `stri_trans_general("ØxxÅxxÆabc", "latin-ascii")`. +1 – A5C1D2H2I1M1N2O1R2T1 Nov 25 '15 at 08:58
  • I wonder whether [icuSetCollate](https://stat.ethz.ch/R-manual/R-devel/library/base/html/icuSetCollate.html) would allow `grepl` to match the strings without transforming them? The help page even includes a Danish example – Panagiotis Kanavos Nov 25 '15 at 09:31
  • @PanagiotisKanavos nice find! This might even help me on my original problem. – bdecaf Nov 25 '15 at 11:30
-2

You can use chartr

x <- "ØxxÅxx"
chartr("ØÅ", "OA", x)
[1] "OxxAxx"

And/or gsub

y <- "Æabc"
gsub("Æ", "AE", y)
[1] "AEabc"
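The two calls above could be combined into one helper, sketched here with base R only. The function name `fold_danish` and the mapping vectors are hypothetical, and the `\u` escapes stand in for the literal letters so the script does not depend on the file encoding:

```r
# Letters to replace and their ASCII stand-ins (same order in both vectors).
# Mapping follows the question (Å -> A); change "A"/"a" to "AA"/"aa" if preferred.
from <- c("\u00c6", "\u00e6", "\u00d8", "\u00f8", "\u00c5", "\u00e5")
to   <- c("AE",     "ae",     "O",      "o",      "A",      "a")

fold_danish <- function(x) {
  # Apply each literal (non-regex) replacement in turn.
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

fold_danish("overs\u00e6ttelse")
fold_danish("\u00d8xx\u00c5xx")
```

Using gsub throughout (instead of chartr) keeps one code path for both the one-to-one and the one-to-two replacements.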
Xavier Nayrac