I want to remove accents and punctuation from a dictionary. For example, I want to transform "à l'épreuve" into "a l epreuve". The dictionary is this one: https://www.poltext.org/fr/donnees-et-analyses/lexicoder (.cat). There are explanations for data frames (Remove accents from a dataframe column in R), but I could not find a way to do this for dictionaries.

My code so far:

library(quanteda)  # provides dictionary()
dict_lg <- dictionary(file = "frlsd/frlsd.cat", encoding = "UTF-8")

Any suggestion?

I_O
MG Fern

2 Answers

This should work:

library(quanteda)
library(stringi)
library(stringr)

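dict_lg_ascii <-
  dict_lg |>
  rapply(
    f = \(term) term |>
      ## compose from string utilities as desired
      stri_trans_general(id = 'Latin-ASCII') |>                     ## strip accents: "é" -> "e"
      str_replace_all(pattern = '[[:punct:]]', replacement = ' '),  ## punctuation -> space
    how = 'replace'  ## keep the nested structure, replacing each leaf in place
  )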

output:

## > dict_lg_ascii
Dictionary object with 2 primary key entries and 2 nested levels.
- [NEGATIVE]:
  - a cornes, a court de personnel , a l etroit, a peine , abais , 
## truncated

From the docs:

Dictionaries can be subsetted using [ and [[, operating the same as the equivalent list operators.

Thus rapply, which recursively applies a function over nested lists, works on the dictionary. In this case, we apply stri_trans_general as suggested here.
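
To see the mechanism without the dictionary file, here is a minimal sketch on a plain nested list. The toy data and the helper name clean_term are made up for illustration, and base R's chartr() and gsub() stand in for stri_trans_general() and str_replace_all() so that it runs without extra packages. Note that chartr() only handles the characters listed explicitly, whereas the 'Latin-ASCII' transliteration is far more general.

```r
# Hypothetical toy data: a nested list standing in for the dictionary.
# Accents are written as \u escapes so the script does not depend on file encoding.
toy <- list(
  NEGATIVE = list("\u00e0 l'\u00e9preuve", "\u00e0 peine"),
  POSITIVE = list("agr\u00e9able")
)

# Hypothetical helper: map a few accented letters to ASCII, then turn
# punctuation into spaces (base R stand-ins for the stringi/stringr calls).
clean_term <- function(term) {
  term <- chartr("\u00e0\u00e2\u00e9\u00e8\u00ea\u00f4\u00fb\u00e7", "aaeeeouc", term)
  gsub("[[:punct:]]", " ", term)
}

# how = "replace" visits every character leaf and keeps the list structure.
toy_ascii <- rapply(toy, clean_term, how = "replace")

print(toy_ascii$NEGATIVE[[1]])  # "a l epreuve"
```

The function only ever receives the character vectors at the leaves, which is why it works here while calling a string function on the whole object does not.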

I_O
    You can tailor your string processing with numerous utilities from {stringi} and {stringr}. I edited the question to include @shghm 's removal of the `[:punct:]` character class, but try yourself for an optimal solution (I have just basic experience with string manipulation). – I_O Jul 20 '23 at 13:15

This post might help: Remove all special characters from a string in R?

The stringr package together with regular expressions is probably a good way to deal with it.
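
As a rough sketch of the regular-expression approach on an ordinary character vector (the sample terms are made up, and base R's gsub() and trimws() are used here as stand-ins for stringr's str_replace_all() and str_squish()):

```r
# Hypothetical sample terms; a character vector is what string functions expect.
terms <- c("a l'etroit", "a peine.", "a-coups")

# Replace every punctuation character with a space.
no_punct <- gsub("[[:punct:]]", " ", terms)

# Collapse runs of whitespace and trim the ends, so that trailing
# punctuation does not leave a dangling space behind.
squished <- trimws(gsub("[[:space:]]+", " ", no_punct))

print(squished)  # "a l etroit" "a peine" "a coups"
```

These functions operate on character vectors, so for a nested object they have to be applied to each element rather than to the object as a whole.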

shghm
  • I tried ```str_replace_all(dict_lg, "[[:punct:]]", " ")``` but I received the following error ```string` must be a vector, not a object.``` – MG Fern Jul 20 '23 at 12:27