I have a huge data frame in which some columns are of type character. The problem is that some values contain "wrong" characters, which makes a call like this fail:

mutate_all(data, funs(tolower))

> Error in mutate_impl(.data, dots) :    Evaluation error: invalid input
> 'https://www.ps.f/c-w/nos-promions/v-ambght-rembment.html#modalit<e9>s'
> in 'utf8towcs'.

So I deleted the "wrong" characters (note: I can't simply remove all special characters, because I need the ":" to separate the data).

I found a solution:

library(qdap) 
keep <- c(":") 
data$column <- strip(data$column, keep, lower = TRUE) 

See: How to remove specific special characters in R

That worked, but it is really slow. So my question is: how can I apply a function to all character columns of my data frame in a way that is quicker than what I just did?

EDIT

An example of what happens in my script:

View(data$column)
"CP:main:234e5qhaw/00:lcd-monitor-with-smatimge-lite"                                               
"CP:main:234e5qhaw/00:lcd-monitor-with-smarimge-lite"                                               
"CP:main:234e5qhaw/00:lcd-monitor-with-sartimge-lite"
"CP:main:bri953/00:faq:skça_sorulan_sorular:xc000003329:f02:9044:9512"

tolower(data$column) 
Error in tolower(data$column) :
invalid input "CP:main:bri953/00:faq:skça_sorulan_sorular:xc000003329:f02:9044:9512" in 'utf8towcs'
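
One way to find the values that trigger this kind of error, as a minimal sketch: base R's `validUTF8()` flags strings that are not valid UTF-8. The vector below is made up for illustration, and the raw byte `\xe7` standing in for a latin1 "ç" is an assumption about the real data.

```r
# Hypothetical stand-in for data$column; "\xe7" is a raw latin1 byte for "ç"
column <- c(
  "CP:main:234e5qhaw/00:lcd-monitor",
  "CP:main:bri953/00:faq:sk\xe7a_sorulan_sorular"
)

bad <- which(!validUTF8(column))   # positions of strings that are not valid UTF-8
Encoding(column[bad])              # declared encoding of the offending strings
```

Running the same check on each character column should show how many values are affected before anything is stripped or converted.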

Ideally, I want to keep as much of the original data as possible. I can imagine that the "special" characters have to be replaced, but I really need to keep the ":" so that I can split the data at a later stage.

R overflow
  • Can you add an example to clarify what the input and expected output is? – Val Mar 14 '18 at 09:13
  • `tolower('https://www.ps.f/c-w/nos-promions/v-ambght-rembment.html#modalits')` works fine for me. What is the issue exactly? Please provide a reproducible example. – Axeman Mar 14 '18 at 09:17
  • Sure guys. Have updated my question. – R overflow Mar 14 '18 at 09:27
  • Seems to be an encoding issue ... maybe [this](https://datascience.stackexchange.com/questions/6115/how-to-convert-a-text-to-lower-case-using-tm-package) or [this](https://datascience.stackexchange.com/questions/6115/how-to-convert-a-text-to-lower-case-using-tm-package) can help you – Val Mar 14 '18 at 09:30
  • Check `Encoding(data$column)`, and try converting to a different encoding (possibly with `enc2utf8`). – Axeman Mar 14 '18 at 10:00
  • `Encoding(data$column)` returns "unknown". I tried to convert it with `Encoding(data$column) <- enc2utf8` but received an error: "a character vector 'value' expected". Now running: `keeps <- c(":"); new <- as.data.frame(lapply(data, function(x) { if (is.character(x)) strip(data, keeps, lower = TRUE) else x }))`. But it has been running for a while now... which is why I wanted faster code :-) – R overflow Mar 14 '18 at 10:23
  • You could try transliterating everything to ASCII with `iconv(x,from="UTF-8",to="ASCII//TRANSLIT")` – Andrew Gustar Mar 14 '18 at 10:44
  • Thank you @AndrewGustar. But that is a bit slow.. (I applied it on my whole data frame). Or is this the quickest way? – R overflow Mar 14 '18 at 11:54
  • No, `enc2utf8` is a function. I.e. `df$column <- enc2utf8(df$column)`, which should be quicker. – Axeman Mar 14 '18 at 18:21
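
The suggestions in the comments above can be combined into a rough sketch that touches only the character columns and uses vectorised base R functions. The example data and the helper name `fix_and_lower` are hypothetical, and treating the stray bytes as latin1 is an assumption that would need checking against the real data.

```r
# Hypothetical example data: one string contains a raw latin1 byte (\xe7 = "ç")
data <- data.frame(
  id = 1:2,
  column = c("CP:main:LCD-Monitor", "CP:main:faq:sk\xe7a_sorulan"),
  stringsAsFactors = FALSE
)

# Assumption: strings that are not valid UTF-8 are latin1; adjust `from` otherwise
fix_and_lower <- function(x) {
  bad <- !validUTF8(x)                                    # invalid UTF-8 strings
  x[bad] <- iconv(x[bad], from = "latin1", to = "UTF-8")  # re-encode only those
  tolower(x)                                              # ":" is left untouched
}

# Apply the helper to character columns only; other columns stay as they are
is_chr <- vapply(data, is.character, logical(1))
data[is_chr] <- lapply(data[is_chr], fix_and_lower)
```

Note that the `lapply()` call quoted in the comment above passes the whole `data` object to `strip()` instead of the current column `x`, which may account for part of its run time. If dropping the offending bytes is acceptable, `iconv(x, "UTF-8", "UTF-8", sub = "")` is a possible alternative to re-encoding; the result of `ASCII//TRANSLIT` is platform-dependent.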

0 Answers