Fix (convert/drop) invalid UTF-8 characters in R

Asked Jun 16 '20 at 05:10

Active Jun 16 '20 at 05:10

Viewed 537 times

I have an issue with UTF-8 encoding in a huge data frame (millions of rows). I tried the approach from this question, but it did not fix the issue.

My column (character) is very simple:

Start date
12/01/2019
12/01/2019
12/02/2019

I am trying to convert it to a date with lubridate's mdy():

taxi_2020_test$`Start Date` <- mdy(taxi_2020_test$`Start Date`)

and get this error:

Error in gsub(reg$alpha_exact[["A"]], "%A", x, ignore.case = T, perl = T) : input string 1 is invalid UTF-8

I am sure this is a UTF-8 issue, because I cannot even import this dataset in Python (in Jupyter): it fails with an error that also mentions UTF-8.

How can I fix the invalid values, or at least drop them? I have millions of rows, so if only a small number of rows are bad, I am fine with losing them.
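For reference, here is a minimal base R sketch of what "fix or drop" could look like. It assumes the bad values are stray bytes that are not valid UTF-8; the vector `x` is a made-up stand-in for the real column, not data from the actual dataset. `validUTF8()` flags the offending strings, and `iconv()` with `sub = ""` strips bytes it cannot convert (this can behave differently depending on the platform's iconv implementation, so treat it as something to verify on your system).

```r
# Hypothetical sample standing in for the `Start Date` column:
# the second element ends with a lone 0xFF byte, which is never valid UTF-8.
x <- c("12/01/2019", "12/01/2019\xff", "12/02/2019")

# validUTF8() returns FALSE for strings that are not valid UTF-8.
ok <- validUTF8(x)
bad_positions <- which(!ok)   # indices of the offending values

# Option 1: drop the bad values.
# For a data frame, the same logical vector can subset rows: df[ok, ].
x_dropped <- x[ok]

# Option 2: keep every row but strip the bytes that cannot be converted.
x_repaired <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "")
```

After either option, the remaining strings should be safe to pass to mdy(). Counting `bad_positions` first shows how many rows would be lost before committing to the drop.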

Anakin Skywalker

Maybe you can find the lines to exclude with `grepl("[^0-9/ ]", taxi_2020_test$"Start Date")` – GKi Jun 16 '20 at 05:49
@GKi, thanks for your suggestion! Did not help, unfortunately. – Anakin Skywalker Jun 16 '20 at 06:25
Maybe then try `grep("^\\d+/\\d+/\\d+$", taxi_2020_test$"Start Date")` to find lines to use. – GKi Jun 16 '20 at 06:32
Did you find the actual bad character(s) in your `start_date` column? – Paul van Oppen Jun 16 '20 at 07:40
I guess so, if I cannot convert it – Anakin Skywalker Jun 16 '20 at 07:45
@GKi, thanks, will try tomorrow! – Anakin Skywalker Jun 16 '20 at 07:45
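The pattern-based filtering suggested in the comments can be sketched roughly as follows. This is only an illustration on a made-up vector `x`, and it assumes that the regex calls were tripping over the invalid bytes themselves; passing `useBytes = TRUE` makes `grepl()` match byte by byte, which avoids the invalid-input complaints in multibyte locales.

```r
# Hypothetical sample: the second value carries a stray 0xFF byte.
x <- c("12/01/2019", "12/01/2019\xff", "12/02/2019")

# Keep only strings that consist of digits/digits/digits and nothing else.
# useBytes = TRUE compares raw bytes, so invalid UTF-8 input is tolerated.
is_date_like <- grepl("^[0-9]+/[0-9]+/[0-9]+$", x, useBytes = TRUE)

# Values that passed the check; the rest can be inspected or dropped.
clean <- x[is_date_like]
rejected_positions <- which(!is_date_like)
```

This is stricter than checking UTF-8 validity alone, since it also rejects valid strings that simply are not in the expected date layout.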

0 Answers