Good morning everyone. I was writing a small script to manage data in R, but importing a huge CSV file (3.5 GB) into R doesn't work, and I don't understand why. As a quick workaround, I decided to use pandas through reticulate.

# Load reticulate and import pandas from Python
library(reticulate)
pd <- import("pandas", as = "pd")
# Read the CSV file with pandas
pd$read_csv("C:\\Users\\Befrancesco\\Desktop\\X_dataset\\x_file_name.csv", error_bad_lines = FALSE, encoding = "utf-8")

R returns this error:

Error in py_call_impl(callable, dots$args, dots$keywords) : 
  UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 105: invalid start byte 

What am I doing wrong?

Thank you in advance for your answer.

Francesco

Earl Mascetti
  • Could be there's an unreadable character in your input csv in position 105. Have you looked at some of your input data? If your file is very big, you can try something like this in your Windows PowerShell: https://stackoverflow.com/a/36836282/5269252 – meenaparam Jan 23 '20 at 11:44
  • @meenaparam Thank you for your answer. I'd like to know whether pandas has an option to skip the bad lines. :) – Earl Mascetti Jan 23 '20 at 11:49
  • You're already using the `error_bad_lines` parameter but that isn't helping here. Could your `encoding` be something else here? Try this answer and the one below: https://stackoverflow.com/questions/18171739/unicodedecodeerror-when-reading-csv-file-in-pandas-with-python – meenaparam Jan 23 '20 at 11:55
  • Thank you so much! :) It works! The solution was encoding = "ISO-8859-1". – Earl Mascetti Jan 23 '20 at 12:04
  • Excellent, glad that has solved your problem. I'll put something in the answer box just so that we can mark this question off as closed. – meenaparam Jan 23 '20 at 12:09

1 Answer

Your file is probably not encoded as UTF-8. Byte 0xf6 can never start a character in UTF-8, but in ISO-8859-1 it is the letter ö. Try one of the other encodings, such as ISO-8859-1, in your read_csv call, e.g.

pd$read_csv("C:\\Users\\Befrancesco\\Desktop\\X_dataset\\x_file_name.csv", error_bad_lines = FALSE, encoding = "ISO-8859-1")

See this answer for more on different encodings: https://stackoverflow.com/a/18172249/5269252
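
If you want to see why the error happens, here is a minimal sketch in plain Python with no pandas. The sample bytes and the helper name `first_working_encoding` are made up for illustration and are not part of pandas or reticulate. It encodes a short string containing ö as ISO-8859-1, which produces byte 0xf6, and then tries each candidate encoding in turn.

```python
# Hypothetical sample data: "Köln" encoded as ISO-8859-1 contains byte 0xf6.
raw = "Köln;42\n".encode("iso-8859-1")


def first_working_encoding(data, candidates=("utf-8", "iso-8859-1")):
    """Return the first candidate that decodes `data` without an error.

    This helper is an illustration only. ISO-8859-1 accepts every byte
    value, so it should be listed last: it never fails, even when it is
    not the encoding the file was really written in.
    """
    for name in candidates:
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError as err:
            # For UTF-8 this reports the same kind of error as in the question.
            print(f"{name} failed: {err}")
    return None


encoding = first_working_encoding(raw)
print("usable encoding:", encoding)
```

The name it returns is the value you would pass as `encoding` to `read_csv`. Keep in mind that a successful ISO-8859-1 decode only means no byte was rejected, so it is still worth checking a few accented values in the result. Also, in pandas 1.3 and later `error_bad_lines` is deprecated in favor of `on_bad_lines`, and neither option affects decoding errors like this one.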

meenaparam