70

I am trying to import a csv that is in Japanese. This code:

url <- 'http://www.mof.go.jp/international_policy/reference/itn_transactions_in_securities/week.csv'
x <- read.csv(url, header=FALSE, stringsAsFactors=FALSE)

returns the following error:

Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, na.strings = character(0L)) : 
invalid multibyte string at '<91>ΊO<8b>y<82>ёΓ<e0><8f>،<94><94><84><94><83><8c>_<96>̏@(<8f>T<8e><9f><81>E<8e>w<92><e8><95>@<8a>փx<81>[<83>X<81>j'

I tried changing the encoding (Encoding(url) <- 'UTF-8' and also to latin1) and tried removing the read.csv parameters, but received the same "invalid multibyte string" message in each case. Is there a different encoding that should be used, or is there some other problem?

jaredwoodard

12 Answers

108

Encoding sets the encoding of a character string. It doesn't set the encoding of the file represented by the character string, which is what you want.

This worked for me, after trying "UTF-8":

x <- read.csv(url, header=FALSE, stringsAsFactors=FALSE, fileEncoding="latin1")

And you may want to skip the first 16 lines, and read in the headers separately. Either way, there's still quite a bit of cleaning up to do.

x <- read.csv(url, header=FALSE, stringsAsFactors=FALSE,
  fileEncoding="latin1", skip=16)
# get started with the clean-up
x[,1] <- gsub("\u0081|`", "", x[,1])    # get rid of odd characters
x[,-1] <- as.data.frame(lapply(x[,-1],  # convert to numbers
  function(d) type.convert(gsub(d, pattern=",", replace=""))))
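
If you don't know a file's encoding in advance, `readr::guess_encoding()` can help narrow it down before you commit to a `fileEncoding` value (a sketch; assumes the readr package is installed — it samples the file and ranks likely encodings by confidence):

```r
library(readr)
# returns a small table of candidate encodings with confidence scores
guess_encoding(url)
```
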
Joshua Ulrich
  • Thanks. From [this question](http://stackoverflow.com/questions/11069908/r-extracting-clean-utf-8-text-from-a-web-page-scraped-with-rcurl) I tried setting the locale to Japanese with `Sys.setlocale`, but that didn't work either ("OS reports request to set locale to "japanese" cannot be honored"). – jaredwoodard Jan 16 '13 at 17:06
  • Yes, read.csv("foobar.csv", fileEncoding = "latin1") worked for me. I had an Excel file and saved as CSV, then had to set the fileEncoding to "latin1" to read that CSV in R. – Dan Jarratt Apr 26 '17 at 19:17
  • @JoshuaUlrich, what if my code looks like this? `file.list <- list.files(pattern = '*.txt'); file.list <- file.list[order(nchar(file.list), file.list)]; df.list <- lapply(file.list, read_file); df_virgi <- do.call(rbind.data.frame, df.list)` Where shall I place `fileEncoding = "latin1"`? Thanks a lot! – Rollo99 Nov 08 '19 at 10:18
17

You may have encountered this issue because of an incompatible system locale. Try setting the locale with: `Sys.setlocale("LC_ALL", "C")`
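
For reference, you can inspect the session's current locale before changing it (a minimal sketch):

```r
Sys.getlocale("LC_ALL")        # see what the session is currently using
Sys.setlocale("LC_ALL", "C")   # fall back to the portable "C" locale
```
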

user3670684
12

The readr package from the tidyverse might help.

You can set the encoding via the locale argument of the read_csv() function, using the locale() function and its encoding argument:

read_csv(file = "http://www.mof.go.jp/international_policy/reference/itn_transactions_in_securities/week.csv",
         skip = 14,
         locale = locale(encoding = "latin1"))
Je Hsers
3

The simplest solution I found for this issue, without losing any data or special characters (for example, with fileEncoding="latin1" characters like the Euro sign € can be lost), is to open the file first in a text editor like Sublime Text and choose "Save with encoding - UTF-8".

Then R can import the file with no issue and no character loss.
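
The same conversion can be scripted from R with `iconv()`, which avoids the manual editor round-trip (a sketch; assumes the source file really is Latin-1, and the file names are illustrative):

```r
# read raw lines, declare them Latin-1, convert to UTF-8, write back out
lines <- readLines("input.csv", encoding = "latin1")
writeLines(iconv(lines, from = "latin1", to = "UTF-8"), "input_utf8.csv")
x <- read.csv("input_utf8.csv", stringsAsFactors = FALSE)
```
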

0

For those using Rattle with this issue, here is how I solved it:

  1. First make sure to quit Rattle so you're at the R command prompt
  2. `> library(rattle)` (if not done so already)
  3. `> crv$csv.encoding="latin1"`
  4. `> rattle()`
  5. You should now be able to carry on, i.e. import your csv > Execute > Model > Execute etc.

That worked for me; hopefully it helps a weary traveller.

wired00
0

I had a similar problem with scientific articles and found a good solution here: http://tm.r-forge.r-project.org/faq.html

By using the following line of code:

tm_map(yourCorpus, content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))

any bytes in the corpus that cannot be converted to UTF-8 are substituted with their hex codes instead of raising an error. I hope this helps.

Carlos
0

If the file you are trying to import into R was originally an Excel file, open the original file and save it as a CSV. That fixed this error for me when importing into R.

822_BA
0

I had the same error and tried all the above to no avail. The issue vanished when I upgraded from R 3.4.0 to 3.4.3, so if your R version is not up to date, update it!

stevec
0

I came across this error (invalid multibyte string 1) recently, but my problem was a bit different:

We had forgotten to save a csv.gz file with an extension, and tried to use read_csv() to read it. Adding the extension solved the problem.
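
That matters because read_csv() decides whether to decompress based on the file extension; a sketch of the two cases (file names are illustrative):

```r
library(readr)
x <- read_csv("data.csv.gz")   # extension present: readr decompresses first
# x <- read_csv("data_no_ext") # raw gzip bytes are parsed as text, which can
#                              # surface as an "invalid multibyte string" error
```
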

Mirabilis
0

Reproduce the read.csv error on multi-byte char repeatedly:

R's read.csv() will puke on all multi-byte characters if it is expecting a number.

I'm using Version: R version 4.2.1 (2022-06-23)

Put this data in file named: /tmp/foo.csv

#year,someval 
2022,0.1389 
2021,0.0000°
2020,0.2857

If you look closely, you can see the 0.0000 value on the second data line has a 'degree' symbol appended.

Load it this way using read.csv:

> read.csv('/tmp/foo.csv')

Error in type.convert.default(data[[i]], as.is = as.is[i], dec = dec,  : 
  invalid multibyte string at '<b0>0'
Calls: read.csv -> read.table -> type.convert -> type.convert.default
Execution halted

What does cat have to say about that guff:

$ cat /tmp/foo.csv 
#year,someval
2022,0.1389
2021,0.0000�
2020,0.2857

We do not tolerate that "Degrees" symbol. Changing the encoding does nothing to help. You could try telling read.csv to interpret everything as a string, but now you've got string to number conversion issues downstream.
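
A sketch of that all-strings route for this file: read every column as character, strip the stray byte, then convert (the `X.year` column name comes from read.csv's mangling of the `#year` header):

```r
x <- read.csv('/tmp/foo.csv', colClasses = "character")
# drop any byte that is not a digit, sign, or decimal point, then convert;
# useBytes = TRUE sidesteps the invalid-multibyte check during the gsub
x$someval <- as.numeric(gsub("[^0-9.+-]", "", x$someval, useBytes = TRUE))
```
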

What does read.csv2 have to say?:

> read.csv2('/tmp/foo.csv')
  X.year.someval
1    2022,0.1389
2 2021,0.000\xb0
3    2020,0.2857

(0xb0 is the degree sign in Latin-1: https://www.codetable.net/hex/b0)

Eric Leschinski
0

Did you use copy-paste to create the CSV file? I had the same error and successfully tried the most popular solution from this thread (fileEncoding="latin1"). After I re-saved the data frame into a CSV file, I found that some cells had an extra space after the cell value (encoded as A-tilde). I removed these spaces in the original file and was then able to read it without fileEncoding="latin1" and without any error.
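
If stray trailing spaces are the culprit, they can also be stripped in R rather than in the source file; a self-contained sketch with illustrative data:

```r
x <- data.frame(name = c("alpha ", " beta"), val = c("1 ", "2 "),
                stringsAsFactors = FALSE)
# trim leading/trailing whitespace from every character column, then convert
x[] <- lapply(x, function(col) if (is.character(col)) trimws(col) else col)
x$val <- as.numeric(x$val)
```
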

iMSQ
0

I had this problem with a DBI connection while reading a SQL file with read_lines, but it seems the file had nothing to do with it. Refreshing my SQL connection (re-connecting) solved the issue.

I have no idea why it behaves so strangely.

Sys.info()
       sysname        release        version             machine 
     "Windows"       "10 x64"  "build 19044"             "x86-64" 
Captain Tyler