
The csv file I'm trying to read has exactly the following format :

Date,x,y
"2015/08/01","71,131","20,390"
"2015/08/02","81,599","23,273"
"2015/08/03","79,435","21,654"
"2015/08/04","80,733","20,924"

The separator is a comma, but each value is also enclosed in quotes because of the comma that serves as a thousands separator. I tried `read.csv`, `read_csv` from {readr}, and `fread` from {data.table}, and the best I can do is read all values as strings and then use a combination of `as.numeric` and `gsub` to transform them into numbers.

I also found this: Most elegant way to load csv with point as thousands separator in R. It is quite useful, but my data has a lot of columns (not all numeric) and I'd rather not specify column types.

Any ideas, or should I start `gsub`-ing? On the fun side, Excel reads the file just fine :)
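The workaround mentioned above, reading everything as character and then stripping the commas, can be sketched like this with the sample rows (only two rows shown for brevity):

```r
# read everything as character first, then strip the grouping commas
df <- read.csv(text = 'Date,x,y
"2015/08/01","71,131","20,390"
"2015/08/02","81,599","23,273"',
               colClasses = "character")
df$x <- as.numeric(gsub(",", "", df$x))
df$y <- as.numeric(gsub(",", "", df$y))
```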

BogdanC
  • Do you know which columns have to be transformed? – Jaap Oct 06 '15 at 13:08
  • @Jaap I guess, but at least it will read it correctly. Ah just saw they already tried that. nvm – David Arenburg Oct 06 '15 at 13:13
  • David - I'm not sure what that would accomplish. Anyway, I get this error: ` Expected sep (' ') but new line, EOF (or other non printing character) ends field 4 when detecting types ( first): Enter the new date: (mm-dd-yy) ` – BogdanC Oct 06 '15 at 13:14
  • Because you need to read it in the correct format, I couldn't format that in comments. But I guess you already tried that anyway. – David Arenburg Oct 06 '15 at 13:15
  • @Jaap - right now I do know (about 40 columns). However, there will be columns added in the future and I'd rather not perform manual changes each time. – BogdanC Oct 06 '15 at 13:16
  • If Excel can read the file correctly, then you could load it into Excel and resave as a tab-separated value file (might be ".txt" option). What's wrong with using `as.numeric` though? – CJB Oct 06 '15 at 13:27
  • @BogdanC See my answer for a possible solution. How are these future columns added? – Jaap Oct 06 '15 at 13:39

3 Answers


With the data.table package you could do it as follows:

1: Create a vector of column names to be transformed. In this case the Date column has to be excluded.

cols <- setdiff(names(dt),"Date")

2: Apply a transformation function to the remaining columns:

library(data.table)
dt[, (cols) := lapply(.SD, function(x) as.numeric(gsub(",", "", x))), .SDcols = cols]

This results in:

> dt
         Date     x     y
1: 2015/08/01 71131 20390
2: 2015/08/02 81599 23273
3: 2015/08/03 79435 21654
4: 2015/08/04 80733 20924

Used data:

dt <- fread('Date,x,y
            "2015/08/01","71,131","20,390"
            "2015/08/02","81,599","23,273"
            "2015/08/03","79,435","21,654"
            "2015/08/04","80,733","20,924"')
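If {readr} is on hand (the question already mentions it), `readr::parse_number` can replace the `gsub`/`as.numeric` pair, since its default locale treats the comma as a grouping mark; a sketch using the same approach:

```r
library(data.table)
library(readr)

dt <- fread('Date,x,y
"2015/08/01","71,131","20,390"
"2015/08/02","81,599","23,273"')

cols <- setdiff(names(dt), "Date")
# parse_number strips the grouping commas and returns a numeric vector
dt[, (cols) := lapply(.SD, parse_number), .SDcols = cols]
```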
Jaap

You should be able to read the data with `read.csv`. Here is an example:

#write data
write('Date,x,y\n"2015/08/01","71,131","20,390"\n"2015/08/02","81,599","23,273"\n"2015/08/03","79,435","21,654"\n"2015/08/04","80,733","20,924"',"test.csv")

#use "text" rather than "file" in read.csv
#perform regex substitution before using read.csv
#the outer gsub with '(?<=\\d),(\\d{3})(?!\\d)' performs the thousands separator substitution
#the inner gsub replaces all \" with '
read.csv(text=gsub('(?<=\\d),(\\d{3})(?!\\d)',
                   '\\1',
                   gsub("\\\"",
                        "'",
                        paste0(readLines("test.csv"),collapse="\n")),
                   perl=TRUE),
         header=TRUE,
         quote="'",
         stringsAsFactors=FALSE)

The result

#        Date     x     y
#1 2015/08/01 71131 20390
#2 2015/08/02 81599 23273
#3 2015/08/03 79435 21654
#4 2015/08/04 80733 20924
cryo111
    @BogdanC Please check with some sample data. It's hard to say beforehand if a regular expression processes all possible input correctly. – cryo111 Oct 06 '15 at 14:34
  • @cryo111 - wow! I hadn't given regular expressions a thought at all. I tested on some real data and it works great; I love that I don't have to adjust for column names or types. Thank you! – BogdanC Oct 06 '15 at 16:24
  • @BogdanC Glad it worked! BTW: You could make the regex pattern more restrictive (and therefore probably more fail-safe) by also including the enclosing quotation marks `"`. – cryo111 Oct 06 '15 at 16:39
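The more restrictive variant suggested in the last comment, anchoring the pattern to the closing quotation mark so only commas inside quoted numbers are touched, could look like this (a sketch, not tested against every possible input):

```r
x <- c('Date,x,y',
       '"2015/08/01","71,131","20,390"',
       '"2015/08/02","81,599","23,273"')
# drop a comma only when it sits inside a quoted number, i.e. it is
# followed by groups of three digits and then the closing quote
cleaned <- gsub('(?<=\\d),(?=\\d{3}(?:,\\d{3})*")', '', x, perl = TRUE)
df <- read.csv(text = cleaned)
```

The field separators are untouched because they are preceded by a quote, not a digit, so the lookbehind fails there.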

The best solution is to remove all this formatting from your Excel sheet before exporting it.

Failing that, just use `lapply` to convert each column:

df[c("x", "y")] <- lapply(df[c("x", "y")], function(x) as.numeric(gsub(",", "", x)))
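If the affected columns aren't known in advance, one possible extension is to auto-detect character columns that look like comma-grouped numbers and convert only those. This is a hedged sketch; the `num_like` helper is hypothetical, not part of any answer above:

```r
df <- read.csv(text = 'Date,x,y
"2015/08/01","71,131","20,390"
"2015/08/02","81,599","23,273"',
               colClasses = "character")

# hypothetical helper: TRUE if every value matches digits with optional
# comma grouping and an optional decimal part
num_like <- function(x) all(grepl("^-?[0-9]{1,3}(,[0-9]{3})*(\\.[0-9]+)?$", x))

to_fix <- names(df)[vapply(df, num_like, logical(1))]
df[to_fix] <- lapply(df[to_fix], function(x) as.numeric(gsub(",", "", x)))
```

The Date column fails the pattern (it contains slashes), so it stays as-is.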
Hong Ooi