
The csv file I'm trying to read has exactly the following format :

Date,x,y
"2015/08/01","71,131","20,390"
"2015/08/02","81,599","23,273"
"2015/08/03","79,435","21,654"
"2015/08/04","80,733","20,924"

The separator is a comma, but each value is also enclosed in quotes because of the comma that serves as a thousands separator. I tried `read.csv`, `read_csv` from {readr}, and `fread` from {data.table}, and the best I can do is read all values as strings and then use a combination of `as.numeric` and `gsub` to transform them into numbers.

I also found this: Most elegant way to load csv with point as thousands separator in R. It is quite useful, but my data has a lot of columns (not all numeric) and I'd rather not specify column types.

Any ideas, or should I start `gsub`-ing? On the fun side, Excel reads the file just fine :)
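The workaround mentioned above, reading everything as character and then stripping the commas, can be sketched like this with the sample rows (only two rows shown for brevity):

```r
# read everything as character first, then strip the grouping commas
df <- read.csv(text = 'Date,x,y
"2015/08/01","71,131","20,390"
"2015/08/02","81,599","23,273"',
               colClasses = "character")
df$x <- as.numeric(gsub(",", "", df$x))
df$y <- as.numeric(gsub(",", "", df$y))
```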

BogdanC
  • Do you know which columns have to be transformed? – Jaap Oct 06 '15 at 13:08
  • @Jaap I guess, but at least it will read it correctly. Ah just saw they already tried that. nvm – David Arenburg Oct 06 '15 at 13:13
  • David - I'm not sure what that would accomplish. Anyway, I get this error: ` Expected sep (' ') but new line, EOF (or other non printing character) ends field 4 when detecting types ( first): Enter the new date: (mm-dd-yy) ` – BogdanC Oct 06 '15 at 13:14
  • Because you need to read it in the correct format, I couldn't format that in comments. But I guess you already tried that anyway. – David Arenburg Oct 06 '15 at 13:15
  • @Jaap - right now I do know (about 40 columns). However, there will be columns added in the future and I'd rather not perform manual changes each time. – BogdanC Oct 06 '15 at 13:16
  • If Excel can read the file correctly, then you could load it into Excel and resave as a tab-separated value file (might be ".txt" option). What's wrong with using `as.numeric` though? – CJB Oct 06 '15 at 13:27
  • @BogdanC See my answer for a possible solution. How are these future columns added? – Jaap Oct 06 '15 at 13:39

3 Answers


With the data.table package you could do it as follows:

1: Create a vector of column names to be transformed. In this case the Date column has to be excluded.

cols <- setdiff(names(dt),"Date")

2: Apply a transformation function to the remaining columns:

library(data.table)
dt[, (cols) := lapply(.SD, function(x) as.numeric(gsub(",", "", x))), .SDcols = cols]

This results in:

> dt
         Date     x     y
1: 2015/08/01 71131 20390
2: 2015/08/02 81599 23273
3: 2015/08/03 79435 21654
4: 2015/08/04 80733 20924

Used data:

dt <- fread('Date,x,y
            "2015/08/01","71,131","20,390"
            "2015/08/02","81,599","23,273"
            "2015/08/03","79,435","21,654"
            "2015/08/04","80,733","20,924"')
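If {readr} is on hand (the question already mentions it), `readr::parse_number` can replace the `gsub`/`as.numeric` pair, since its default locale treats the comma as a grouping mark; a sketch using the same approach:

```r
library(data.table)
library(readr)

dt <- fread('Date,x,y
"2015/08/01","71,131","20,390"
"2015/08/02","81,599","23,273"')

cols <- setdiff(names(dt), "Date")
# parse_number strips the grouping commas and returns a numeric vector
dt[, (cols) := lapply(.SD, parse_number), .SDcols = cols]
```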
Jaap

You should be able to read the data with `read.csv`. Here is an example:

#write data
write('Date,x,y\n"2015/08/01","71,131","20,390"\n"2015/08/02","81,599","23,273"\n"2015/08/03","79,435","21,654"\n"2015/08/04","80,733","20,924"',"test.csv")

#use "text" rather than "file" in read.csv
#perform regex substitution before using read.csv
#the outer gsub with '(?<=\\d),(\\d{3})(?!\\d)' performs the thousands separator substitution
#the inner gsub replaces all \" with '
read.csv(text=gsub('(?<=\\d),(\\d{3})(?!\\d)',
                   '\\1',
                   gsub("\\\"",
                        "'",
                        paste0(readLines("test.csv"),collapse="\n")),
                   perl=TRUE),
         header=TRUE,
         quote="'",
         stringsAsFactors=FALSE)

The result

#        Date     x     y
#1 2015/08/01 71131 20390
#2 2015/08/02 81599 23273
#3 2015/08/03 79435 21654
#4 2015/08/04 80733 20924
cryo111
    @BogdanC Please check with some sample data. It's hard to say beforehand if a regular expression processes all possible input correctly. – cryo111 Oct 06 '15 at 14:34
  • @cryo111 - wow! I hadn't given regular expressions a thought at all. I tested on some real data and it works great; I love that I don't have to adjust for column names or types. Thank you! – BogdanC Oct 06 '15 at 16:24
  • @BogdanC Glad it worked! BTW: You could make the regex pattern more restrictive (and therefore probably more fail-safe) by also including the enclosing quotation marks `"`. – cryo111 Oct 06 '15 at 16:39
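The more restrictive variant suggested in the last comment, anchoring the pattern to the closing quotation mark so only commas inside quoted numbers are touched, could look like this (a sketch, not tested against every possible input):

```r
x <- c('Date,x,y',
       '"2015/08/01","71,131","20,390"',
       '"2015/08/02","81,599","23,273"')
# drop a comma only when it sits inside a quoted number, i.e. it is
# followed by groups of three digits and then the closing quote
cleaned <- gsub('(?<=\\d),(?=\\d{3}(?:,\\d{3})*")', '', x, perl = TRUE)
df <- read.csv(text = cleaned)
```

The field separators are untouched because they are preceded by a quote, not a digit, so the lookbehind fails there.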

The best solution is to remove all this formatting from your Excel sheet before exporting it.

Failing that, just use `lapply` to convert each column:

df[c("x", "y")] <- lapply(df[c("x", "y")], function(x) as.numeric(gsub(",", "", x)))
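If the affected columns aren't known in advance, one possible extension is to auto-detect character columns that look like comma-grouped numbers and convert only those. This is a hedged sketch; the `num_like` helper is hypothetical, not part of any answer above:

```r
df <- read.csv(text = 'Date,x,y
"2015/08/01","71,131","20,390"
"2015/08/02","81,599","23,273"',
               colClasses = "character")

# hypothetical helper: TRUE if every value matches digits with optional
# comma grouping and an optional decimal part
num_like <- function(x) all(grepl("^-?[0-9]{1,3}(,[0-9]{3})*(\\.[0-9]+)?$", x))

to_fix <- names(df)[vapply(df, num_like, logical(1))]
df[to_fix] <- lapply(df[to_fix], function(x) as.numeric(gsub(",", "", x)))
```

The Date column fails the pattern (it contains slashes), so it stays as-is.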
Hong Ooi