7

I got several CSV files which contain numbers in the local german style i.e. with a comma as the decimal separator and the point as the thousand separator e.g. 10.380,45. The values in the CSV file are separated by ";". The files also contain columns from the classes character, Date, Date & Time and Logical.

The problem with the read.table functions is, that you can specify the decimal separator with dec=",", but NOT the thousand point separator. (If I'm wrong, please correct me)

I know that preprocessing is a workaround, but I want to write my code in a way, that others can use it without me.

I found a way to read the CSV file the way I want it with read.csv2, by setting my own classes, as can be seen in the following example. Based on Most elegant way to load csv with point as thousands separator in R

# Create test example
df_test_write <- cbind.data.frame(c("a","b","c","d","e","f","g","h","i","j",rep("k",times=200)),
                            c("5.200,39","250,36","1.000.258,25","3,58","5,55","10.550,00","10.333,00","80,33","20.500.000,00","10,00",rep("3.133,33",times=200)),
                            c("25.03.2015","28.04.2015","03.05.2016","08.08.2016","08.08.2016","08.08.2016","08.08.2016","08.08.2016","08.08.2016","08.08.2016",rep("08.08.2016",times=200)),
                            stringsAsFactors=FALSE)
colnames(df_test_write) <- c("col_text","col_num","col_date")

# write test csv
write.csv2(df_test_write,file="Test.csv",quote=FALSE,row.names=FALSE)

#### read with read.csv2 ####

# First, define your own class

#define your own numeric class
setClass('myNum')
#define conversion
setAs("character","myNum", function(from) as.numeric(gsub(",","\\.",gsub("\\.","",from))))

# own date class
library(lubridate)
setClass('myDate')
setAs("character","myDate",function(from) dmy(from))

# Read the csv file, in colClasses the columns class can be defined
df_test_readcsv <- read.csv2(paste0(getwd(),"/Test.csv"),
                       stringsAsFactors = FALSE,
                       colClasses = c(
                         col_text = "character",
                         col_num = "myNum",
                         col_date = "myDate"
                       )
                )

My problem now is, that the different datasets have up to 200 columns and 350000 Rows. With the upper solution I need between 40 and 60 seconds to load one CSV file and I would like to speed this up.

Through my research I found fread() from the data.table package, which is really fast. It takes approximately 3 to 5 seconds to load the CSV file.

Unfortunately there is also no possibility to specify the thousand separator. So I tried to use my solution with colClasses, but there seems to be the issue, that you can't use individual classes with fread https://github.com/Rdatatable/data.table/issues/491

See also my following test code:

##### read with fread ####
library(data.table)

# Test without colclasses
df_test_readfread1 <- fread(paste0(getwd(),"/Test.csv"),
                            stringsAsFactors = FALSE,
                            dec = ",",
                            sep=";",
                            verbose=TRUE)
str(df_test_readfread1)

# PROBLEM: In my real dataset it turns the number into an numeric column, 
# unforunately it sees the "." as decimal separator, so it turns e.g. 10.550, 
# into 10.5
# Here it keeps everything as character

# Test with colclasses
df_test_readfread2 <- fread(paste0(getwd(),"/Test.csv"),
                            stringsAsFactors = FALSE,
                            colClasses = c(
                              col_text = "character",
                              col_num = "myNum",
                              col_date = "myDate"
                            ),
                            sep=";",
                            verbose=TRUE)
str(df_test_readfread2)

# Keeps everything as character

So my question is: Is there a way to read CSV files with numeric values like 10.380,45 with fread?

(Alternatively: What is the fastest way to read a CSV with such numeric values?)

Uwe Keim
  • 39,551
  • 56
  • 175
  • 291
PhiSeu
  • 301
  • 2
  • 9
  • See [#1636](https://github.com/Rdatatable/data.table/issues/1636). This strikes me as a deficiency... not sure why setting `options("datatable.fread.dec.locale" = "de_DE.utf8")` doesn't fix the problem. @Arun is that not odd? – MichaelChirico Aug 17 '16 at 18:22

4 Answers4

2

I never used package myself, but it's from Hadley Wickham, should be good stuff

https://cran.r-project.org/web/packages/readr/readr.pdf

It supposed to handle locales:

locale(date_names = "en", date_format = "%AD", time_format = "%AT", decimal_mark = ".", grouping_mark = ",", tz = "UTC", encoding = "UTF-8", asciify = FALSE)

decimal_mark and grouping_mark is what you're looking for

EDIT form PhiSeu: Solution

Thanks to your suggestion here are two solutions with read_csv2() from the readr package. For my 350000 row CSV file it takes approximately 8 seconds, which is much faster then the read.csv2 solution. (Another helpful package from hadley and RStudio, thanks)

library(readr)

# solution 1 with specified columns
df_test_readr <- read_csv2(paste0(getwd(),"/Test.csv"),
                           locale = locale("de"),
                           col_names = TRUE,
                           cols(
                             col_text = col_character(),
                             col_num = col_number(), # number is automatically regcognized through locale=("de")
                             col_date2 = col_date(format ="%d.%m.%Y") # Date specification
                           )
                           )

# solution 2 with overall definition of date format
df_test_readr <- read_csv2(paste0(getwd(),"/Test.csv"),
                           locale = locale("de",date_format = "%d.%m.%Y"), # specifies the date format for the whole file
                           col_names = TRUE
)
PhiSeu
  • 301
  • 2
  • 9
Severin Pappadeux
  • 18,636
  • 3
  • 38
  • 64
1

Remove all commas first maybe.

filepath<-paste0(getwd(),"/Test.csv")
filestring<-readChar(filepath, file.info(filepath)$size)
filestring<-gsub('.','',filestring,fixed=TRUE)
fread(filestring)
Dean MacGregor
  • 11,847
  • 9
  • 34
  • 72
  • This doesn't help unfortunately. fread detects this automatically (if you look at the printed code when verbose=TRUE). I added it to the example, to make it clear. But thanks. – PhiSeu Aug 17 '16 at 15:20
  • @PhiSeu sorry I tested it on my machine and it complained about machine locales so didn't know if it would work for you. – Dean MacGregor Aug 17 '16 at 15:28
  • No problem. You might have a problem in the example with the working directory where the CSV file will be wrote and read. I tested my example in the context of a RProject (on Windows 7). So it might not work for everybody. – PhiSeu Aug 17 '16 at 15:32
  • Interesting suggestion, unfortunately that deletes the dec "," separator, so 3.133,33 becomes 3.133,33. If I change your solution to filestring<-gsub('.','',filestring,fixed=TRUE) it deletes the points in my date column. – PhiSeu Aug 17 '16 at 15:58
  • @PhiSeu oh yeah I meant for it to be "." not "," but I'm too used to US way. You shouldn't need points in your date column. I would use `colClasses` to make it a string when loading and then modify if after loading with `as.POSIXct(colname, format="%Y%m%d")`. Obviously you'll have to change format to match exactly what you have. – Dean MacGregor Aug 17 '16 at 16:03
0

Found this discussion while working on the same problem. I ended up with

    dat <- fread(
        "myfile.csv", dec=",", sep=";", encoding="UTF-8"
    ) |> suppressWarnings()

    for (cn in numcols) if(!is.numeric(dat[,.(cn)])) {
        set(dat, j=cn, value=gsub(".", "", dat[[cn]], fixed=TRUE))
        set(dat, j=cn, value=gsub(",", ".", dat[[cn]], fixed=TRUE))
        set(dat, j=cn, value=as.numeric(dat[[cn]]))
    }

where numcols is a character vector of the names of the columns that are expected to be numeric.

Karsten W.
  • 17,826
  • 11
  • 69
  • 103
0

Yout can plug-in the custom type in data.table::fread like so, with the help of the readr::locale:

    library(data.table)
    library(readr)

    setAs("character","myDate",function(from) lubridate::dmy(from))
    # define your own numeric class
    setClass('myNumeric')
    # define conversion from character to myNumeric
    setAs("character", "myNumeric", \(from) 
      parse_number(from, locale = locale(decimal_mark = ",", 
                                         grouping_mark = ".")))
    dat <- fread("myfile.csv", 
                  dec=",", 
                  sep=";", 
                  encoding="UTF-8";
                  colClasses = c(col_text = "character",
                                 col_num = "myNumeric",
                                 col_date = "myDate")

What is being leveraged here is the colClasses specification which relies on the custom class conversion definition by setAs.

Patrik_P
  • 3,066
  • 3
  • 22
  • 39