The fread function from the data.table package reads large CSV files faster than the read.csv function. But as you can see from the output below, the data frames produced by the two routines differ in the "device_id" column (see the last 3 digits). Why? Is there a parameter in these functions to read them correctly? Or is this normal behavior for fread? (It reads this data file 10x faster, though.)

# Read file
library(data.table)
p <- fread("C:\\User\\Documents\\Data\\device.csv", sep=",", integer64="character")
> str(p)
         Classes ‘data.table’ and 'data.frame': 187245 obs. of  3 variables:
         $ device_id   : Factor w/ 186716 levels "-1000025442746372936",..: 89025 96789 140102 123523 45208 118633 32423 22215 54410 81947 ...
         $ phone_brand : Factor w/ 131 levels "E<U+4EBA>E<U+672C>""| __truncated__,"E<U+6D3E>""| __truncated__,..: 52 52 16 10 16 32 52 32 52 14 ...
         $ device_model: Factor w/ 1598 levels "1100","1105",..: 1517 750 561 1503 537 775 753 433 759 983 ...
         - attr(*, ".internal.selfref")=<externalptr>

> head(p)
                          device_id            brand                     device_model
            1: -8890648629457979026 <U+5C0F><U+7C73>                 <U+7EA2><U+7C73>
            2:  1277779817574759137 <U+5C0F><U+7C73>                             MI 2
            3:  5137427614288105724 <U+4E09><U+661F>                        Galaxy S4
            4:  3669464369358936369            SUGAR <U+65F6><U+5C1A><U+624B><U+673A>
            5: -5019277647504317457 <U+4E09><U+661F>                    Galaxy Note 2
            6:  3238009352149731868 <U+534E><U+4E3A>                             Mate

# Read file
p<-read.csv("C:\\Users\\Documents\\Data\\device.csv",sep=",")

# Convert device_id to character
> p$device_id<-as.character(p$device_id)

> str(p)
    'data.frame':   187245 obs. of  3 variables:
 $ device_id   : chr  "-8890648629457979392" "1277779817574759168" "5137427614288105472" "3669464369358936576" ...
 $ phone_brand : chr  "<U+5C0F><U+7C73>""| __truncated__ "<U+5C0F><U+7C73>""| __truncated__ "<U+4E09><U+661F>""| __truncated__ "SUGAR" ...
 $ device_model: chr  "<U+7EA2><U+7C73>""| __truncated__ "MI 2" "Galaxy S4" "<U+65F6><U+5C1A><U+624B><U+673A>""| __truncated__ ...

    > head(p)
                     device_id            brand                     device_model
        1 -8890648629457979392 <U+5C0F><U+7C73>                 <U+7EA2><U+7C73>
        2  1277779817574759168 <U+5C0F><U+7C73>                             MI 2
        3  5137427614288105472 <U+4E09><U+661F>                        Galaxy S4
        4  3669464369358936576            SUGAR <U+65F6><U+5C1A><U+624B><U+673A>
        5 -5019277647504317440 <U+4E09><U+661F>                    Galaxy Note 2
        6  3238009352149731840 <U+534E><U+4E3A>                             Mate
user1046647
  • I googled the codes and they seem to be Unicode code points for some Chinese characters. Are you trying to import Chinese brands and devices? – f.lechleitner Oct 10 '17 at 20:39
  • Try showing the `class()` of each of the columns with `str()` or something. What is the value that's actually in the file? It would be easier to help if you provided a proper [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). You probably want to force device_id to be a character value in both cases. It's likely a number with `read.csv`. – MrFlick Oct 10 '17 at 20:40
  • As @MrFlick wrote, please provide the files for examination. Also, if you're looking for speed and better control over encoding you should probably try `readr::read_csv`. – Adi Sarid Oct 10 '17 at 21:22
  • I added the structure of the data frames created by both functions (see text above). I do not know how to attach a file for the data. Yes, the "brand" column is in Chinese, but this is irrelevant. What is important is the difference between the device_id values in the last 3 digits, in spite of the fact that in both cases device_id is of the same class: factor. The first line in the csv file is "-8890648629457979026,小米,红米", i.e. "fread" reads the file correctly, but "read.csv" reads it wrongly. "readr::read_csv" converts device_id to double (e.g. -8.890649e+18 instead of -8890648629457979026) – user1046647 Oct 10 '17 at 23:30
  • I'd guess your device_ids exceed the maximum number of significant digits R can represent, so `read.csv` is losing information by representing them as floating point values. Please provide the column classes of both data.frames. See the `.Machine$double*` components to query more details for your machine. A possible solution would be to read the device_id as character instead of number. – R Yoda Oct 11 '17 at 05:45
  • In the case for "read.csv", the device_id is converted to character format (see the text above). And yet it is still wrong. > .Machine$double returns NULL. (computer is a Lenovo T530 with 6GB memory x64 laptop). So what is wrong? – user1046647 Oct 11 '17 at 14:09

2 Answers

Like teger elegantly discussed, the read.csv function has a limitation in reading 64-bit numbers. So, like fread, read.csv also works if the numerals argument is set to "no.loss". Thanks to all the contributors to this question.

p<-read.csv("C:\\Users\\Documents\\Data\\device.csv",sep=",",encoding="UTF-8", numerals="no.loss" )

> head(p)
              device_id      phone_brand                     device_model
1: -8890648629457979026 <U+5C0F><U+7C73>                 <U+7EA2><U+7C73>
2:  1277779817574759137 <U+5C0F><U+7C73>                             MI 2
3:  5137427614288105724 <U+4E09><U+661F>                        Galaxy S4
4:  3669464369358936369            SUGAR <U+65F6><U+5C1A><U+624B><U+673A>
5: -5019277647504317457 <U+4E09><U+661F>                    Galaxy Note 2
6:  3238009352149731868 <U+534E><U+4E3A>                             Mate
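
To make the effect of numerals="no.loss" easier to check without the original file, here is a minimal sketch on a temporary CSV. The two device_id values are taken from the question; the file name, the brandA/brandB and modelA placeholders, and the variable names are illustrative assumptions, not part of the original data.

```r
# Build a tiny sample file (ids from the question, other fields are placeholders)
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "device_id,phone_brand,device_model",
  "-8890648629457979026,brandA,modelA",
  "1277779817574759137,brandB,MI 2"
), tmp)

# Default behaviour: device_id is converted to double and loses trailing digits
p_loss <- read.csv(tmp)
class(p_loss$device_id)

# numerals = "no.loss": the column is left as text when conversion would lose accuracy
p_safe <- read.csv(tmp, numerals = "no.loss")
as.character(p_safe$device_id)
```

Depending on the R version and the stringsAsFactors setting, the preserved column may come back as a factor or as character, which is why the sketch wraps it in as.character().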
user1046647

If the bit64 package is installed, fread will automatically use it to correctly read integers that exceed R's 32-bit integer limit of 2^31 - 1.

read.csv does not do that. It stores such values as doubles, which hold only about 15 to 16 significant digits, so the trailing digits are rounded away.

This is mentioned in the first paragraph at ?fread:

Similar to read.table but faster and more convenient. All controls such as sep, colClasses and nrows are automatically detected. bit64::integer64 types are also detected and read directly without needing to read as character before converting.
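
As a rough illustration of the question's fread call, the sketch below reads a small temporary file with integer64="character". It assumes the data.table package is installed (the code is skipped otherwise); the sample rows reuse two ids from the question with placeholder brand and model values, and the names tmp and dt are illustrative.

```r
# Only run when data.table is available
if (requireNamespace("data.table", quietly = TRUE)) {
  tmp <- tempfile(fileext = ".csv")
  writeLines(c(
    "device_id,phone_brand,device_model",
    "-8890648629457979026,brandA,modelA",
    "1277779817574759137,brandB,MI 2"
  ), tmp)

  # Columns detected as 64-bit integers are returned as character strings
  dt <- data.table::fread(tmp, sep = ",", integer64 = "character")
  print(class(dt$device_id))
  print(dt$device_id)
}
```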

You are using the integer64="character" option, so the values are detected and read as character strings. read.table (and therefore read.csv) does not detect them and reads them as numbers instead. If you want read.csv to behave similarly, use the colClasses argument to specify that the column should be read as character during import. Once the file has been read in, it is too late: the precision has already been lost, and p$device_id<-as.character(p$device_id) cannot "undo" the problem.
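
A short sketch of why converting after the fact cannot help, using the first device_id quoted in the question. Doubles represent every integer exactly only up to 2^53, and this id is far beyond that. The variable names are illustrative.

```r
id_text <- "-8890648629457979026"  # first device_id from the question

# This is what a default numeric read effectively stores
x <- as.numeric(id_text)

.Machine$integer.max    # 2147483647, i.e. 2^31 - 1
.Machine$double.digits  # 53 bits of mantissa on IEEE 754 platforms

# Converting the double back to text does not recover the original digits
back <- sprintf("%.0f", x)
identical(back, id_text)  # FALSE
```

The same components also answer the earlier comment: `.Machine$double` is NULL because the relevant entries have longer names such as `.Machine$double.digits` and `.Machine$double.eps`.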

Is there a parameter in these functions to read them correctly? Or is this normal behavior for fread?

Yes, fread is reading things correctly; this is normal behavior. read.csv will take a little more work to read things correctly: you will need to use the colClasses argument to read the long integer as a character. And it will still be slower.
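
The colClasses approach could look roughly like the following. This is a sketch on a temporary sample file, not the original device.csv: the ids come from the question, while the brand and model values and the variable names are placeholders.

```r
# Sample file with ids from the question and placeholder brand/model values
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "device_id,phone_brand,device_model",
  "-8890648629457979026,brandA,modelA",
  "1277779817574759137,brandB,MI 2",
  "5137427614288105724,brandC,Galaxy S4"
), tmp)

# A named colClasses vector overrides only device_id; other columns are still auto-detected
p_char <- read.csv(tmp, colClasses = c(device_id = "character"))
str(p_char)
p_char$device_id
```

Because the column never passes through a numeric type, every digit of the id is kept exactly as written in the file.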

Gregor Thomas