
Writing a data frame that mixes small integer entries (values below 1000) and "large" ones (values of 1000 or more) to a csv file with write_csv() produces a mix of scientific and non-scientific entries. If the first 1000 rows contain only small values but a large value appears later, read_csv() seems to get confused by this mix and returns NA for the entries written in scientific notation:

library(tidyverse)  # provides tibble(), write_csv() and read_csv()

test_write_read <- function(small_value, 
                            n_fills, 
                            position, 
                            large_value) {
    # Fill a one-column tibble with small values, plant one large value,
    # round-trip it through a csv file and return the re-read tibble
    tib             <- tibble(a = rep(small_value, n_fills))
    tib$a[position] <- large_value
    write_csv(tib, "tib.csv")
    read_csv("tib.csv")
}

The following calls do not cause any problem:

tib <- test_write_read(small_value = 1, 
                       n_fills     = 1001, 
                       position    = 1000, #position <= 1000
                       large_value = 1000)
tib <- test_write_read(1, 1001, 1001, 999)
tib <- test_write_read(1000, 1001, 1000, 1)

However, the following calls do:

tib <- test_write_read(small_value = 1, 
                       n_fills     = 1001, 
                       position    = 1001, #position > 1000
                       large_value = 1000)
tib <- test_write_read(1, 1002, 1001, 1000)
tib <- test_write_read(999, 1001, 1001, 1000)

A typical output:

problems(tib)
## A tibble: 1 x 5
#  row   col   expected               actual file
#  <int> <chr> <chr>                  <chr>  <chr>
#1 1001  a     no trailing characters e3     'tib.csv'

tib %>% tail(n = 3)
## A tibble: 3 x 1
#      a
#  <int>
#1   999
#2   999
#3    NA

The csv file:

$ tail -n3 tib.csv
#999
#999
#1e3

I am running:

R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

with tidyverse_1.2.1 (loading readr_1.1.1)

Is this a bug that should be reported?

Habert
  • `read_csv` has an argument `guess_max`, which by default will be set to 1000. So `read_csv` only reads the first 1000 records before trying to figure out how each column should be parsed. Increasing `guess_max` to be larger than the total number of rows should fix the problem. – Marius Jan 12 '18 at 02:39
  • You could also specify `col_types = ...` as double or character. – CPak Jan 12 '18 at 03:27
  • Using @CPak's suggestion will make your code more reproducible and your analyses more predictable in the long run. That's a primary reason `read_csv()` spits out a message about the colspec upon reading (so you can copy it and use it). Copy it, modify it and tell it to use a different type. – hrbrmstr Jan 12 '18 at 04:09

2 Answers


Adding the two answers from the comments, both correct, along with the rationale, as Community Wiki.

read_csv has an argument guess_max, which by default will be set to 1000. So read_csv only reads the first 1000 records before trying to figure out how each column should be parsed. Increasing guess_max to be larger than the total number of rows should fix the problem. – Marius 4 hours ago

You could also specify `col_types = ...` as double or character. – CPak 3 hours ago

Using @CPak's suggestion will make your code more reproducible and your analyses more predictable in the long run. That's a primary reason read_csv() spits out a message about the colspec upon reading (so you can copy it and use it). Copy it, modify it and tell it to use a different type. – hrbrmstr
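
For example, a minimal sketch of both suggestions, assuming the single-column file tib.csv from the question:

library(readr)

# Option 1: raise guess_max above the number of rows so the large value is
# seen during type guessing (by default only the first 1000 rows are used)
tib <- read_csv("tib.csv", guess_max = 100000)

# Option 2: skip guessing altogether by declaring the column type up front;
# "d" reads column a as double (use "c" / col_character() for character)
tib <- read_csv("tib.csv", col_types = "d")
tib <- read_csv("tib.csv", col_types = cols(a = col_double()))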

IRTFM
  • Using `col_types = "d"` indeed works on this minimal example, but this stems from a real-life problem where I might not know in advance which column this might be. Moreover, I might need to keep the column as integer and then need an extra `mutate(a = as.integer(a))`. – Habert Jan 12 '18 at 20:24
  • I find the suggestion with `guess_max = ...` slightly more practical. In my real-life problem the data frame typically has between 10K and 100K rows, and events with integer values > 1000 are rare. There is no way to know in advance whether the first such event will occur after `guess_max`, unless I set `guess_max` to the number of rows. – Habert Jan 12 '18 at 20:31
  • What would completely solve this problem is the `Allow int_use_scientific=FALSE` option for write_csv from Zeehio's [readr_commit](https://github.com/zeehio/readr/commit/9f4061269fb8b7a36d5f8d424ac54b093ec54c84), since there is really no need to write scientific notation automatically for integers >= 1000. – Habert Jan 12 '18 at 20:34 (a workaround along these lines is sketched below)
  • @Habert You can edit the answer. Might have more impact than a comment. – IRTFM Jan 12 '18 at 22:18
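
In the meantime, a minimal workaround sketch on the writing side (assuming the one-column tibble from the question; this is not the patched write_csv() from the commit above): make sure the column is not a double when it is written, either by storing it as integer or by formatting it as plain character.

library(tidyverse)

tib <- tibble(a = c(rep(1, 1000), 1000))

# Store the column as integer (readr should then write 1000 rather than 1e3)
tib %>%
    mutate(a = as.integer(a)) %>%
    write_csv("tib.csv")

# Or format the numbers as plain character strings before writing
tib %>%
    mutate(a = format(a, scientific = FALSE, trim = TRUE)) %>%
    write_csv("tib.csv")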

I just installed the dev version of readr with devtools::install_github("tidyverse/readr"), so I now have readr_1.2.0, and the NA problem went away. However, column "a" is now "guessed" by read_csv() as dbl (whether or not there is a large integer in it), whereas it was correctly read as int before, so if I need it as int I still have to do an as.integer() conversion (sketched at the end of this answer). At least it no longer crashes my code.

tib <- test_write_read(1, 1002, 1001, 1000)
tib %>% tail(n = 3)
## A tibble: 3 x 1
#        a
#    <dbl>
#1    1.00
#2 1000
#3    1.00

The large value is still written as 1e3 by write_csv(), though, so in my opinion this is not quite a final solution.

$ tail -n3 tib.csv
#1
#1e3
#1
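
For completeness, the conversion mentioned above, as a minimal sketch (assuming the same single column a):

# Read with the dev readr (column a is guessed as dbl), then convert back to integer
tib <- test_write_read(1, 1002, 1001, 1000) %>%
    mutate(a = as.integer(a))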
Habert