3

I tried to import a text file in R (3.4.0) which actually contains 4 columns but the 4th column is mostly empty until 200,000+th row. I use the fread() in package data.table (ver 1.10.4)

fread("test.txt",fill = TRUE, sep = "\t", quote = "", header = FALSE)

I got this error message:

Error in fread("test.txt", fill = TRUE, sep = "\t", quote = "", header = FALSE) : 
Expecting 3 cols, but line 258088 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep='  ' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.

I checked the file and there's additional text in 258088th row in the 4th column ("8-4").

Nevertheless, fill = TRUE did not solve this as I expected. I thought it might be fread() determining column numbers inappropriately because the additional column occurs very late in the file. So I tried this:

fread("test.txt", fill = TRUE, header = FALSE, sep = "\t", skip = 250000)

The error persisted. On the other hand,

fread("test.txt", fill = TRUE, header = FALSE, sep = "\t", skip = 258080)

This gives no error.

I thought I found the reason, but the weird thing happened when I tested with a dummy file generated by:

write.table(matrix(c(1:990000), nrow = 330000), "test2.txt", sep = "\t", row.names = FALSE)

with the addition of a "8-4" in the 4th column of the 250000th row by Excel. When read by fread():

fread("test2.txt", fill = TRUE, header = FALSE, sep = "\t")

It worked fine with no error message, and this should indicate some late additional column not necessarily trigger error.

I also tried changing encoding ("Latin-1" and "UTF-8") or quote, but neither helped.

Now I feel clueless, and hopefully I did my homework enough with a reproducible information. Thank you for helping.

For additional environmental info, my sessionInfo() is:

R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] zh_TW.UTF-8/zh_TW.UTF-8/zh_TW.UTF-8/C/zh_TW.UTF-8/zh_TW.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
  [1] dplyr_0.5.0            purrr_0.2.2.2          readr_1.1.1            tidyr_0.6.3           
  [5] tibble_1.3.3           ggplot2_2.2.1          tidyverse_1.1.1        stringr_1.2.0         
  [9] microbenchmark_1.4-2.1 data.table_1.10.4     

loaded via a namespace (and not attached):
[1] Rcpp_0.12.11     cellranger_1.1.0 compiler_3.4.0   plyr_1.8.4       forcats_0.2.0   
[6] tools_3.4.0      jsonlite_1.5     lubridate_1.6.0  nlme_3.1-131     gtable_0.2.0    
[11] lattice_0.20-35  rlang_0.1.1      psych_1.7.5      DBI_0.6-1        parallel_3.4.0  
[16] haven_1.0.0      xml2_1.1.1       httr_1.2.1       hms_0.3          grid_3.4.0      
[21] R6_2.2.1         readxl_1.0.0     foreign_0.8-68   reshape2_1.4.2   modelr_0.1.0    
[26] magrittr_1.5     scales_0.4.1     rvest_0.3.2      assertthat_0.2.0 mnormt_1.5-5    
[31] colorspace_1.3-2 stringi_1.1.5    lazyeval_0.2.0   munsell_0.4.3    broom_0.4.2     
YC Chen
  • 33
  • 1
  • 6
  • 2
    I think the easiest way to fix this is to add a header to the top of the file, with the headings separated by tabs as in your file. `fread` looks at the first 30 rows of the data by default and uses that to infer how many columns it has, so with no data in the 4th column for those rows it assumes there are only 3 fields. – Marius Jun 23 '17 at 06:31
  • Perhaps add `quote=""` – IRTFM Jun 23 '17 at 06:36
  • I guess it's not an issue of `fread` or `read.csv`. The file has issues. A csv should have the same number of columns for each row and your file does not. You should work on the process generating the file and not on how to import it. – nicola Jun 23 '17 at 06:50
  • Thank you for all the advices. I tried to add a header there and it does work. Nevertheless I am working on a bunch of file like this, so I was wondering if there's some universal solution. I am aware of the awkward format of the first file, and since its a standard export from our analyzer, so there wasn't much choice left. I could fix the format issue first, but since `read.table()` works fine on it with `fill = TRUE`, it would be faster to leave it and go back to `read.table()`. I was looking for a faster way to arrange these when I turned to `fread()` – YC Chen Jun 23 '17 at 07:34

3 Answers3

3

Actually there is a difference between the two files that you provide, and I think this is the cause of the different outputs of the fread.

The first file has an end of the line after the 3rd column, except line 258088, where there is a tab a 4th column and then the end of the line. (You can use the option 'show all characters to confirm that').

On the other hand the second file has in all rows an extra tab, i.e. a new empty column. So in the first case fread expects 3 columns and then finds out a 4th column. On the contrary in the second file, fread expects 4 columns.

I checked read.table with fill=TRUE and it worked with both files. So I think that something is done differently with the fill option of the fread.

I would expect since fill=TRUE, all the lines to be used so as to infer the number of columns (with cost on computational time).

In the comments there are some nice workarounds you can use.

NpT
  • 451
  • 4
  • 11
  • Thank you for the insight! I didn't notice the difference. I guess that additional tab might be added by Excel, while the first example file was generated by our analyzer. `read.table()` with `fill = TRUE` does its work perfectly, and the reason why I am trying `fread()` is that we have a bunch of these to run, and a faster importing would be of great help. – YC Chen Jun 23 '17 at 07:43
  • 1
    You are welcome! I think it would pay off to add a header and use the fread, since as you said you have a lot of files and you use read.table much... – NpT Jun 23 '17 at 08:18
1

The file has issue: if the table has four columns, at the end of each row with the fourth column missing a \t should have been present.

In this case you may have better luck with a low-level approach: read the file line by line, add a \t to each row which doesn't have the fourth column, split each line with \t and collect all together in a data.frame. Most of the above work is done by the data.table::tstrsplit function. Try something like:

f<-readLines("test.txt")
require(stringr)
require(data.table)
a<-data.frame(tstrsplit(f,"\t",type.convert=TRUE,names=TRUE,keep=1:4),stringsAsFactors=FALSE)
str(a)
#'data.frame':  273070 obs. of  4 variables:
# $ V1: num  0 0.002 0.004 0.006 0.008 0.01 0.012 0.014 0.016 0.018 ...
# $ V2: num  -18.7113 -1.2685 0.0768 0.1507 0.1609 ...
# $ V3: num  0 0 0 0 0 0 0 0 0 0 ...
# $ V4: chr  NA NA NA NA ...
nicola
  • 24,005
  • 3
  • 35
  • 56
0

I was struggling with this as well. I found another solution (for csv and read.table) here How can you read a CSV file in R with different number of columns. This answer you can use the handy function count.fields to count the delimiters of a file by line and then take the max field count to pass the max number of column names to fread. A reproducible example is below.

Generate text with uneven number of fields

text <- "12223, University\n12227, bridge, Sky\n12828, Sunset\n13801, Ground\n14853, Tranceamerica\n16520, California, ocean, summer, golden gate, beach, San Francisco\n14854, San Francisco\n15595, shibuya, Shrine\n16126, fog, San Francisco\n"

Write to file

cat(text, file = "foo")

Scan file for delimeters

max.fields<-max(count.fields("foo", sep = ','))

Now use fread to read file, but expect a max number of columns from the col.names argument

fread("foo", header = FALSE, fill=TRUE, sep=",", col.names = paste("V", 1:max.fields, sep = ""))

However, I was basing this data on the example data from ?count.fields and found if the max number of fields is in the last line of the file, fread will still fail with the following error.

Error in fread("foo", header = FALSE, fill = TRUE, sep = ",", col.names = paste("V", : Expecting 3 cols, but line 9 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.

example

text <- "12223, University\n12227, bridge, Sky\n12828, Sunset\n13801, Ground\n14853, Tranceamerica\n14854, San Francisco\n15595, shibuya, Shrine\n16126, fog, San Francisco\n16520, California, ocean, summer, golden gate, beach, San Francisco\n"
cat(text, file = "foo")
max.fields<-max(count.fields("foo", sep = ','))
fread("foo", header = FALSE, fill=TRUE, sep=",", col.names = paste("V", 1:max.fields, sep = ""))

I'll report this as an issue to the data.table Github. Update: issue logged here https://github.com/Rdatatable/data.table/issues/2691

Alex Thomas
  • 1,142
  • 1
  • 11
  • 13