The following code generates data files in which each row has a different number of columns. The option fill=TRUE appears to work only until some character limit is reached. For instance, compare lines 1-3 with lines 9-11, both of which work as expected, against lines 5-7, which do not. How can I read the entirety of notworking1.dat with fill=TRUE enabled, and not just the first 100 rows?

for (i in seq(1000,1099,by=1)) 
    cat(file="working1.dat", c(1:i, "\n"), append = TRUE)
df <- fread(input = "working1.dat", fill=TRUE)

for (i in seq(1000,1101,by=1)) 
    cat(file="notworking1.dat", c(1:i, "\n"), append = TRUE)
df <- fread(input = "notworking1.dat", fill=TRUE)

for (i in seq(1,101,by=1)) 
    cat(file="working2.dat", c(1:i, "\n"), append = TRUE)
df <- fread(input = "working2.dat", fill=TRUE)

The following attempt also fails:

df <- fread(input = "notworking1.dat", fill=TRUE, col.names=paste0("V", seq_len(1101)))

Warning message received:

Warning message:
In data.table::fread(input = "notworking1.dat", fill = TRUE) :
  Stopped early on line 101. Expected 1099 fields but found 1100. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<1 2 3 4 ...
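
The warning is consistent with the column count being settled from the first 100 lines, since line 101 is the first row wider than any of them. That reading is an inference from the message, not something the post confirms. A minimal sketch for checking the per-line field counts with base R's count.fields() follows; the temporary path and the names lines and n_fields are illustrative, not from the original code.

```r
# Recreate a file shaped like notworking1.dat in a temporary location
# (assumption: the question writes to the working directory instead).
path <- tempfile(fileext = ".dat")
lines <- vapply(1000:1101,
                function(i) paste(seq_len(i), collapse = " "),
                character(1))
writeLines(lines, path)

# count.fields() ships with R (utils); sep = "" splits on any white space.
n_fields <- count.fields(path, sep = "")

max(n_fields[1:100])                    # widest row among the first 100 lines: 1099
max(n_fields)                           # widest row in the whole file: 1101
which(n_fields > max(n_fields[1:100]))  # rows wider than the first 100: 101 102
```

The same max(n_fields) value can be used to decide how many column names a header line would need.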

Claudio Paladini
algae

https://github.com/Rdatatable/data.table/pull/5119 Looks like it's a work in progress... although it is not (yet) included in the current development version (as far as I can see)... – Wimpel Apr 29 '22 at 07:55

1 Answer

We could find the maximum number of columns, write a header line with that many column names, and then read the file with fread:

library(data.table)

x <- readLines("notworking1.dat")
myHeader <- paste(paste0("V", seq(max(lengths(strsplit(x, " ", fixed = TRUE))))), collapse = " ")

# write with headers
write(myHeader, "tmp_file.txt")
write(x, "tmp_file.txt", append = TRUE)
# read as usual with fill
d1 <- fread("tmp_file.txt", fill = TRUE)

# check output
dim(d1)
# [1]  102 1101
d1[100:102, 1101]
#    V1101
# 1:    NA
# 2:    NA
# 3:  1101

But as we already have the data imported with readLines, we could just parse it:

x <- readLines("notworking1.dat")
xSplit <- strsplit(x, " ", fixed = TRUE)

# rowbind unequal length list, and convert to data.table
d2 <- data.table(t(sapply(xSplit, '[', seq(max(lengths(xSplit))))))

# check output
dim(d2)
# [1]  102 1101
d2[100:102, 1101]
#    V1101
# 1:  <NA>
# 2:  <NA>
# 3:  1101
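
The `<NA>` entries in the output above indicate that the parsed columns are character, not integer. If numeric columns are wanted, one option is to convert each row before binding. The sketch below uses a small stand-in input so that it is self-contained; the names width, rows and m are illustrative, and as.integer() would need to be swapped for as.numeric() for floating-point data.

```r
# Small stand-in for the ragged input (assumption: space-separated integers,
# with a trailing space as produced by cat() in the question).
x <- c("1 2 3 ", "1 2 3 4 ", "1 2 ")
xSplit <- strsplit(x, " ", fixed = TRUE)
width <- max(lengths(xSplit))

# Convert each row to integer, then pad it with NA up to the widest row.
rows <- lapply(xSplit, function(v) {
  v <- as.integer(v)
  length(v) <- width  # extending a vector fills the new slots with NA
  v
})

# Row-bind into an integer matrix; as.data.frame() names the columns V1, V2, ...
m <- do.call(rbind, rows)
d <- as.data.frame(m)
str(d)
```

The matrix m could equally be passed to data.table::as.data.table() if a data.table is required.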

This is a known issue (GitHub issue 5119). It is not implemented yet, but the suggestion is that fill should also accept an integer as input. The solution would then be something like:

d <- fread(input = "notworking1.dat", fill = 1101)
zx8754
  • Notwithstanding performance issues, is there any reason not to use this `readLines` + `strsplit` combination in general (for numeric types, both integer and floating point)? For instance, I noticed that `readLines` does not introduce garbage values or rounding errors compared with `fread`: (https://stackoverflow.com/questions/71906814/fread-fwrite-introduces-garbage-values). – algae May 03 '22 at 00:43
  • @algae I was not aware of the "garbage" issue. If I had a big file, I'd use the data.table solution, but add the header outside R, so that I read the file only once within R. If it was a small file, I'd use the readLines+strsplit option to avoid a dependency on data.table. – zx8754 May 03 '22 at 07:56