
When I use fread() from R's data.table package to read a 3 GB .dat file, I get the following problem:

Stopped early on line 3169933. Expected 136 fields but found 138. Consider fill=TRUE and comment.char=. First discarded non-empty line:


My code:

library(data.table)
file_path = 'data.dat' # 3GB
fread(file_path,fill=TRUE)

The problem is that my file has about 5 million rows and the number of columns changes partway through:

  • Rows 1 to 3169932 have 136 columns
  • Rows 3169933 to 5000000 have 138 columns

Because of this error, fread() stops reading at row 3169933. Setting fill=TRUE did not help in this case. Could anyone help me?

R version: 3.6.3
data.table version: 1.13.2

Note about fill=TRUE in this case:

[Case 1, not my case] If the first part of my file (50% of the rows) has 138 columns and the second part has 136 columns, fill=TRUE helps: it fills the two missing columns in the second part with NA.

[Case 2, my case] If the first part of my file (50% of the rows) has 136 columns and the second part has 138 columns, fill=TRUE does not help.

duy ngọc

2 Answers


Not sure why you still have the problem even with fill=TRUE. If nothing else helps, you can try something like this:

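tryCatch(
  expr    = {dt1 <<- fread(file_path)},
  warning = function(w){
    cat('Warning: ', w$message, '\n\n')
    # Pull the line number out of "Stopped early on line N. ..."
    n_line <- as.numeric(gsub('Stopped early on line (\\d+)\\..*','\\1',w$message))
    if (!is.na(n_line)) {
      cat('Found ', n_line,'\n')
      dt1_part1 <- fread(file_path, nrows=n_line)
      # Line n_line is the first discarded line, so the second part must start
      # on it: skip only the n_line - 1 lines before it, and treat it as data.
      dt1_part2 <- fread(file_path, skip=n_line - 1, header=FALSE)
      dt1 <<- rbind(dt1_part1, dt1_part2, fill=TRUE)
    }
  },
  finally = cat("\nFinished. \n")
)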

The tryCatch() construct catches the warning message, so you can extract the line number from it and split the read at that line.
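A variant of the same idea, as a minimal sketch: withCallingHandlers() lets the first fread() call finish and return the rows it did read, so only the remainder of the file has to be read a second time. The function name read_widening_file is hypothetical. The sketch assumes the warning text is the English "Stopped early on line N" message shown in the question, that the file changes width only once, and that the extra fields are appended on the right, so the leading columns of both parts line up.

```r
library(data.table)

# Hypothetical helper: read a file whose later lines have more fields
# than its earlier lines, and stitch the two parts together.
read_widening_file <- function(path) {
  stop_line <- NA_integer_

  # First pass: keep whatever fread() returns, and record the line number
  # from the "Stopped early on line N" warning instead of printing it.
  part1 <- withCallingHandlers(
    fread(path),
    warning = function(w) {
      msg <- conditionMessage(w)
      hit <- regmatches(msg, regexec("Stopped early on line ([0-9]+)", msg))[[1]]
      if (length(hit) == 2) {
        stop_line <<- as.integer(hit[2])
        invokeRestart("muffleWarning")
      }
    }
  )
  if (is.na(stop_line)) return(part1)  # no early stop, nothing to stitch

  # Second pass: line stop_line is the first discarded line, so skip only
  # the lines before it and read it as data, not as a header.
  part2 <- fread(path, skip = stop_line - 1L, header = FALSE)

  # Assumption: the leading columns of both parts are the same variables,
  # so part 2 borrows part 1's names for them.
  k <- min(ncol(part1), ncol(part2))
  setnames(part2, seq_len(k), names(part1)[seq_len(k)])

  # fill = TRUE pads the columns that exist in only one part with NA.
  rbind(part1, part2, fill = TRUE)
}
```

Calling dt1 <- read_widening_file(file_path) then needs no hard-coded row number, which matters when different files break at different rows.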

Vasily A
  • About fill=TRUE: [Case 1] if the first part of my file has 138 columns and the second part has 136 columns, fill=TRUE helps (it fills the two missing columns in the second part with NA). [Case 2] If the first part has 136 columns and the second part has 138 columns, fill=TRUE does not help. In Case 2 your solution reads the file smoothly. Thank you so much again! – duy ngọc Nov 15 '20 at 07:20

Try reading the two parts separately, then combine them after adding two extra columns to the first part.

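library(data.table)
library(dplyr)

# Line 3169933 is the first 138-column line, so the second part starts on it
first_part = fread('data.dat', nrows = 3169933) %>%
  mutate(extra_1 = NA, extra_2 = NA)

second_part = fread('data.dat', skip = 3169932, header = FALSE)
setnames(second_part, names(first_part))  # bind_rows() matches columns by name
df = bind_rows(first_part, second_part)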
Gejun
  • @ Anderson Zhu: Thank you for your help. The problem is that the files do not all break at the same row: some DAT files stop at row 3169933, others at a different row (e.g. row 2886321 or 3500212). I cannot hard-code nrows in the general case. I also tried to extract the row number (3169933 in this case) from the warning message, but that did not work. Do you have any suggestion for how to extract the row number from the warning? If I can identify the failing row, I can split the file into two parts and combine them as you suggest. – duy ngọc Nov 15 '20 at 05:20
  • Another possible solution is to read your data line by line so that you can check the number of columns in each row. See https://stackoverflow.com/questions/11664075/import-dat-file-into-r Hope this helps. – Gejun Nov 15 '20 at 16:31
  • @ Anderson Zhu: Thank you so much for your proactive support. I will try this solution and compare it to the method above! – duy ngọc Nov 16 '20 at 08:52
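
A per-line column check like the one suggested in the comments can be sketched with base R's count.fields(), which returns the number of fields on each line without parsing the values. The helper name field_count_runs is hypothetical, and the separator is an assumption that has to be set to match the file (the default "" means whitespace). Quoting and comment characters are switched off here, which is also an assumption about the data.

```r
# Hypothetical helper: summarise where the number of fields per line changes.
field_count_runs <- function(path, sep = "") {
  # One count per line; blank lines are kept so positions match line numbers.
  n <- count.fields(path, sep = sep, quote = "", comment.char = "",
                    blank.lines.skip = FALSE)
  runs <- rle(n)  # collapse consecutive lines with the same field count
  data.frame(
    first_line = cumsum(c(1L, head(runs$lengths, -1L))),
    n_lines    = runs$lengths,
    n_fields   = runs$values
  )
}
```

Each row of the result describes one block of consecutive lines with the same width, so the first_line of the second row is the line at which to split the read. On a multi-gigabyte file this scans every line, so it may be slow compared with parsing the fread() warning.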