R data.table: fread() stops early on a file with an inconsistent number of columns

Question

When I use fread() from the R data.table package to read a 3 GB .dat file, the read stops early with this warning:

Stopped early on line 3169933. Expected 136 fields but found 138. Consider fill=TRUE and comment.char=. First discarded non-empty line:

My code:

library(data.table)
file_path = 'data.dat' # 3GB
fread(file_path,fill=TRUE)

The file has about 5 million rows. In detail:

Lines before 3169933 have 136 columns
Lines from 3169933 to about 5000000 have 138 columns

fread() only reads the file up to line 3169933 because of this, and fill=TRUE does not help in this case. Could anyone help me?

R version: 3.6.3, data.table version: 1.13.2

Note about fill=TRUE in this case:

[Case 1, not my case] If the first part of the file (about 50% of the rows) has 138 columns and the second part has 136 columns, fill=TRUE helps: it fills the two missing columns in the second part with NA.

[Case 2, my case] If the first part of the file (about 50% of the rows) has 136 columns and the second part has 138 columns, fill=TRUE does not help.

https://stackoverflow.com/questions/44464441/r-is-there-a-good-replacement-for-plyrrbind-fill-in-dplyr after separately importing the two blocks of data. Or use awk to append ",NA,NA" to the first 3169933 lines. — IRTFM, Nov 15 '20 at 02:57
What do you mean by "fill = TRUE did not help"? What was the problem when you used `fill=TRUE`? – Vasily A, Nov 15 '20 at 03:58
@VasilyA: when I set fill=TRUE the error still occurs: Stopped early on line 3169933. Expected 136 fields but found 138. – duy ngọc, Nov 15 '20 at 05:11
@Severin Pappadeux: I use RStudio with R version 3.6.3 and data.table version 1.13.2. – duy ngọc, Nov 15 '20 at 05:12

Vasily A · Accepted Answer · 2020-11-15T06:12:05.713

2

Not sure why you still have the problem even with fill=T... But if nothing helps, you can try playing with something like this:

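tryCatch(
  expr    = {dt1 <<- fread(file_path)},
  warning = function(w) {
    cat('Warning: ', w$message, '\n\n')
    # Extract <n> from "Stopped early on line <n>. ..."; NA for any other warning
    n_line <- suppressWarnings(as.numeric(
      gsub('Stopped early on line (\\d+)\\..*', '\\1', w$message)))
    if (!is.na(n_line)) {
      cat('Found ', n_line, '\n')
      # Line n_line is the first one with the extra fields, so part 1 is the
      # n_line - 1 lines before it (use n_line - 2 if the file has a header row)
      dt1_part1 <- fread(file_path, nrows = n_line - 1)
      # skip = n_line - 1 starts part 2 at line n_line, so that line is kept
      dt1_part2 <- fread(file_path, skip = n_line - 1)
      # fill = TRUE pads the columns missing from part 1 with NA
      dt1 <<- rbind(dt1_part1, dt1_part2, fill = TRUE)
    }
  },
  finally = cat("\nFinished. \n")
)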

The tryCatch() construct catches the warning, so you can extract the line number from its message and process the file accordingly.
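The extraction step can be tried on its own, without reading the file. This is a minimal sketch, assuming the warning text has the form quoted in the question; the string `msg` is copied from there, and `hit` and `n_line` are only illustrative names.

```r
# Warning text as quoted in the question (assumed to be representative)
msg <- "Stopped early on line 3169933. Expected 136 fields but found 138. Consider fill=TRUE and comment.char=. First discarded non-empty line:"

# Locate "line <digits>" in the message, then keep only the digits
hit    <- regmatches(msg, regexpr("line [0-9]+", msg))
n_line <- as.integer(sub("line ", "", hit))

print(n_line)
```

If the message does not contain a line number, `hit` is empty and so is `n_line`, so checking `length(n_line) == 1` before using it is a simple guard.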

edited Nov 15 '20 at 06:12

answered Nov 15 '20 at 06:04

Vasily A


About fill=TRUE: [Case 1] if the first part of my file has 138 columns and the second part has 136 columns, fill=TRUE helps (it fills the two missing columns in the second part with NA), but [Case 2] if the first part has 136 columns and the second part has 138 columns, fill=TRUE does not help. In Case 2 your solution reads the file smoothly. Thank you so much again! – duy ngọc Nov 15 '20 at 07:20

score 0 · Answer 2 · answered Nov 15 '20 at 04:15

0

Try reading the two parts separately, then combine them after adding two extra columns to the first part.

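library(data.table)
library(dplyr)
# Part 1: the 136-column lines before line 3169933. The extras are named V137/V138 to
# match fread's default names in part 2 (assumes the file has no header row).
first_part = fread('data.dat', nrows = 3169932) %>%
  mutate(V137 = NA, V138 = NA)

# Part 2 starts at line 3169933, the first 138-column line
second_part = fread('data.dat', skip = 3169932)
df = bind_rows(first_part, second_part)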
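When the two parts are combined, columns are matched by name, so the names of the extra columns matter. The toy example below is only a sketch with made-up tables (`part1`, `part2`, and their values are hypothetical); it uses data.table's `rbindlist()` with `fill = TRUE`, which pads columns missing from one part with NA.

```r
library(data.table)

# Hypothetical stand-ins for the two parts of the file:
# the narrow part has 2 columns, the wide part has 4
part1 <- data.table(V1 = 1:2, V2 = 3:4)
part2 <- data.table(V1 = 5L, V2 = 6L, V3 = 7L, V4 = 8L)

# fill = TRUE adds V3 and V4 to the rows from part1 and sets them to NA
combined <- rbindlist(list(part1, part2), fill = TRUE)
print(combined)
```

With `fill = TRUE` there is no need to add the extra columns by hand first, as long as the shared columns carry the same names in both parts.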

answered Nov 15 '20 at 04:15

Gejun


@ Anderson Zhu: Thank you for your help. The problem is that some DAT files stop early at line 3169933, while others stop at a different line (e.g. 2886321 or 3500212), so I cannot hard-code nrows in the general case. I also tried to extract the line number (such as 3169933 here) from the warning message, but could not get it to work. Do you have any suggestion for how to extract the line number from the warning? If the failing line is known, I can split the file into two parts and combine them as you suggest. – duy ngọc Nov 15 '20 at 05:20
Another possible solution is to read your data line by line so that you can check the number of columns in each row. See https://stackoverflow.com/questions/11664075/import-dat-file-into-r Hope this helps. – Gejun Nov 15 '20 at 16:31
@ Anderson Zhu: Thank you so much for your proactive support. I will try this solution and compare it with the method above! – duy ngọc Nov 16 '20 at 08:52
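The per-line column check suggested in the comments can be sketched with base R's `count.fields()`, which returns the number of fields on each line. The example is a sketch under assumptions: the small temporary file stands in for the real data, the comma separator is a guess (the question does not state it), and `n_fields` and `first_wide` are illustrative names. On a 3 GB file this scans every line, so it may be slow.

```r
# Hypothetical small file: 3 lines with 2 fields, then 2 lines with 4 fields
tmp <- tempfile(fileext = ".dat")
writeLines(c("1,2", "3,4", "5,6", "7,8,9,10", "11,12,13,14"), tmp)

# One field count per line; sep = "," is an assumption about the file format
n_fields <- count.fields(tmp, sep = ",")

# First line whose field count differs from that of line 1 (NA if none does)
first_wide <- which(n_fields != n_fields[1])[1]

print(n_fields)
print(first_wide)
```

The value of `first_wide` plays the same role as the line number reported in the fread() warning, so it can be used to decide where to split the file without relying on the warning text.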
