1

I am trying to read a file which in default is supposed to have 7 columns but probably there might be some commas within some strings which is causing other rows to have more than 7 columns. Regardless of which info that is in other columns my only goal is to read the first 7 columns. However, fread is not reading the whole file even after adding the argument select = 1:7

> data <- fread("dpp.DAT",header=FALSE, fill=T, select = 1:7, sep=", ",stringsAsFactors = F)
Warning message:
In fread("dpp.DAT", header = FALSE, fill=T, select = 1:7,sep = ",", stringsAsFactors = F) :
  Stopped early on line 45922. Expected 7 fields but found 8. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<84172666,DS,BRAND 4 - DERIVATIVE,#PL LOC BDD  :  BDD - BRAND 3 - DERIVATIVE,37324,BLEND-A-MD-INSPRD-BY-NTR-SGHH,BLEND B MAR INSPIRED BY OTHER CHAMOMILE, VAG + HHHH>>

Is there trick you can suggest to read all the rows of the file?

Sample dataset

Macosso
  • 1,352
  • 5
  • 22

3 Answers3

3

Say we have a text file "test.txt" like this:

a,b,c
d,e,f
g,h,i,j
k,l,m

We can read it in and set FILL=T and then subset the final column out:

> fread("test.txt", fill=T)[,-4]
   V1 V2 V3
1:  a  b  c
2:  d  e  f
3:  g  h  i
4:  k  l  m

Or, set select=1:3:

> fread("test.txt", fill=T, select = 1:3)
   V1 V2 V3
1:  a  b  c
2:  d  e  f
3:  g  h  i
4:  k  l  m

EDIT

The solution was to use the cut unix command as such:

terminal$ cut Test_Fread_column.DAT -d',' -f1-7 > tmp
R> fread("tmp")
user438383
  • 5,716
  • 8
  • 28
  • 43
  • I forgot to mention in the question, but I already attempted to provide argument ```fill=T``` but I still get the same error – Macosso Dec 08 '21 at 14:56
  • @Macosso probably this will be hard to fix unless you provide a sample of the data. Can you post a reduced subset of the data, including the offending lines? – user438383 Dec 08 '21 at 14:58
  • You can use the following dataset to test https://github.com/Macosso/Test_R_fread – Macosso Dec 08 '21 at 15:49
  • OK. It was easier just to do ``cut Test_Fread_column.DAT -d',' -f1-7 > tmp`` (in the unix shell) and then ``fread("tmp")``, which worked fine. – user438383 Dec 08 '21 at 19:45
  • can you just update your answer with this command so that it can get approved. Also can you share the command for windows CMD if you have have? – Macosso Dec 09 '21 at 09:10
  • Don't know the windows command, but check [this post](https://stackoverflow.com/a/25067069/5784757) out. – user438383 Dec 09 '21 at 09:22
3

data.table gets finicky about the extra columns showing up in the middle as opposed to the beginning so that's why using select and fill don't work here. What you can do is take all the rows it gives you up front and then try again with skip on the rows you've already loaded. On that second (or more) attempt the extra columns will now be in the beginning so fill and select work as expected. There are probably more elegant ways to do the following but this works

library(data.table)

#capture warnings so we can evaluate what happened last in code
tempfile='tmp321364.txt' 
conn<-file(tempfile, open="r+")
sink(file=conn, type='message')

DT<-list()
while(TRUE) {
  DT[[length(DT)+1]] <- fread(filename, header=FALSE,stringsAsFactors = F, fill=T, select=1:7, skip=ifelse(length(DT)>0,sum(sapply(DT, nrow)),0))
  if(nrow(DT[[length(DT)]])==0) break
  warns<-readLines(conn)
  if(length(warns)==3) { #The warning about extra columns is 3 lines long
    DT[[length(DT)+1]]<-  fread(filename, header=FALSE,stringsAsFactors = F, fill=T, select=1:7, skip=sum(sapply(DT, nrow)))
    if(nrow(DT[[length(DT)]])==0) break
  } else { #an error about skipping too many rows is not 3 lines, assuming away other issues
    break
  }
}
DT<-rbindlist(DT)
sink(NULL, type='message')
close(conn)
rm(tempfile)

With your exact data, you don't need the while(TRUE) loop but if, for example, there were a 10th column that shows up even further down then this will work for those cases.

Dean MacGregor
  • 11,847
  • 9
  • 34
  • 72
1

Dean's answer provides more automation than mine. Whenever I hit this problem (which is actually probably poorly formatted data), I resort to manually finding and then rebuilding the extract with rbind:

s1 <- fread("Extract.txt",
    nrows=674170,
    strip.white = TRUE,
    fill = TRUE,
    blank.lines.skip = TRUE,
    encoding="UTF-8")

s2 <- fread("Extract.txt",
    strip.white = TRUE,
    fill = TRUE,
    blank.lines.skip = TRUE,
    skip=674170,
    encoding="UTF-8")
# ad.infinitum until you complete "Extract.txt"
s3 <- rbind(s1,s2)
rm(s1)
rm(s2)
rferrisx
  • 1,598
  • 2
  • 12
  • 14