
I am using the LaF package to read a large pipe-delimited file with 135M rows and 22 columns (~15 GB of raw data). Unfortunately, the raw file has random header notes in its first 4 lines, followed by the column headers.

Edit: I am sorry, I should have mentioned earlier: I am on Windows Server 2012 R2.

The data is as follows:

gpg: encrypted with 1024-bit ELG key, ID XXXXXXXX, created 2006-10-30
***email id*** 
gpg: encrypted with 2048-bit RSA key, ID XXXXXXXX, created 2014-12-05
***email id*** 
COLUMN HEADERS (22) 
DATA 
. 
. 
.

I can build the model properly by skipping the first 4 lines:

modelF1 <- detect_dm_csv("trxn_.txt", sep="|", header=TRUE, nrows=10000, skip=4)
dfF1Laf <- laf_open(modelF1)

But when I try to skip the first 4 lines using `goto`, it gives me the following error:

goto(dfF1Laf,6)

Error in goto(dfF1Laf , 6) : Line has too many columns

How do I get around this?

I need to be able to summarize the data, which is why I chose this package; it seemed well suited to my purpose. I have tried ffdf and data.table::fread, but they were either too slow or could not fit the data in RAM.

I am open to using other packages as well.

  • Take a look at the `iotools` package. – lmo Dec 08 '16 at 14:25
  • what OS are you on? Can you use `tail --lines=+4` in the shell? http://stackoverflow.com/questions/604864/print-a-file-skipping-x-lines-in-bash – Ben Bolker Dec 08 '16 at 14:33
  • I am on Windows Server 2012 R2 – SatZ Dec 08 '16 at 15:20
  • I was really short on time and had no option but to tamper with the input raw file. I used _cygwin_ and _sed_ to get rid of the first 4 lines. Installed cygwin and used `sed -i 1,4d myfile.txt` – SatZ Dec 09 '16 at 09:06
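For anyone hitting the same wall: the junk lines can be stripped outside R before LaF ever sees the file, as the comments suggest. Below is a minimal sketch of the two shell approaches mentioned above, run against a small stand-in file (`trxn_demo.txt` is a made-up name, not the real data). It assumes a GNU coreutils environment such as Cygwin or Git Bash on Windows. One off-by-one to watch for: `tail -n +N` starts printing *at* line N, so skipping the first 4 lines needs `+5`, not `+4`.

```shell
# Build a small stand-in for the real file: 4 junk header lines,
# then a pipe-delimited column-header row and two data rows.
printf 'gpg note 1\nemail id\ngpg note 2\nemail id\n' >  trxn_demo.txt
printf 'a|b|c\n1|2|3\n4|5|6\n'                        >> trxn_demo.txt

# Option 1: tail -n +5 starts at line 5, i.e. drops the first 4 lines,
# and writes the result to a new file, leaving the original untouched.
tail -n +5 trxn_demo.txt > trxn_clean.txt

# Option 2 (equivalent, edits the file in place like the sed command
# in the comment above):
#   sed -i '1,4d' trxn_demo.txt

# The cleaned file now starts with the column headers.
head -n 1 trxn_clean.txt
```

Either way, the cleaned file can then be fed to `detect_dm_csv` / `laf_open` without any `skip` gymnastics, and `goto` should no longer trip over the malformed header lines.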

0 Answers