I am reading a file with several thousand columns, but I am only interested in the first 10. How can I tell fread to read the first 10 columns and then concatenate all the remaining ones into a single column? I am assuming this would significantly speed up reading the file.

Frank
Parsa
  • Do you mean that you don't care about the information after the first 10 columns? If that's the case, just use the `select` argument... – MichaelChirico Apr 30 '17 at 00:03
  • @MichaelChirico I do care about the information after the first 10 columns; I just don't necessarily need it to be processed as columns in the R data frame. For example, I might want to change the order of the rows or subset the rows. – Parsa May 01 '17 at 13:57
  • I don't think I understand. Unless you provide a [minimal reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) my recommendation is still just `select = 1:10` – MichaelChirico May 01 '17 at 14:11

2 Answers

You could do it with awk:

> fread("../foo.csv")
       a     b     c     d     e     f     g     h     i
   <int> <int> <int> <int> <int> <int> <int> <int> <int>
1:     1     2     3     4     5     6     7     8     9
2:     2     3     4     5     6     7     8     9    10
> fread("cat ../foo.csv | awk -F ',' 'BEGIN { s = 5 } { for (i=1; i<=NF; i++) printf(\"%s%s\", $(i), i<s ? OFS : i<NF ? \"\" : ORS) }'")
       a     b     c     d  efghi
   <int> <int> <int> <int>  <int>
1:     1     2     3     4  56789
2:     2     3     4     5 678910
> 

But if this doesn't parse cleanly on the data you are actually working with, I'd probably drop the approach. An alternative would be to concatenate the columns after the file has been read in. I'm also skeptical that this would speed up the fread operation much.
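
To check the awk step on its own before handing it to fread, here is a minimal sketch run directly in a shell. The temporary file is a hypothetical stand-in for `../foo.csv`, and `s = 5` is just the value used above; for the question as asked (keep the first 10 columns) it would presumably be `s = 11`.

```shell
# Hypothetical sample file standing in for ../foo.csv from the answer.
tmp=$(mktemp -d)
printf 'a,b,c,d,e,f,g,h,i\n1,2,3,4,5,6,7,8,9\n2,3,4,5,6,7,8,9,10\n' > "$tmp/foo.csv"

# s is the index of the first column to merge: fields 1..s-1 are each
# followed by OFS (a space by default), fields s..NF-1 are followed by
# nothing, and the last field is followed by ORS (a newline).
merged=$(awk -F ',' 'BEGIN { s = 5 } {
  for (i = 1; i <= NF; i++)
    printf("%s%s", $i, i < s ? OFS : i < NF ? "" : ORS)
}' "$tmp/foo.csv")

echo "$merged"
rm -r "$tmp"
```

Note that the merged fields are written back to back, so the boundaries between them are lost (`56789` could have come from several different splits). If the trailing columns need to be recoverable later, the empty string in the middle branch would have to be swapped for a delimiter that does not occur in the data.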

Clayton Stanley

I may be mistaken, but I don't think that's possible directly when importing the data. After reading the file, though, you can copy just the first 10 columns into a new data frame and remove the old one. If you read your data into `df`, you can simply do the following (NB: code not tested):

 df10 <- df[, 1:10]  # keep only the first 10 columns
 rm(df)              # drop the full table so its memory can be reclaimed

That way you remove the big data frame from memory. Someone with more experience in reading big files may have other opinions or suggestions.

Umberto
  • As far as I know, `select = 1:10` does do that in fread. You skipped over the OP's (odd) idea about pasting/concatting the other columns together, btw. – Frank Apr 28 '17 at 20:48
  • Thanks @Frank. You are absolutely right. From the docs `select Vector of column names or numbers to keep, drop the rest.` (https://www.rdocumentation.org/packages/data.table/versions/1.10.4/topics/fread). Thanks for pointing that out. – Umberto Apr 28 '17 at 20:54