
I have a big dataset (.tab file) of more than 30 GB which my current PC cannot open in R. Can I somehow load only rows n:m of the file?

The point is that the data has about 20k columns but I need only a few of them. My idea is to load a subset of rows, let's say the first 100k, select only the relevant columns, and save the data. Then I could open the next 100k rows, save them, and so on. All those created data files together will be smaller than the original .tab file because I need only a few of the 20k columns. Thus, finally, I can open all the created datasets and save them as one file. In order to do this I need to know how to load rows n:m of a .tab file.

All I found so far is the `nrows` argument in the `read.table` function. But this expects only one number (it always loads rows 1:m).

Alternatively, it would be even easier if there were a way to directly open only the relevant columns. Unfortunately, I did not find a way to do so.

asked by LulY
  • You can use the `skip` argument. – Konrad Rudolph Mar 22 '23 at 07:49
  • 1
    Look here: https://stackoverflow.com/questions/25932628/how-to-read-a-subset-of-large-dataset-in-r – Mario Mar 22 '23 at 07:51
  • 1
    Or use `fread` from `data.table`. It has a `select` argument with which you can select your relevant columns names. See more here https://stackoverflow.com/a/33201353/5621619 – Mario Mar 22 '23 at 07:54
  • You guys are awesome - and fast! Special thanks to @Mario: This answers my question best! – LulY Mar 22 '23 at 07:55
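A minimal sketch of the `fread` approach suggested in the comment above, assuming the file is tab-delimited and the few needed columns are known by name; the file name and column names here are placeholders, not from the original post:

library(data.table)

# Hypothetical column names -- replace with the few columns actually needed
wanted <- c("id", "age", "income")

# fread reads only the selected columns, so the other ~20k columns
# are never held in memory
dt <- fread("filename.tab", sep = "\t", select = wanted)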

1 Answer


Use the `skip` parameter for that.

Example code that loads only rows 100,000 to 200,000 of a .tab file:

data <- read.table("filename.tab", sep="\t", header=FALSE, skip=99999, nrows=100001)

In this example, `skip = 99999` skips the first 99,999 rows, so reading starts at row 100,000, and `nrows = 100001` reads through row 200,000. Note that a header line, if the file has one, counts towards the skipped lines as well.
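For the full plan from the question (read the file in 100k-row chunks, keep only the relevant columns, combine the pieces), a base-R sketch along these lines could work; the file name, column indices, and chunk size are assumptions for illustration:

infile     <- "filename.tab"
chunk_size <- 100000
keep_cols  <- c(1, 5, 12)   # hypothetical indices of the relevant columns

pieces <- list()
skip   <- 0
repeat {
  chunk <- tryCatch(
    read.table(infile, sep = "\t", header = FALSE, skip = skip, nrows = chunk_size),
    error = function(e) NULL   # read.table errors once no lines are left to read
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  pieces[[length(pieces) + 1]] <- chunk[, keep_cols]   # keep only the needed columns
  skip <- skip + chunk_size
}
result <- do.call(rbind, pieces)

Each pass re-scans the skipped lines, so this gets slower towards the end of the file; reading only the needed columns in one go (e.g. `fread` with `select`, or `read.table` with `colClasses` set to `"NULL"` for the unwanted columns) avoids the repeated scanning. Passing explicit `colClasses` also keeps column types consistent across chunks.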