
I have a big dataset (.tab file) of more than 30 GB which my current PC cannot open in R. Can I somehow load only rows n:m of the file?

The point is that the data has about 20k columns but I need only a few of them. My idea is to load a subset of rows, let's say the first 100k, select only the relevant columns, and save the data. Then I could open the next 100k rows, save them, and so on. All those created data files together will be smaller than the original .tab file because I need only a few of the 20k columns. Thus, finally, I can open all the created datasets and save them as one file. In order to do this I need to know how to load rows n:m of a .tab file.

All I found so far is the `nrows` argument in the `read.table` function. But this expects only one number (it always loads rows 1:m).

Alternatively, it would be even easier if there were a way to directly open only the relevant columns. Unfortunately, I did not find a way to do so.

asked by LulY
  • You can use the `skip` argument. – Konrad Rudolph Mar 22 '23 at 07:49
  • 1
    Look here: https://stackoverflow.com/questions/25932628/how-to-read-a-subset-of-large-dataset-in-r – Mario Mar 22 '23 at 07:51
  • 1
    Or use `fread` from `data.table`. It has a `select` argument with which you can select your relevant columns names. See more here https://stackoverflow.com/a/33201353/5621619 – Mario Mar 22 '23 at 07:54
  • You guys are awesome - and fast! Special thanks to @Mario: This answers my question best! – LulY Mar 22 '23 at 07:55
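A minimal sketch of the `fread` approach suggested in the comment above, assuming the file is tab-delimited and the few needed columns are known by name; the file name and column names here are placeholders, not from the original post:

library(data.table)

# Hypothetical column names -- replace with the few columns actually needed
wanted <- c("id", "age", "income")

# fread reads only the selected columns, so the other ~20k columns
# are never held in memory
dt <- fread("filename.tab", sep = "\t", select = wanted)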

1 Answer


Use the `skip` parameter for that.

Example code that loads only rows 100,000 to 200,000 of a .tab file:

data <- read.table("filename.tab", sep="\t", header=FALSE, skip=99999, nrows=100001)

In this example, `skip = 99999` skips the first 99,999 rows, so reading starts at row 100,000, and `nrows = 100001` reads through row 200,000. Note that a header line, if the file has one, counts towards the skipped lines as well.
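For the full plan from the question (read the file in 100k-row chunks, keep only the relevant columns, combine the pieces), a base-R sketch along these lines could work; the file name, column indices, and chunk size are assumptions for illustration:

infile     <- "filename.tab"
chunk_size <- 100000
keep_cols  <- c(1, 5, 12)   # hypothetical indices of the relevant columns

pieces <- list()
skip   <- 0
repeat {
  chunk <- tryCatch(
    read.table(infile, sep = "\t", header = FALSE, skip = skip, nrows = chunk_size),
    error = function(e) NULL   # read.table errors once no lines are left to read
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  pieces[[length(pieces) + 1]] <- chunk[, keep_cols]   # keep only the needed columns
  skip <- skip + chunk_size
}
result <- do.call(rbind, pieces)

Each pass re-scans the skipped lines, so this gets slower towards the end of the file; reading only the needed columns in one go (e.g. `fread` with `select`, or `read.table` with `colClasses` set to `"NULL"` for the unwanted columns) avoids the repeated scanning. Passing explicit `colClasses` also keeps column types consistent across chunks.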