I have a dataset with about 2 million rows, and I want to read only a subset of it rather than the whole thing. The dataset contains a date column, so I just want to read the rows that fall within a date range; reading the whole dataset would be time-consuming and a waste of memory. How can I accomplish this? Can anyone guide me?
-
Read the whole dataset with `fread` from package data.table or use package sqldf. See also: http://stackoverflow.com/q/1727772/1412059 – Roland Sep 19 '14 at 11:21
1 Answer
Use the `skip=` parameter in `read.table`:
read.table("file.txt",skip= ,nrows= )
Both `skip=` and `nrows=` take whole numbers, so just add them after the `=`. `skip=` is the number of lines to skip before reading begins, and `nrows=` is the maximum number of rows to read from that point on.
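As a minimal sketch of how the two arguments work together, the example below builds a small temporary file and reads three rows from the middle of it. The file, its column names (`Date`, `value`) and its contents are made up for illustration and are not the asker's data.

```r
# Hypothetical sample file: one header line plus ten dated rows.
tmp <- tempfile(fileext = ".txt")
dat <- data.frame(Date = format(as.Date("2005-12-27") + 0:9), value = 1:10)
write.table(dat, tmp, row.names = FALSE, quote = FALSE)

# Skip 6 lines (the header and the first 5 data rows), then read 3 rows.
# The header line is skipped as well, so column names are supplied by hand.
sub <- read.table(tmp, skip = 6, nrows = 3,
                  col.names = c("Date", "value"),
                  stringsAsFactors = FALSE)
print(sub)
```

Note that `skip=` counts lines of the file, not data rows, so a header line has to be included in the count.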
I suggest reading https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html if you haven't done so already.
Also, please see one of my questions:
R - Reading lines from a .txt-file after a specific line
It, somewhat, touches the same subject.
The other possible way might be to use `grep()` in `skip=`:
read.table(...,skip=grep("2005-12-31", readLines("File.txt")),nrows=365)
This line skips every line up to and including the one matched by `grep()` and reads the lines after that. The `nrows=` argument stops the reading after 365 lines (this way you have read one year of dates, provided one line equals one date).
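If a date can appear on more than one line, a variation of the same idea is to look up the first line of the start date and the last line of the end date, and derive `skip=` and `nrows=` from those. The sketch below assumes a whitespace-separated file whose lines begin with the date in `YYYY-MM-DD` form and that both boundary dates actually occur in the file; the sample file and column names are hypothetical. Like the one-liner above, it still passes over the whole file once with `readLines()`.

```r
# Hypothetical sample file: a header line and two rows per date.
tmp <- tempfile(fileext = ".txt")
dat <- data.frame(Date = format(rep(as.Date("2005-12-30") + 0:5, each = 2)),
                  value = 1:12)
write.table(dat, tmp, row.names = FALSE, quote = FALSE)

lines <- readLines(tmp)

# Line numbers of the first row of the start date and the last row of the end date.
first <- min(grep("^2006-01-01", lines))
last  <- max(grep("^2006-01-02", lines))

# Skip everything before the first matching line, then read through the last one.
sub <- read.table(tmp, skip = first - 1, nrows = last - first + 1,
                  col.names = c("Date", "value"),
                  stringsAsFactors = FALSE)
print(sub)
```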
This seems kinda complicated, but it's the only way I know how to solve this.
-
If I don't know the starting date in the file, how can I count the number of rows to skip? – Zeeshan shaikh Sep 19 '14 at 11:26
-
Can you specify a bit? What kind of a file are you reading? What is the choice criteria regarding the date where the reading should begin? I mean, you must have some idea of what dates you want to import? Or am I missing something here. – Olli J Sep 19 '14 at 11:31
-
Yes, let me clarify: I have a text file with a Date column, and I have to read the data between two dates, i.e. 2006-01-01 to 2007-01-01. – Zeeshan shaikh Sep 19 '14 at 11:33
-
Please see my edits in the original answer. I recommend viewing your text files with Notepad++. It shows you the row numbers by default. Knowing the row numbers really helps when you are reading files into R with `read.table`. – Olli J Sep 19 '14 at 11:45
-
Okay, but the problem remains if there are two rows for a single date. – Zeeshan shaikh Sep 19 '14 at 11:59
-
That is true. You still might want to familiarize yourself with the link @Roland posted in his comment. Most problems in R can be solved in multiple ways. And as I said, I only know that rather cumbersome solution to your problem. – Olli J Sep 19 '14 at 12:06
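For the situation raised in these comments (unknown row numbers, several rows per date), one option that avoids counting rows at all is to read the file in chunks and keep only the rows whose date falls in the range. The following is a rough base R sketch under stated assumptions, not a tuned solution: the function `read_date_range()` and its arguments are hypothetical names, and the file is assumed to be whitespace-separated with a header line and a `Date` column in `YYYY-MM-DD` form.

```r
# Hypothetical helper: read `path` chunk by chunk and keep rows whose
# Date lies between `from` and `to` (both inclusive, both of class Date).
read_date_range <- function(path, from, to, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header_line <- readLines(con, n = 1)
  pieces <- list()
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break
    # Parse the chunk together with the header so column names are kept.
    chunk <- read.table(text = c(header_line, lines), header = TRUE,
                        stringsAsFactors = FALSE)
    d <- as.Date(chunk$Date)
    pieces[[length(pieces) + 1L]] <- chunk[d >= from & d <= to, , drop = FALSE]
  }
  do.call(rbind, pieces)  # NULL if the file has no data rows
}

# Hypothetical sample file with two rows per date.
tmp <- tempfile(fileext = ".txt")
dat <- data.frame(Date = format(rep(as.Date("2005-12-30") + 0:5, each = 2)),
                  value = 1:12)
write.table(dat, tmp, row.names = FALSE, quote = FALSE)

res <- read_date_range(tmp, as.Date("2006-01-01"), as.Date("2006-01-02"),
                       chunk_size = 4L)
print(res)
```

Only one chunk is held in memory at a time besides the rows already kept, so `chunk_size` trades memory for the number of parsing passes. Every line of the file is still read once.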