6

I have a huge dataset in the form of a txt file with values separated by semi colons and has close to 2M rows. I need data only corresponding to particular dates in the first col. Sample input is shown below:

Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3
16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000
16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000
16/12/2006;17:26:00;5.374;0.498;233.290;23.000;0.000;2.000;17.000

Please help me to filter data corresponding to two dates say 1/2/2007 and 2/2/2007

Jaap
  • 81,064
  • 34
  • 182
  • 193
Sundararaman P
  • 351
  • 1
  • 7
  • 12
  • It is relatively easy to drop columns in a `read.csv` call through the col.classes argument. If your data is ordered by date, you could figure out the set of rows you want to read manually and then use the skip and nrows argument to read in those rows. – lmo Dec 09 '16 at 19:18
  • Maybe some ideas here: http://stackoverflow.com/questions/6592219/read-csv-from-specific-row/6592373#6592373 – MrFlick Dec 09 '16 at 19:20
  • 1
    Also, if this dataset gets close to eating up your available memory, you should take a look at `fread` in `data.table` or `read.csv.raw`/ `read.chunk` in `iotools`. – lmo Dec 09 '16 at 19:21
  • Pretty much the same question as this: http://stackoverflow.com/questions/24006475/subsetting-data-based-on-a-date-range-in-r/24006532#24006532 – MrFlick Dec 09 '16 at 19:22

1 Answers1

13

Here's a good answer on filtering during data import: https://stackoverflow.com/a/15967406/1152809

Basically, you need to use sqldf to filter during import. Here's something like what you need:

install.packages("sqldf")
library(sqldf)
df <- read.csv.sql("sample.csv", "select *, from file where Date = '01/02/2007' or Date = '2/2/2007 ", sep=";")

However, I haven't tested this because you didn't give us a dput of your data. Take a look at this post for info on how to do a good post on R.

Your dates are strings, so they can use the above. However, if you want to use date-specific functions like BETWEEN, you're going to need to change them to the correct format. Here's a sample:

df <- read.csv.sql("sample.csv", "select *, strftime('%d/%m/%Y', Date) as DateFormated from file where DateFormatted >= 1/2/2007 and DateFormatted <= 2/2/2007 ", sep=";")
Community
  • 1
  • 1
Travis Heeter
  • 13,002
  • 13
  • 87
  • 129