Reading specific data from large dataset based on criteria to avoid reading entire file into memory

Question

Software: R Studio
Version: 0.98.1102
Operating System: Windows 7 Professional

Issue #1: I have a .txt file that is 100MB+. It has 4 variables and over 500,000 observations for each variable.
Issue #2: Assume column1 holds dates that were read in as factors. Is it possible to change the class of only column1 to Date using the colClasses argument of read.csv()?
I currently read the file via:

mydata <- read.csv("myfile", sep = ";", na.strings = "?", stringsAsFactors = FALSE)

Issue #1
The read never finishes on my computer because of the size of the file.

The file has the format

column1 column2    column3
dog          bird    apple
cat          dove   orange
rat          sparrow   kiwi
may          bird    apple
cat          dove   orange
rat          sparrow   kiwi

I'm trying to figure out how to do the following:
1. Read only the rows of the data set where column 1 has "dog"
2. Read only the rows of the data set where column 1 has "dog" and column 2 has "bird"

What I have tried so far: I read that I can load the entire data set and then subset it, but I would really like to avoid that because the file is too large to load in the first place. Instead, I would like to load only the rows that meet specific criteria.

Issue #2
Assume column1 is in the form 05/01/2015 but has the class "factor". Is it possible to change the class of only column 1 to "Date" using the colClasses argument of read.csv? Perhaps something like this?

mydata <- read.csv("myfile", sep = ";", na.strings = "?",   
stringsAsFactors = FALSE, colClasses = c(column1 =as.date(column1))

Or perhaps something like this

mydata <- read.csv("myfile", sep = ";", na.strings = "?",   
stringsAsFactors = FALSE, colClasses = c(column1 =strptime(column1 %MM%DD%YY))

You should provide the version of R, rather than the version of RStudio. — , Jun 04 '15 at 03:31
Not exactly what you want, but `read_csv` from the `readr` package is a lot (~10x) faster than `read.csv`, and of course `fread` from `data.table` is even faster (~2x). — Molx, Jun 04 '15 at 03:47

score 1 · Answer 1 · edited May 23 '17 at 10:30

You can read your data in chunks, say 1000 lines at a time, and subset each chunk.

temp <- read.csv('file.csv', nrows=1000, stringsAsFactors=FALSE)
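On its own, the call above reads only the first 1000 rows. To walk through the whole file, the read has to be repeated chunk by chunk, keeping only the matching rows each time. Here is a minimal sketch in base R, assuming a semicolon-separated file with a header row (as in the question's read.csv call). The names read_filtered and keep, and the small demo file, are made up for illustration.

```r
# Hypothetical helper: read a delimited file chunk by chunk and keep only
# the rows for which keep(chunk) is TRUE, so the full file is never in memory.
read_filtered <- function(path, keep, chunk_size = 1000, sep = ";") {
  con <- file(path, open = "r")
  on.exit(close(con))
  # Read the header line once so every chunk gets the same column names.
  header <- strsplit(readLines(con, n = 1), sep, fixed = TRUE)[[1]]
  pieces <- list()
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break            # end of file reached
    chunk <- read.table(text = lines, sep = sep, header = FALSE,
                        col.names = header, na.strings = "?",
                        quote = "\"", comment.char = "",
                        stringsAsFactors = FALSE)
    # which() drops rows where the condition evaluates to NA.
    pieces[[length(pieces) + 1]] <- chunk[which(keep(chunk)), , drop = FALSE]
  }
  do.call(rbind, pieces)
}

# Demo on a small temporary file (assumed layout, not the asker's real data).
demo_path <- tempfile(fileext = ".txt")
writeLines(c("column1;column2;column3",
             "dog;bird;apple",
             "cat;dove;orange",
             "rat;sparrow;kiwi",
             "dog;dove;apple"), demo_path)

dogs <- read_filtered(demo_path, function(d) d$column1 == "dog",
                      chunk_size = 2)
dog_birds <- read_filtered(demo_path,
                           function(d) d$column1 == "dog" & d$column2 == "bird",
                           chunk_size = 2)
```

Column types are inferred separately for each chunk here, so on real data it may be safer to pass colClasses to read.table as well.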

But using a for loop is not always a good idea in R, so I'd prefer using sqldf:

library(sqldf)
# Replace `condition` with a real filter, e.g. column1 = 'dog'
power <- read.csv.sql("file.csv", sql = "select * from file where condition",
                      header = TRUE)

See more options on how to do that in this question: "How do I read only lines that fulfil a condition from a csv into R".
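On Issue #2: colClasses takes class names as strings, not function calls, so neither attempt in the question will work as written. One hedged option is to read column1 as character and convert it afterwards. The sketch below assumes the dates are month/day/year (the question's 05/01/2015 is ambiguous) and uses a small made-up demo file.

```r
# Demo file with the assumed layout; "?" marks a missing value as in the question.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("column1;column2",
             "05/01/2015;bird",
             "?;dove",
             "12/31/2015;sparrow"), demo_path)

# Force only column1 to character; the other columns are still auto-detected.
mydata <- read.csv(demo_path, sep = ";", na.strings = "?",
                   stringsAsFactors = FALSE,
                   colClasses = c(column1 = "character"))

# Convert after reading. "%m/%d/%Y" is an assumption about the date layout;
# use "%d/%m/%Y" if the day comes first.
mydata$column1 <- as.Date(mydata$column1, format = "%m/%d/%Y")
```

colClasses = c(column1 = "Date") also exists, but it relies on as.Date's default formats (such as 2015-05-01), so it would not fit dates written as 05/01/2015.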

score 0 · Answer 2 · Dinesh · 2015-06-04T04:34:47.507

Read only the rows of the data set where column 1 has "dog". Answer: I saved your data in an object named "data" and subset it with `data[grep("dog", data$column1), ]`.
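Note that this works on a data frame that is already in memory. As a small sketch on made-up sample rows: grep does pattern matching, so it also picks up values that merely contain "dog", while `==` gives an exact match and combines easily with a second condition.

```r
# Made-up sample rows for illustration; "bulldog" is added to show the difference.
data <- data.frame(column1 = c("dog", "cat", "rat", "dog", "bulldog"),
                   column2 = c("bird", "dove", "sparrow", "dove", "bird"),
                   column3 = c("apple", "orange", "kiwi", "apple", "kiwi"),
                   stringsAsFactors = FALSE)

# Pattern match: keeps every row whose column1 contains "dog", including "bulldog".
by_pattern <- data[grep("dog", data$column1), ]

# Exact match on column1 only.
exact <- data[data$column1 == "dog", ]

# Exact match on both columns (the asker's second case).
dog_bird <- data[data$column1 == "dog" & data$column2 == "bird", ]
```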

Hope this helps.

edited Jun 04 '15 at 04:34

answered Jun 04 '15 at 04:17

As far as I know, max.print does not increase memory capacity/usage; it sets the maximum number of lines that print to the terminal. http://stackoverflow.com/questions/6758727/how-to-increase-the-limit-for-max-print-in-r. You should revise or delete. – Pierre L Jun 04 '15 at 04:26
Doesn't this idea require loading the whole data set into memory, which is the problem in the first place? – Fadwa Mar 20 '17 at 14:39
