
Let's take the following simplified version of a dataset that I import using read.table:

df <- data.frame(Sex = c("M", "M", "F", "F", "F"),
                 Age = c(25, 22, 33, 17, 18))

In reality my dataset is extremely large and I'm only interested in a small proportion of the data, i.e. the data concerning Females aged 18 or under. In the example above this would be just the last two observations.

My question is: can I import just these observations directly, without importing the rest of the data and then using subset to refine my dataset? My computer's capacity is limited, so I have been using scan to import my data in chunks, but this is extremely time-consuming.
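
In other words, I'd like to avoid the whole-file equivalent of this (the file name is just a placeholder):

# read everything, then throw most of it away
df <- read.table("mydata.txt", header = TRUE, sep = ";")
young_females <- subset(df, Sex == "F" & Age <= 18)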

Is there a better solution?

user2568648
  • Where did you get the data from? – Fernando Jan 31 '14 at 17:34
  • The data is imported from a semi-colon delimited text file. – user2568648 Jan 31 '14 at 17:39
  • Yes, but how is this file generated? If your file is already mixed, you can pre-process it using C, Java or another language for speed... then use R for the analysis. – Fernando Jan 31 '14 at 17:42
  • The file is directly downloaded from the website of the Insee - http://www.insee.fr/fr/themes/detail.asp?reg_id=99&ref_id=fd-rp2010&page=fichiers_detail/RP2010/telechargement.htm - also I'm not experienced in C or Java – user2568648 Jan 31 '14 at 17:45
  • There are some R packages that can help you with memory/speed issues. Or you can create a new SO question so others can help you pre-process the data using other tools. – Fernando Jan 31 '14 at 17:50
  • You may be able to resolve your problem by using the 'ff' package. This leaves your data primarily on the disk, instead of loading it all into RAM, and the result is that you can perform functions on 'ffdf' objects that are much larger than your available RAM. – Dinre Jan 31 '14 at 18:14
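
A minimal sketch of the ff approach from the last comment (the file name is a placeholder, and pulling the Sex and Age columns back into RAM with `[]` assumes those two columns alone fit in memory):

library(ff)
# disk-backed import: the bulk of the data stays on disk rather than in RAM
big <- read.table.ffdf(file = "bigfile.txt", sep = ";", header = TRUE)
# build a row index from just the two filter columns, then materialise only
# the matching rows as an ordinary in-memory data frame
keep  <- which(big$Sex[] == "F" & big$Age[] <= 18)
small <- big[keep, ]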

3 Answers


Some approaches that might work:

1 - Use a package like ff that can help you with RAM issues.

2 - Use other tools/languages to clean your data before loading it into R.

3 - If your file is not too big (i.e., you can load it without crashing), then you could save it to a .RData file and read from this file (instead of calling read.table):

 # save each txt file once...
 save.rdata = function(filepath, filebin) {
     dataset = read.table(filepath)
     save(dataset, file = paste(filebin, ".RData", sep = ""))
 }

 # then read from the .RData file
 get.dataset = function(filebin) {
     load(paste(filebin, ".RData", sep = ""))
     return(dataset)
 }

This is much faster than reading from a txt file, but I'm not sure if it applies to your case.
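
Usage would then look something like this (the file names are just placeholders):

 # convert the txt file to a binary .RData file once...
 save.rdata("bigfile.txt", "bigfile")

 # ...then, in later sessions, load the much faster binary copy and subset it
 dataset = get.dataset("bigfile")
 females = subset(dataset, Sex == "F" & Age <= 18)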

Fernando

There should be several ways to do this. Here is one using SQL.

library(sqldf)
result = sqldf("select * from df where Sex='F' AND Age<=18")

> result
  Sex Age
1   F  17
2   F  18

There is also a read.csv.sql function that you can use with the statement above to filter the data as it is read, so you avoid reading the whole text file into R!
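
For example, something along these lines (a sketch; the file name is a placeholder, and for the semicolon-delimited file in the question you would also pass sep = ";"):

library(sqldf)
# only the matching rows come back into R; the full file is staged in a
# temporary on-disk SQLite database
result = read.csv.sql("myData.csv",
                      sql = "select * from file where Sex = 'F' AND Age <= 18")
# note: if the character fields in the file are quoted, the literal has to be
# '"F"' instead - see the last answer below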

Drew75
  • I'm trying the `read.csv.sql` function...20 minutes in and no result so far reading a file of 4.5 million rows...maybe I should invest in some new hardware. – user2568648 Jan 31 '14 at 19:07
  • Sounds bad! But, the read.csv.sql won't crash R or go over the memory limits if you add in the option `dbname = tempfile()`. With your large file size, maybe consider the ff package. I only avoid it because the commands are not so... friendly. – Drew75 Jan 31 '14 at 19:11
  • The ff package confuses me - I can quickly import the file, and in my workspace I see the file[1] class object, but I don't know what to do next in order to select the subset and convert it to a data frame. – user2568648 Jan 31 '14 at 19:17
  • It can be sped up somewhat by using the `filter = "grep '[FS]'"` argument of `read.csv.sql` to filter out all the females (and the header). Depending on what your data looks like you might be able to filter out even more. On Windows either use the `grep` from Rtools or else use Windows' `findstr` command. – G. Grothendieck Jan 31 '14 at 23:18
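
Putting the last two comments together, a sketch might look like this (the file name and the grep pattern are assumptions that would need adjusting to the real data):

library(sqldf)
result = read.csv.sql("myData.csv",
                      sql    = "select * from file where Sex = 'F' AND Age <= 18",
                      filter = "grep '[FS]'",   # keep the header (contains "Sex") and rows containing F
                      dbname = tempfile())      # stage the data in an on-disk SQLite file, not in RAM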

This is almost the same as @Drew75's answer, but I'm including it to illustrate some gotchas with SQLite:

# example: large-ish data.frame
df <- data.frame(Sex=sample(c("M","F"),1e6,replace=T),
                 Age=sample(18:75,1e6,replace=T))
write.csv(df, "myData.csv", quote=F, row.names=F)  # note: non-quoted strings

library(sqldf)
myData <- read.csv.sql(file="myData.csv",       # looks for char M (no quotes)
                       sql="select * from file where Sex='M'", eol = "\n")
nrow(myData)
# [1] 500127

write.csv(df, "myData.csv", row.names=F)        # quoted strings...
myData <- read.csv.sql(file="myData.csv",       # this fails
                       sql="select * from file where Sex='M'", eol = "\n")
nrow(myData)
# [1] 0
myData <- read.csv.sql(file="myData.csv",       # need quotes in the char literal
                       sql="select * from file where Sex='\"M\"'", eol = "\n")
nrow(myData)
# [1] 500127
jlhoward