4

I have a large CSV file and I only want to import select certain rows if it. First I create the indices of the rows that will be imported then I wish to pass the names of these rows to sqldf and return the full records for specified rows.

#create the random rows ids that will be sampled
library(dplyr)
#range for the values
index<-c(1:20)
index<-as.data.frame(as.matrix(index))
#number of values to be returned
number<-5
ids<-sample_n(index,number)

#sample the data
library(sqldf)
#filepath
f<-file("/Users/.../filename.csv")
#select data    
df<-sqldf("select * from f")

How to import a selection of rows from a CSV file by specifying the row numbers?

zx8754
  • 52,746
  • 12
  • 114
  • 209
SharkSandwich
  • 197
  • 1
  • 8
  • [Relevant post: Quickest way to read a subset of rows of a CSV](http://stackoverflow.com/a/25244592/680068) – zx8754 May 29 '15 at 11:58

2 Answers2

4

Try this example:

library(sqldf)

#dummy csv 
write.csv(data.frame(myid=1:10,var=runif(10)),"temp.csv")

#define ids
ids <- c(1,3,4)
ids <- paste(ids,collapse = ",")

f <- file("temp.csv")

#query with subset
fn$sqldf("select *
          from f
          where myid in ($ids)",
          file.format = list(header = TRUE, sep = ","))

#output
#     X myid       var
# 1 "1"    1 0.2310945
# 2 "3"    3 0.8825055
# 3 "4"    4 0.6655517

close(f)
zx8754
  • 52,746
  • 12
  • 114
  • 209
  • @G.Grothendieck, for some reason I couldn't pass external variable for IN operator, so used `paste` instead. – zx8754 May 29 '15 at 11:25
  • 1
    Thanks. I was trying something similar but couldn't get the 'where' clause to work. – SharkSandwich May 29 '15 at 12:22
  • @G.Grothendieck Yes, this works (updated the post), I meant to say, avoid collapsing ids into a string. – zx8754 May 29 '15 at 15:58
  • 1
    This also works: `fn$read.csv.sql("temp.csv", sql = "select * from file where myid in ($ids)")` and `toString(ids)` could be used in place of `paste(ids, collapse = ",")` – G. Grothendieck May 29 '15 at 16:54
1

maybe something base R like this ...

# dummy csv 
write.csv( data.frame( myid=1:10, var=runif(10) ),"temp.csv")

# define ids
ids <- c(1,3,4)

# reading from line 3 to 4 / reading 2 lines
read.table("temp.csv", header=T, sep=",", skip=2, nrows=2)

##   X2 X2.1 X0.406697876984254
## 1  3    3          0.6199803
## 2  4    4          0.0271722


# selctive line retrieval function 
dummy <- function(file, ids){
  tmp <- 
    mapply(
      read.table, 
      skip=ids, 
      MoreArgs= list(nrows=1, file=file, sep=",") ,
      SIMPLIFY = FALSE
    )
  tmp_df <- do.call(rbind.data.frame, tmp)
  names(tmp_df) <- names(read.table("temp.csv", header=T, sep=",",nrows=1))
  return(tmp_df)
}

# et voila
dummy("temp.csv", ids)

##   X myid       var
## 1 1    1 0.9040861
## 2 3    3 0.6027502
## 3 4    4 0.6829611
petermeissner
  • 12,234
  • 5
  • 63
  • 63