
I have a huge CSV file, about 15 million rows, around 3 GB in size.

I would like to read this file into R piece by piece, each time selecting only the rows that fit a certain condition.

For example, one of the columns is product type, so I only need to read one type of product into R, process it, and output the result; after that I move on to the next product type...

So far I have read about different approaches, such as uploading the big file into a database, reading column by column with colbycol, or reading a chunk of rows at a time with ff...

Is there any pure R solution that can solve my problem?

linus
    By "pure R" you mean sticking to base with no packages? If yes, why? Is it some limitation in your present setup or just a mental exercise? If you can consider a package, this sounds like a possible candidate for `sqldf`. – A5C1D2H2I1M1N2O1R2T1 Sep 13 '13 at 16:53
  • [**Here**](http://stackoverflow.com/a/15798275/1478381) is a very good answer by @eddi to help, assuming you are willing to look to use command line tools and not just `base::R`. – Simon O'Hanlon Sep 13 '13 at 16:59
  • 1
  • If we're going down the path of non-R solutions, check out [**csvkit**](https://csvkit.readthedocs.org/en/latest/index.html) too. – A5C1D2H2I1M1N2O1R2T1 Sep 13 '13 at 17:04
  • I'd use `data.table`, and in particular `fread`, here if the data fits into your RAM, and only resort to other things if it doesn't. – eddi Sep 13 '13 at 17:32
  • does data.table need to read the whole file into memory first? – linus Sep 16 '13 at 08:48
  • Packages `sqldf` and `RSQLite` do not need any setting up; just install them and they work. The same goes for package `data.table`. And yes, `fread` reads in everything. – ROLO Sep 16 '13 at 09:04

3 Answers


You can use the RSQLite package:

library(RSQLite)
# Create/Connect to a database
con <- dbConnect(SQLite(), dbname = "sample_db.sqlite")

# read csv file into sql database
# Warning: this is going to take some time and disk space, 
#   as your complete CSV file is transferred into an SQLite database.
dbWriteTable(con, name="sample_table", value="Your_Big_CSV_File.csv", 
    row.names=FALSE, header=TRUE, sep = ",")

# Query your data as you like
yourData <- dbGetQuery(con, "SELECT * FROM sample_table LIMIT 10")

dbDisconnect(con)

Next time you want to access your data you can leave out the `dbWriteTable` step, as the SQLite table is stored on disk.

Note: writing the CSV data to the SQLite file does not load all the data into memory first, so the memory you use in the end is limited to the amount of data that your query returns.
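
For the question's use case, a minimal sketch (the column name `product_type` is only a guess at the OP's data) would loop over the distinct product types and pull one at a time, so only one type's rows are ever in memory. With a recent DBI the filter value can be passed via `params`:

library(RSQLite)
con <- dbConnect(SQLite(), dbname = "sample_db.sqlite")

# Find the distinct product types (the column name is an assumption)
types <- dbGetQuery(con, "SELECT DISTINCT product_type FROM sample_table")$product_type

for (tp in types) {
    # Pull only the rows for one product type
    chunk <- dbGetQuery(con, "SELECT * FROM sample_table WHERE product_type = ?",
        params = list(tp))
    # ... process `chunk` and write out its result here ...
}

dbDisconnect(con)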

ROLO

It is possible to do this using only R (a rough sketch follows the list below):

  1. open a connection to the file
  2. if there is a header, then read in the header information
  3. read one line from the file using read.csv specifying colClasses and nrows=1
  4. test that line to see if it meets your condition; if it does, append it to the growing data frame
  5. repeat steps 3 and 4 for the rest of the file
  6. close the connection
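
A rough sketch of that loop, reading a block of rows at a time rather than a single line (the file name, the column name `product_type`, and the filter value "A" are assumptions to adjust to your data):

con <- file("Your_Big_CSV_File.csv", open = "r")

# step 2: read the header line to get the column names
header <- strsplit(readLines(con, n = 1), ",")[[1]]

keep <- list()
repeat {
  # steps 3-5: read a block of rows and keep only those meeting the condition
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = 10000, col.names = header,
             stringsAsFactors = FALSE),  # specify colClasses here too for speed
    error = function(e) NULL)            # read.csv errors once the file is exhausted
  if (is.null(chunk) || nrow(chunk) == 0) break
  keep[[length(keep) + 1]] <- chunk[chunk$product_type == "A", ]
}
close(con)                               # step 6
productA <- do.call(rbind, keep)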

While the above is possible, I don't think that it is advisable. This type of thing is probably better done by loading the data into a database, then querying the database from within R.

Greg Snow
  • Reading one line at a time would be very inefficient; you'd be better off reading 10,000 lines (or more) at a time. – hadley Sep 13 '13 at 20:53
  • 1
  • @Hadley, My understanding is that behind the scenes R will read several rows, then just give you the number that you ask for; when you ask for more rows it gives them to you from its internal buffer, not rereading from disk (until you use up the buffer). But, I could be wrong. – Greg Snow Sep 13 '13 at 20:57
  • I'm pretty sure that's not the case, but you never know. – hadley Sep 15 '13 at 23:56
  • @Hadley, This post: http://tolstoy.newcastle.edu.au/R/help/05/12/18001.html suggests that the OS does the buffering. Either way I would still suggest the OP use the database approach rather than reading directly. – Greg Snow Sep 16 '13 at 15:41

You could also use JDBC to achieve this. Let's create a sample csv file.

write.table(x=mtcars, file="mtcars.csv", sep=",", row.names=F, col.names=T) # create example csv file

Download and save the CSV JDBC driver from this link: http://sourceforge.net/projects/csvjdbc/files/latest/download and then set up the driver.

> library(RJDBC)

> path.to.jdbc.driver <- "jdbc//csvjdbc-1.0-18.jar"
> drv <- JDBC("org.relique.jdbc.csv.CsvDriver", path.to.jdbc.driver)
> conn <- dbConnect(drv, sprintf("jdbc:relique:csv:%s", getwd())) # replace getwd() with location of csv file

Let's look at the top 3 rows in the mtcars dataset:

> head(dbGetQuery(conn, "select * from mtcars"), 3)
   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1   21   6  160 110  3.9  2.62 16.46  0  1    4    4
2   21   6  160 110  3.9 2.875 17.02  0  1    4    4
3 22.8   4  108  93 3.85  2.32 18.61  1  1    4    1

Next, let's see the distinct values that the column gear takes, along with their counts:

> dbGetQuery(conn, "select gear, count(*) from mtcars group by gear")
  GEAR COUNT(*)
1    4       12
2    3       15
3    5        5

Now you can compose a query with a WHERE clause to select only the rows where gear takes the value 5:

> dbGetQuery(conn, "select * from mtcars where gear = '5'")
   mpg cyl  disp  hp drat    wt qsec vs am gear carb
1   26   4 120.3  91 4.43  2.14 16.7  0  1    5    2
2 30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
3 15.8   8   351 264 4.22  3.17 14.5  0  1    5    4
4 19.7   6   145 175 3.62  2.77 15.5  0  1    5    6
5   15   8   301 335 3.54  3.57 14.6  0  1    5    8
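
Applied to the question's file, a minimal sketch (the file name and the `product_type` column are assumptions; the csvjdbc driver exposes each CSV file in the directory, minus its .csv extension, as a table) would be:

> prodA <- dbGetQuery(conn, "select * from Your_Big_CSV_File where product_type = 'A'")
> # ... process prodA, then run the same query for the next product type ...
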
Rahul Premraj