I have a CSV file that is 7GB. I actually can't open the file in R or Excel because of its giant size and therefore can't manipulate it in any way. Any thoughts or tricks on how to import such a large data set into R?
-
How much memory do you have? – Gregor Thomas Jul 15 '15 at 22:43
-
Not enough apparently. But I think that's irrelevant. I want to know what I can do BECAUSE I don't have enough memory to handle the job. I can't imagine R can handle such a large data set regardless of memory size anyway. Is that not the case? – theforestecologist Jul 15 '15 at 22:47
-
Please provide some information about the amount of memory you have, and the type of data contained in the CSV (numeric/character, if it's sparse or not). There's not a single answer to this problem. – R_User Jul 15 '15 at 22:47
-
4GB. The CSV file contains numerous types of data, including numeric and character I'd guess, based on the source info. I can't actually know for sure, though, because I currently can't access/open it in any of my programs. So I unfortunately can't provide more details about it, haha. – theforestecologist Jul 15 '15 at 22:52
-
Pretty tempted to close as a duplicate of [this question](http://stackoverflow.com/q/1875795/903061)... do you see anything differentiating your case from this case? Probably the best resource there (and one that is regularly updated) is the High Performance Computing [CRAN Task View](http://cran.r-project.org/web/views/HighPerformanceComputing.html), which has a section on packages for dealing with large-memory and out-of-memory data. – Gregor Thomas Jul 15 '15 at 22:52
-
Here's a start. Use `read.csv(..., nrows=1000)` to read only the first 1000 lines. You won't be able to load the entire file in memory, but you should be able to load a subset of it. – R_User Jul 15 '15 at 22:53
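For reference, a minimal sketch of that suggestion (the file name here is just a placeholder):

# Peek at the first 1000 rows only; the rest of the file is never read
preview <- read.csv("bigfile.csv", nrows = 1000)   # "bigfile.csv" is a placeholder path
str(preview)                                       # check column types and rough sizes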
-
Is there a way to determine the total number of rows of the data set so I could make a loop to read in all the data? For example, how would I know how many times I'd have to run `read.csv(..., nrows=1000)`? Also, is there a way to figure out how many rows I could read in at a time without guessing randomly and trying to force my machine to the max (taking forever!)? I'd like to streamline this whole process if possible. – theforestecologist Jul 15 '15 at 22:59
-
Gregor, this title is more intuitive and easier to find via search. Otherwise, those resources do seem helpful for my question. – theforestecologist Jul 15 '15 at 23:02
-
@theforestecologist - You simply cannot read all 7GB of data into memory on a machine that has 4GB of RAM using a loop. If you want to count the number of rows in the file, use the `countLines()` function (from the `R.utils` package). You could also measure the size of a data frame with `object.size()` as you go so that you can cut it off at a specific size (though this is extremely inefficient). I'd have to agree with @Gregor on this one - have a look at the CRAN Task View on out-of-memory data. – R_User Jul 15 '15 at 23:08
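A hedged sketch of those two suggestions, assuming the file is called "bigfile.csv" (a placeholder) and that the `R.utils` package is installed for `countLines()`:

library(R.utils)                          # provides countLines()

csv_file <- "bigfile.csv"                 # placeholder path
n_rows   <- countLines(csv_file) - 1      # total lines minus the header

# Process the file in chunks of 1000 rows; each chunk is used and then discarded,
# so the full 7GB never sits in memory at once. Note that re-reading from the
# start of the file for every chunk (via skip) is slow; this shows the idea,
# not the fastest approach.
header     <- names(read.csv(csv_file, nrows = 1))
chunk_size <- 1000
for (start in seq(1, n_rows, by = chunk_size)) {
  chunk <- read.csv(csv_file, header = FALSE, col.names = header,
                    skip = start, nrows = chunk_size)
  print(object.size(chunk))               # monitor how much memory each chunk uses
  # ... aggregate or write out what you need from `chunk` here ...
}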
-
Why not try `data.table`'s `fread` function? I work with several-GB CSVs on my laptop (with 16 GB of RAM). Ah, just saw you only have 4 GB on yours. You could try AWS? – bjoseph Jul 15 '15 at 23:09
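A minimal sketch of that suggestion (again with a placeholder file name); `fread` is much faster and more memory-efficient than `read.csv`, but the resulting table still has to fit in RAM, which is the sticking point with 4 GB:

library(data.table)
dt <- fread("bigfile.csv")   # placeholder path; use select =/nrows = to read only part of the file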
-
@theforestecologist closing this question as a duplicate doesn't delete it, it just turns it into a pointer to the other question to save the work of duplicating an answer. Agreed that your title is very easily searchable. – Gregor Thomas Jul 15 '15 at 23:59
-
@bjoseph The linked dupe question/answer is old (6 years!); a new answer mentioning AWS, Azure, and the like would probably be a good addition to that question's set of answers. – Gregor Thomas Jul 16 '15 at 00:00
-
Have you tried RevoScaleR to do this? – costebk08 Jul 16 '15 at 00:02
-
@Gregor agreed, but I'm worried that it would be too 'wiki-style' for a good StackOverflow answer. – bjoseph Jul 16 '15 at 15:05
2 Answers
Try:
library(RevoScaleR)
MyCsv <- file.path(rxGetOption("sampleDataDir"), "mycsv.csv")  # point this at your own CSV file
MyXdf <- "mycsv.xdf"
rxImport(inData = MyCsv, outFile = MyXdf)  # convert the CSV to a chunked .xdf file on disk
This converts the file into an .xdf file, which RevoScaleR loads and reads in chunks from disk rather than pulling everything into memory at once. However, just as an FYI, working with the data in this format will require the use of other functions from the RevoScaleR package.
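For example, a small sketch of what working with the .xdf might then look like (`rxGetInfo` and `rxSummary` are RevoScaleR functions; the object names come from the snippet above, and "somecolumn" is a placeholder variable name):

# Inspect the converted file without loading it into memory
rxGetInfo(MyXdf, getVarInfo = TRUE, numRows = 5)

# Chunk-wise summary statistics over the .xdf file
rxSummary(~ somecolumn, data = MyXdf)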

Consider an enterprise database to import and open the large CSV file. R can connect to any ODBC-compliant database, including MS Access, MySQL, SQL Server, SQLite, Oracle, etc., using the RODBC package. Relational, indexed databases scale well with large data. From there, you can even query a certain amount of that data into a data frame, selecting particular columns and filtering rows by various criteria.
library(RODBC)

# LISTS ALL INSTALLED ODBC SOURCES (DRIVERS AND DSNs)
odbcDataSources()

# SQL SERVER
myconn <- odbcDriverConnect(paste0('driver={SQL Server};server=SOMESERVER;database=DBName;',
                                   'trusted_connection=yes;UID=username;PWD=password'))
df1 <- sqlFetch(myconn, "dbo.Table1")                                 # WHOLE TABLES/VIEWS
df2 <- sqlQuery(myconn, "select * from dbo.Table1 where group = 1")   # SELECT QUERIES
close(myconn)

# MySQL (host/database details are configured in the DSN)
myconn <- odbcConnect('somedsn', uid = 'username', pwd = 'password')
df1 <- sqlFetch(myconn, "Table1")
df2 <- sqlQuery(myconn, "select * from Table1")
close(myconn)

# MS ACCESS
myconn <- odbcDriverConnect(paste0('driver={Microsoft Access Driver (*.mdb, *.accdb)};',
                                   'DBQ=C:\\PathTo\\DatabaseName.accdb'))
df1 <- sqlFetch(myconn, "Table1")
df2 <- sqlQuery(myconn, "select * from Table1")
close(myconn)
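As for getting the 7GB CSV into a database in the first place, here is one hedged sketch, using the DBI and RSQLite packages (a file-based SQLite database rather than RODBC, and placeholder file, table, and column names): append the file to a table chunk by chunk, then query only what you need.

library(DBI)
library(RSQLite)

con      <- dbConnect(SQLite(), "bigdata.sqlite")   # file-backed database (placeholder name)
csv_file <- "bigfile.csv"                           # placeholder path to the 7GB CSV
header   <- names(read.csv(csv_file, nrows = 1))

# Append the CSV to a table in 100,000-row chunks so it never fully loads into RAM.
# (Re-reading via skip is slow; a bulk loader or fread-based chunking is faster in practice.)
chunk_size <- 100000
start <- 1
repeat {
  chunk <- tryCatch(
    read.csv(csv_file, header = FALSE, col.names = header,
             skip = start, nrows = chunk_size),
    error = function(e) NULL                        # read past the end of the file
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  dbWriteTable(con, "bigtable", chunk, append = TRUE)
  start <- start + chunk_size
}

# Pull only the columns/rows you actually need into a data frame
df <- dbGetQuery(con, "SELECT col1, col2 FROM bigtable WHERE col3 = 1")
dbDisconnect(con)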
Additionally, one thing you learn in data handling in both statistical packages and databases is that columns are more resource-intensive than rows. Very wide datasets use up more memory than very long ones. For example, Excel's xlsx column limit is 2^14 (16,384), but its row limit is 2^20 (1,048,576). MySQL has a hard limit of 4,096 columns per table, SQL Server 1,024 for non-wide tables, and MS Access 255, while none of these databases imposes a declared limit on rows beyond storage and memory constraints.
For statistical packages (competitors to R), SAS and Stata have a limit of 32,767 variables, with workarounds depending on machine settings and edition. Astoundingly, IBM's SPSS can contain up to 2.15 billion variables! As for R, there is a historic limit of 2^31 - 1 elements for any vector, matrix, array, list, or data frame column (owing to the 32-bit indexing legacy), but as with Python data frames, dataset size and processing are ultimately a function of the computer's RAM capacity.
With that said, consider breaking your CSV file down, dropping unneeded columns, and removing long text fields that are not needed for analysis but still take up space in bytes.
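As a hedged sketch of that column-pruning advice (the file name, column count, and column positions below are placeholders): `read.csv` can skip columns entirely via `colClasses = "NULL"`, and `fread` via its `select`/`drop` arguments, so the unneeded columns never reach memory.

# Suppose the (placeholder) file has 10 columns and you only need columns 1, 3 and 4
csv_file <- "bigfile.csv"
keep     <- c(1, 3, 4)

# Base R: mark unwanted columns as "NULL" so they are skipped at read time;
# NA entries let read.csv determine the type of the kept columns as usual
classes <- rep("NULL", 10)
classes[keep] <- NA
df <- read.csv(csv_file, colClasses = classes, nrows = 100000)

# data.table alternative: the select argument does the same thing, much faster
# library(data.table)
# df <- fread(csv_file, select = keep)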