How do I read and clean a large (4GB) .csv file in R?

Question

I am trying to edit the column headings and remove some variables (columns) from a large .csv file. The file is just under 4GB and R isn't able to open it because the computer runs out of memory.

I have got the code to clean the data:

# Label all of the columns
# install.packages("plyr")  # run once if plyr is not installed
library(plyr)
house_all_name <- rename(house_all, c(V1 = "ID", V2 = "Price Paid", V3 = "Date Sold", V4 = "Post Code",
                V5 = "Property Type", V6 = "New Build?", V7 = "Tenure", V8 = "House Name/Number (PAON)",
                V9 = "SAON", V10 = "Street", V11 = "Locality", V12 = "Town/City", V13 = "District",
                V14 = "County", V15 = "PPD Category Type", V16 = "Record Status"))

# Remove the variables that are not needed (columns 1 and 8 to 16)
house_clean <- house_all_name[, -c(1, 8:16)]
str(house_clean)

I tried to read the file with the following code, but my computer became very slow and then ran out of memory:

house_all <- read.table("pp-complete.csv", header=FALSE, sep= ',', fill = TRUE)

So, to develop the cleaning code, I had to 'practice' on the first 5 rows:

house_all <- read.table("pp-complete.csv", header=FALSE, sep= ',', fill = TRUE, nrows = 5)

From my research, I believe it is possible to read the file line by line, but I don't know how!

Regards, Tommy

p.s. The data file can be found at http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-complete.csv

For the piece of your question that isn't too vague to address, I'd suggest that [this](http://stackoverflow.com/q/1727772/324364) is a duplicate, with the caveat that if by "it doesn't open" you mean you run out of memory, then the solution is get a machine with more memory, or only work with part of the data. — joran, Nov 14 '16 at 17:40
sorry first ever question! thank you for pointing me in the right direction. — tommylees112, Nov 14 '16 at 17:42
The duplicate system exists to help people more quickly/efficiently, not to make you feel bad. As you get to know the site, you'll get better at finding pre-existing stuff (usually searching Google is faster than search in-site). — joran, Nov 14 '16 at 17:44
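
The "line by line" idea from the question, combined with the suggestion to work with only part of the data, could be sketched roughly as below using base R only. This is a minimal sketch under several assumptions, not a tested solution for the real file: the function name clean_in_chunks, the output path, and the chunk size are all hypothetical; the file is assumed to have exactly 16 comma-separated columns, no header row, and no line breaks inside quoted fields. All kept columns are read as character so that each chunk gets the same types.

```r
# Hypothetical helper (not from the original post): stream a headerless CSV
# in chunks, keep columns 2 to 7, and append each cleaned chunk to a new file.
clean_in_chunks <- function(in_path, out_path, chunk_size = 100000L) {
  keep_names <- c("Price Paid", "Date Sold", "Post Code",
                  "Property Type", "New Build?", "Tenure")
  # "NULL" tells read.table to skip a column entirely (16 columns assumed):
  # drop column 1, keep columns 2 to 7 as character, drop columns 8 to 16.
  col_classes <- c("NULL", rep("character", 6), rep("NULL", 9))

  con <- file(in_path, open = "r")
  on.exit(close(con))
  first_chunk <- TRUE
  total_rows <- 0L

  repeat {
    # Read the next block of raw lines; character(0) means end of file.
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break

    # Parse only this block, so memory use is bounded by chunk_size.
    chunk <- read.table(text = lines, header = FALSE, sep = ",",
                        quote = "\"", fill = TRUE, comment.char = "",
                        colClasses = col_classes, stringsAsFactors = FALSE)
    names(chunk) <- keep_names

    # Write the header once, then append later chunks without it.
    write.table(chunk, out_path, sep = ",", row.names = FALSE,
                col.names = first_chunk, append = !first_chunk,
                qmethod = "double")
    first_chunk <- FALSE
    total_rows <- total_rows + nrow(chunk)
  }
  total_rows  # number of data rows written
}
```

A call such as clean_in_chunks("pp-complete.csv", "pp-clean.csv") would then write a smaller file containing only the six kept columns, which may fit in memory when read back. A smaller chunk_size lowers peak memory use at the cost of more passes through the loop.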
