
I have just been brought in on a project where we have several huge text files (each close to a gigabyte) of data that we want to put into tables and analyse. Each text file contains data from a single year, and each data point belongs to one of three categories. The end result we want is one table per category, containing the observations from every year stacked together.

The way it is done now is to read each file into a table, split it by category into three new tables for that year, and then rbind all tables of a given category across the years into the final tables. See below for the R file I was presented with (anonymised):

# read year 1, check the category counts, then split into three per-category tables
Year1 <- read.table(YearOneFilePath)
table(Year1$category)
Year1A <- Year1[Year1$category == "A",]
Year1B <- Year1[Year1$category == "B",]
Year1C <- Year1[Year1$category == "C",]
rm(Year1)
Year2 <- read.table(YearTwoFilePath)
table(Year2$category)
Year2A <- Year2[Year2$category == "A",]
Year2B <- Year2[Year2$category == "B",]
Year2C <- Year2[Year2$category == "C",]
rm(Year2)
Year3 <- read.table(YearThreeFilePath)
table(Year3$category)
Year3A <- Year3[Year3$category == "A",]
Year3B <- Year3[Year3$category == "B",]
Year3C <- Year3[Year3$category == "C",]
rm(Year3)


# combine each category across the years
A <- rbind(Year1A, Year2A, Year3A)
B <- rbind(Year1B, Year2B, Year3B)
C <- rbind(Year1C, Year2C, Year3C)

rm(Year1A)
rm(Year2A)
rm(Year3A)
rm(Year1B)
rm(Year2B)
rm(Year3B)
rm(Year1C)
rm(Year2C)
rm(Year3C)

This seems to me like it reads all the data from the files and then copies it twice while moving it around, which with this much data takes a long time and a lot of memory. Obviously I can get around the YearXY objects by putting YearX[YearX$category == "Y",] directly into the rbind call, but that still means that at some point during execution I have two full copies of everything in memory. Is there a way to build the final A, B and C tables from the files with only one read-through of each file and without copying all the data an additional time?
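For instance, something like this (keeping the same placeholder names) skips the intermediate YearXY objects, though the subsets still exist alongside the full tables while rbind runs:

A <- rbind(Year1[Year1$category == "A", ],
           Year2[Year2$category == "A", ],
           Year3[Year3$category == "A", ])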

Arthur
  • Are the actual R objects really that big? It may be 1 GB per file on your drive but much less once loaded into R, especially if you have few distinct values for each field. You can check the size in the environment panel in RStudio or use the function `object.size`. – moodymudskipper Jun 06 '17 at 11:48
  • Consider using `rbindlist` to speed up your last steps: [link](https://stackoverflow.com/questions/15673550/why-is-rbindlist-better-than-rbind) – moodymudskipper Jun 06 '17 at 11:55
  • And maybe [this](https://stackoverflow.com/questions/9573055/r-selecting-subset-without-copying) if loading the tables several times is OK, as long as you don't create copies. – moodymudskipper Jun 06 '17 at 12:00
  • @Moody_Mudskipper The biggest table ends up being just above one gigabyte. The smallest one is about 0.2GB. So it would be nice to not have to copy all of it. But it turns out that the file reading itself is by far the biggest time sink, and that seems like a rather unavoidable step. – Arthur Jun 06 '17 at 12:10
  • You could read the files in chunks with scan in a loop, append the relevant part of each chunk to three lists, and at the end run rbindlist on each of those lists to get your final tables (let me know if it isn't clear; see the sketch below). – moodymudskipper Jun 06 '17 at 12:22
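A minimal sketch of that chunked idea, using `fread` with `skip`/`nrows` rather than `scan` for simplicity, and assuming each file is a regular delimited table with a header row (the chunk size and file-path names are placeholders):

library(data.table)

header <- names(fread(YearOneFilePath, nrows = 0))   # read the column names once
chunk_size <- 1e6                                     # rows per chunk, tune to taste
parts <- list(A = list(), B = list(), C = list())     # chunks collected per category
skip <- 1                                             # skip the header row

repeat {
  chunk <- tryCatch(
    fread(YearOneFilePath, skip = skip, nrows = chunk_size,
          header = FALSE, col.names = header),
    error = function(e) data.table()                  # reading past the end of the file
  )
  if (nrow(chunk) == 0) break
  for (lev in names(parts)) {
    parts[[lev]][[length(parts[[lev]]) + 1]] <- chunk[category == lev]
  }
  skip <- skip + nrow(chunk)
}

# repeat the loop for the other years' files, appending to the same three
# lists, then combine each category exactly once at the end
A <- rbindlist(parts$A)
B <- rbindlist(parts$B)
C <- rbindlist(parts$C)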

1 Answer

 library(data.table)

Year1 <- fread(YearOneFilePath)   # fread is much faster than read.table on big files
Year1[, .N, by = category]        # counts per category, the data.table way
Year1A <- Year1[category == "A"]  # data.table subsetting; no Year1$ or trailing commas needed
Year1B <- Year1[category == "B"]
Year1C <- Year1[category == "C"]
rm(Year1)
gc()
# yes, garbage collection may help ;)
A <- rbind(Year1A, Year2A, Year3A)
rm(Year1A)
rm(Year2A)
rm(Year3A)
gc()
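As the comments note, `rbindlist` from data.table is usually faster than `rbind` for that last combining step:

A <- rbindlist(list(Year1A, Year2A, Year3A))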

For splitting, here is one more method:

split_list1 <- split(Year1, Year1$category)
Year1A <- split_list1[["A"]]   # index by name rather than position
Year1B <- split_list1[["B"]]
Year1C <- split_list1[["C"]]

Also see the question "split data table to small tables R".
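One way to carry the `split()` idea across all three years (a sketch that assumes the per-year tables are already loaded, so it trades memory for simplicity):

split1 <- split(Year1, Year1$category)
split2 <- split(Year2, Year2$category)
split3 <- split(Year3, Year3$category)

A <- rbindlist(list(split1[["A"]], split2[["A"]], split3[["A"]]))
B <- rbindlist(list(split1[["B"]], split2[["B"]], split3[["B"]]))
C <- rbindlist(list(split1[["C"]], split2[["C"]], split3[["C"]]))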

Ajay Ohri