I have just been brought in on a project where we have several huge text files (each close to a gigabyte) of data that we want to put into tables and analyse. Each text file consists of data from a single year, and each data point is from one of three categories, and the end result we want is one list for each of the categories, containing observations from each year as columns.
The way it is done now is to read each of the files into a data frame, then split each of those by category into three new data frames per year, and finally rbind all the data frames of a given category across years into the final tables. See below for the R file I was presented with (anonymised):
Year1 <- read.table(YearOneFilePath)
table(Year1$category)
Year1A <- Year1[Year1$category == "A",]
Year1B <- Year1[Year1$category == "B",]
Year1C <- Year1[Year1$category == "C",]
rm(Year1)
Year2 <- read.table(YearTwoFilePath)
table(Year2$category)
Year2A <- Year2[Year2$category == "A",]
Year2B <- Year2[Year2$category == "B",]
Year2C <- Year2[Year2$category == "C",]
rm(Year2)
Year3 <- read.table(YearThreeFilePath)
table(Year3$category)
Year3A <- Year3[Year3$category == "A",]
Year3B <- Year3[Year3$category == "B",]
Year3C <- Year3[Year3$category == "C",]
rm(Year3)
A <- rbind(Year1A, Year2A, Year3A)
B <- rbind(Year1B, Year2B, Year3B)
C <- rbind(Year1C, Year2C, Year3C)
rm(Year1A)
rm(Year2A)
rm(Year3A)
rm(Year1B)
rm(Year2B)
rm(Year3B)
rm(Year1C)
rm(Year2C)
rm(Year3C)
This seems to me like it reads all the data from the files and then copies it twice while moving it around, which with amounts of data this large takes a long time and a lot of memory. Obviously I can get rid of the intermediate YearXY data frames by putting YearX[YearX$category == "Y", ] directly into the rbind call, but that still means that at some point during execution I hold two full copies of everything in memory. Is there a way to build the final A, B and C tables from the files with only one read-through of each file and without copying all the data an additional time?