
I have a large dataset that I would like to run some code against piece by piece, because it is too large for my PC to process in one go.

Here's my code so far... My dataset has columns gene, month, and count.

library(dplyr)

df <- read.table(file = "/Users/x/x.txt", 
                 header = TRUE, sep = ",", fill = TRUE, comment.char = "")

count_by_gene <- 
  df %>%
  group_by(gene) %>%
  summarize(count = n())

I'm unable to import the file because the dataset is too large. Is there any way to do this piece by piece and create a separate table (count_by_gene) for each piece?

2 Answers


You can try file.split from the NCmisc package:

library(NCmisc)

orig.dir <- getwd(); setwd(tempdir())   # move to a temporary dir for the demo

# create a small demo file (with real data, point file.name at your own file instead)
file.name <- "x.txt"
writeLines(fakeLines(max.lines = 1000), con = file.name)

# split the file into pieces of 50 lines each; returns the new file names
new.files <- file.split(file.name, size = 50)

unlink(new.files); unlink(file.name)    # clean up the demo files
setwd(orig.dir)                         # reset working dir to original
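
The question also asks for a count table per piece; as a possible follow-up (not part of the original answer), you could loop over the files returned by file.split (applied to your real file rather than the demo file above) with the same dplyr code from the question, then re-aggregate. A minimal sketch, assuming each piece can be read with the same read.table call as the original file (if only the first piece keeps the header, read the others with header = FALSE and explicit col.names):

library(dplyr)

# new.files holds the piece file names returned by file.split above
piece_counts <- lapply(new.files, function(f) {
  read.table(f, header = TRUE, sep = ",", fill = TRUE, comment.char = "") %>%
    group_by(gene) %>%
    summarize(count = n())        # one count_by_gene table per piece
})

# combine the per-piece tables and re-aggregate to get overall counts
count_by_gene <- bind_rows(piece_counts) %>%
  group_by(gene) %>%
  summarize(count = sum(count))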
TarJae

Hello, from what you say you only need to load the gene column to count by gene. For a big data frame I would also point you to the data.table package, which is more efficient at reading CSV files and processing them. If the gene names are strings, then loading them as a factor (factors are stored as integers) will further reduce the memory footprint.

library(data.table)

# fread is data.table's read.table. It is also smart at detecting
# separators; otherwise you can still provide them as parameters
dt <- fread("yourfile.txt", select = "gene", stringsAsFactors = TRUE)

# here we group by "gene" values and compute the count of each group
# (using the .N pseudo variable); the empty comma at the beginning
# means that we want all rows
count_by_gene <- dt[, list(count = .N), by = "gene"]

If it is still too big, and provided that you can split the file into several chunks (using for example the hints from Split CSV files into smaller files but keeping the headers?, as it seems you are on a Unix-like system), then you can compute counts per chunk and merge the results with the following code:

file_parts <- c("fic1.txt", "fic2.txt", .... )

# compute counts for each part
parts_counts <- lapply(file_parts, function(file) {
  dt <- fread(file, select = "gene", stringsAsFactors = TRUE)
  dt[, list(count = .N), by = "gene"]
})

# merge the part counts into a single table
merged_parts_counts <- rbindlist(parts_counts)

# then the total count is the sum of the part counts
gene_counts <- merged_parts_counts[, list(count = sum(count)), by = "gene"]
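
If you would rather not split the file on disk at all, another option (not from the original answer, and using readr rather than data.table) is readr's chunked reader, which applies a callback to each chunk as it is streamed in. A minimal sketch, assuming a comma-separated file with a header row and a gene column; only the small per-chunk count tables accumulate in memory, never the raw rows:

library(readr)
library(dplyr)

# count genes within each chunk; DataFrameCallback row-binds the per-chunk results
per_chunk <- DataFrameCallback$new(function(chunk, pos) {
  chunk %>% group_by(gene) %>% summarize(count = n())
})

# chunk_size is the number of rows held in memory at a time; tune it to your machine
chunk_counts <- read_csv_chunked("yourfile.txt", per_chunk, chunk_size = 100000)

# re-aggregate across chunks to get the final per-gene counts
count_by_gene <- chunk_counts %>%
  group_by(gene) %>%
  summarize(count = sum(count))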

It would also be worth having a look at the hdd package (Easy Manipulation of Out of Memory Data Sets), which looks like what you are looking for.

Billy34