
I have a large dataset that I would like to run some code against piece by piece, because it is too large for my PC to process in one go.

Here's my code so far... My dataset has columns gene, month, and count.

library(dplyr)

df <- read.table(file = "/Users/x/x.txt", 
                 header = TRUE, sep = ",", fill = TRUE, comment.char = "")

count_by_gene <- 
  df %>%
  group_by(gene) %>%
  summarize(count = n())

I'm unable to import the file because the dataset is too large. Is there any way to do this piece by piece and create a separate table (count_by_gene) for each piece?

2 Answers


You can try file.split from the NCmisc package:

library(NCmisc)

orig.dir <- getwd(); setwd(tempdir())   # move to a temporary dir for the demo

# create a small demo file (with real data, point file.name at your own file instead)
file.name <- "x.txt"
writeLines(fakeLines(max.lines = 1000), con = file.name)

# split the file into pieces of 50 lines each; returns the new file names
new.files <- file.split(file.name, size = 50)

unlink(new.files); unlink(file.name)    # clean up the demo files
setwd(orig.dir)                         # reset working dir to original
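
The question also asks for a count table per piece; as a possible follow-up (not part of the original answer), you could loop over the files returned by file.split (applied to your real file rather than the demo file above) with the same dplyr code from the question, then re-aggregate. A minimal sketch, assuming each piece can be read with the same read.table call as the original file (if only the first piece keeps the header, read the others with header = FALSE and explicit col.names):

library(dplyr)

# new.files holds the piece file names returned by file.split above
piece_counts <- lapply(new.files, function(f) {
  read.table(f, header = TRUE, sep = ",", fill = TRUE, comment.char = "") %>%
    group_by(gene) %>%
    summarize(count = n())        # one count_by_gene table per piece
})

# combine the per-piece tables and re-aggregate to get overall counts
count_by_gene <- bind_rows(piece_counts) %>%
  group_by(gene) %>%
  summarize(count = sum(count))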
TarJae

Hello, from what you say you only need to load the gene column to count by gene. For a big data frame I would also point you to the data.table package, which is more efficient at reading CSV files and processing them. If the gene names are strings, then loading them as a factor (factors are stored as integers) will further reduce the memory footprint.

library(data.table)

# fread is data.table's read.table. It is also smart at detecting
# separators; otherwise you can still provide them as parameters
dt <- fread("yourfile.txt", select = "gene", stringsAsFactors = TRUE)

# here we group by "gene" values and compute the count of each group
# (using the .N pseudo variable); the empty comma at the beginning
# means that we want all rows
count_by_gene <- dt[, list(count = .N), by = "gene"]

If it is still too big, and provided that you can split the file into several chunks (using for example the hints from Split CSV files into smaller files but keeping the headers?, as it seems you are on a Unix-like system), then you can compute counts per chunk and merge the results with the following code:

file_parts <- c("fic1.txt", "fic2.txt", .... )

# compute counts for each part
parts_counts <- lapply(file_parts, function(file) {
  dt <- fread(file, select = "gene", stringsAsFactors = TRUE)
  dt[, list(count = .N), by = "gene"]
})

# merge the part counts into a single table
merged_parts_counts <- rbindlist(parts_counts)

# then the total count is the sum of the part counts
gene_counts <- merged_parts_counts[, list(count = sum(count)), by = "gene"]
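
If you would rather not split the file on disk at all, another option (not from the original answer, and using readr rather than data.table) is readr's chunked reader, which applies a callback to each chunk as it is streamed in. A minimal sketch, assuming a comma-separated file with a header row and a gene column; only the small per-chunk count tables accumulate in memory, never the raw rows:

library(readr)
library(dplyr)

# count genes within each chunk; DataFrameCallback row-binds the per-chunk results
per_chunk <- DataFrameCallback$new(function(chunk, pos) {
  chunk %>% group_by(gene) %>% summarize(count = n())
})

# chunk_size is the number of rows held in memory at a time; tune it to your machine
chunk_counts <- read_csv_chunked("yourfile.txt", per_chunk, chunk_size = 100000)

# re-aggregate across chunks to get the final per-gene counts
count_by_gene <- chunk_counts %>%
  group_by(gene) %>%
  summarize(count = sum(count))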

It would also be worth having a look at the hdd package (Easy Manipulation of Out of Memory Data Sets), which looks like what you are looking for.

Billy34