Hello, from what you say you only need to load the gene column to count by gene. For a big data frame I would also point you to the data.table package, which will be more efficient both at reading the CSV and at processing it. If the gene names are strings, then loading them as a factor (factors are stored as integers) will further reduce the memory footprint.
library(data.table)

# fread is data.table's fast equivalent of read.table. It is also smart
# at detecting separators, otherwise you can still provide them as parameters
dt <- fread("yourfile.txt", select = "gene", stringsAsFactors = TRUE)

# here we group by "gene" values and compute for each group the
# count (using the .N special symbol)
# the empty spot before the first comma means that we want all rows
count_by_gene <- dt[, list(count = .N), by = "gene"]
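Just to illustrate what the grouped result looks like, here is a tiny self-contained sketch with made-up gene names (the names and values are purely hypothetical):

library(data.table)

# small hypothetical input, only to show the shape of the result
demo <- data.table(gene = factor(c("BRCA1", "TP53", "BRCA1", "EGFR", "TP53", "BRCA1")))
demo[, list(count = .N), by = "gene"]
# roughly:
#     gene count
# 1: BRCA1     3
# 2:  TP53     2
# 3:  EGFR     1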
If it is still too big, provided that you can split the file into several chunks (using for example the hints from Split CSV files into smaller files but keeping the headers?, since it seems you are using Linux), you can then merge the per-chunk results with the following code:
# the chunk files produced by the split
file_parts <- c("fic1.txt", "fic2.txt", ...)

# compute counts for each part
parts_counts <- lapply(file_parts, function(file) {
  dt <- fread(file, select = "gene", stringsAsFactors = TRUE)
  dt[, list(count = .N), by = "gene"]
})

# merge the per-part counts into a single table
merged_parts_counts <- rbindlist(parts_counts)

# then the total count per gene is the sum of its per-part counts
gene_counts <- merged_parts_counts[, list(count = sum(count)), by = "gene"]
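If you then want the most frequent genes first, or want to keep the aggregated table on disk, here is a small optional sketch (the output file name is just an example):

# sort in place so that the most frequent genes come first
setorder(gene_counts, -count)

# fwrite is data.table's fast CSV writer
fwrite(gene_counts, "gene_counts.csv")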
It would also be worth having a look at the hdd package (Easy Manipulation of Out of Memory Data Sets), which seems to be exactly what you are looking for.