
I previously posted a question to be solved using R (subset recursively a data.frame), but the file is so huge that I need a lot of time and RAM to read it. I wonder if I could use pandas in Python to do the same, since I am a newbie to Python and pandas seems similar to R, at least in its syntax. Here is a summary of my previous post:

PREVIOUS POST: I have a tab-delimited file with close to 15 million rows, and its size is 27 GB. I need an efficient way to subset the data based on two criteria. I could do this in a for loop, but I was wondering if there is a more elegant, and obviously more efficient, way to do it. The data.frame looks like this:

SNP         CHR     BP          P
rs1000000   chr1    126890980   0.000007
rs10000010  chr4    21618674    0.262098    
rs10000012  chr4    1357325     0.344192
rs10000013  chr4    37225069    0.726325    
rs10000017  chr4    84778125    0.204275    
rs10000023  chr4    95733906    0.701778
rs10000029  chr4    138685624   0.260899
rs1000002   chr3    183635768   0.779574
rs10000030  chr4    103374154   0.964166    
rs10000033  chr2    139599898   0.111846    
rs10000036  chr4    139219262   0.564791
rs10000037  chr4    38924330    0.392908    
rs10000038  chr4    189176035   0.971481    
rs1000003   chr3    98342907    0.000004
rs10000041  chr3    165621955   0.573376
rs10000042  chr3    5237152     0.834206    
rs10000056  chr4    189321617   0.268479
rs1000005   chr1    34433051    0.764046
rs10000062  chr4    5254744     0.238011    
rs10000064  chr4    127809621   0.000044
rs10000068  chr2    36924287    0.000003
rs10000075  chr4    179488911   0.100225    
rs10000076  chr4    183288360   0.962476
rs1000007   chr2    237752054   0.594928
rs10000081  chr1    17348363    0.517486    
rs10000082  chr1    167310192   0.261577    
rs10000088  chr1    182605350   0.649975
rs10000092  chr4    21895517    0.000005
rs10000100  chr4    19510493    0.296693    

The first thing I need to do is to select those SNPs with a P value lower than a threshold, then order this subset by CHR and BP. Once I have this subset, I need to fetch all the SNPs that fall into a 500,000 window up and down from each significant SNP; this step defines a region. I need to do it for all the significant SNPs and store each region in a list or something similar for further analysis. For example, in the displayed data frame the most significant SNP (i.e. below a threshold of 0.001) for CHR==chr1 is rs1000000 and for CHR==chr4 is rs10000092. These two SNPs would thus define two regions, and in each of these regions I need to fetch the SNPs that fall within 500,000 up and down from the BP of the most significant SNP.

The R solution provided by @eddi and @rafaelpereira is the following:

library(data.table) # v1.9.7 (devel version)
df <- fread("C:/folderpath/data.csv") # load your data
setDT(df) # convert your dataset into data.table
#1st step
# Filter data under threshold 0.05 and sort by CHR, BP
df <- df[P < 0.05, ][order(CHR, BP)]
#2nd step
df[, {idx = (1:.N)[which.min(P)]
  SNP[seq(max(1, idx - 5e5), min(.N, idx + 5e5))]}, by = CHR]
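
For reference, a rough pandas equivalent of this data.table logic (a sketch only: it assumes the full file fits in memory and, like the R code, counts the 5e5 window in rows rather than in base pairs):

import pandas as pd

# illustrative only: 'data.csv' stands in for the real tab-delimited file
df = pd.read_csv('data.csv', sep='\t', usecols=['SNP', 'CHR', 'BP', 'P'])

# 1st step: filter P < 0.05, then sort by CHR and BP
sub = df[df['P'] < 0.05].sort_values(['CHR', 'BP']).reset_index(drop=True)

# 2nd step: per chromosome, keep the SNPs within 5e5 rows of the minimum-P SNP
regions = {}
for chrom, grp in sub.groupby('CHR'):
    grp = grp.reset_index(drop=True)
    idx = grp['P'].idxmin()
    regions[chrom] = grp.loc[max(0, idx - 500000):idx + 500000, 'SNP']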
  • does the whole DF fit into RAM? And 27 GB - is it a size of DF in memory or is it a CSV file size? What is `POS` - or is it `BP` in your sample DF? – MaxU - stand with Ukraine Jun 06 '16 at 15:39
  • Sorry @MaxU, I have edited the post to make it clearer. The size of the file is 27GB and it fits in memory, but only when I have all of the memory available to me (I am doing it on a small server). Because of that I am trying to find a more efficient way, since I have several files like this one. – user2380782 Jun 06 '16 at 15:42
  • I guess your CSV file has many more columns besides `['SNP', 'CHR', 'BP', 'P']` - is this correct? I made a small test - generated a very similar DF with 4 columns and 20M rows - it took 534MB in RAM and 861MB as a CSV file. In this case you should definitely use `usecols=['SNP', 'CHR', 'BP', 'P']` when reading your DF, thus you will NOT read the columns which you don't need – MaxU - stand with Ukraine Jun 06 '16 at 19:31
  • I've updated my answer - please check – MaxU - stand with Ukraine Jun 06 '16 at 21:20

1 Answer


First of all, I would strongly recommend switching from CSV files to PyTables (HDF storage) and storing your DF sorted by ['SNP','BP'] if possible, as it's orders of magnitude faster, allows conditional selects (see the where parameter) and usually takes less space - see this comparison.

HDF store recipes
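
One side note before the full script: if the real 27 GB file is too big to concatenate in RAM even with usecols, the HDF store can also be built by appending the CSV chunks directly. This is only a rough sketch (the output path and the min_itemsize values are assumptions), and the resulting store is not sorted, which is fine as long as you query it through where conditions on data columns:

import pandas as pd

csv_path = r'c:/tmp/file_8_cols.csv'      # same CSV as in the script below
store_path = r'c:/tmp/file_appended.h5'   # hypothetical separate store
cols = ['SNP', 'CHR', 'BP', 'P']

with pd.HDFStore(store_path, mode='w') as store:
    for chunk in pd.read_csv(csv_path, usecols=cols, chunksize=10**6):
        # min_itemsize reserves space for the longest strings that may
        # show up in later chunks (the values here are guesses)
        store.append('test', chunk, data_columns=['CHR', 'BP', 'P'],
                     min_itemsize={'SNP': 15, 'CHR': 6})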

Here is a working example script, which does the following:

  1. generate a sample DF (20M rows, 8 columns: 'SNP', 'CHR', 'BP', 'P', 'SNP2', 'CHR2', 'BP2', 'P2'). I intentionally doubled the number of columns, as I think your CSV has many more columns, because the size of my generated CSV file with 20M rows and 8 columns is only 1.7GB.
  2. save generated DF to CSV file (File size: 1.7 GB)
  3. read CSV file in 1M-rows chunks (read only first 4 columns)
  4. sort DF by ['CHR','BP'] and save result as PyTable (.h5)
  5. read from HDF store only those rows where P < threshold
  6. read from the HDF store all rows between min(index) - 500K rows and max(index) + 500K rows of the selection from step 5 - you might want to improve this part (a rough BP-based sketch follows the sample data below)

Code:

import numpy as np
import pandas as pd

##############################
# generate sample DF
##############################
rows = 2*10**7
chr_lst = ['chr{}'.format(i) for i in range(1,10)]

df = pd.DataFrame({'SNP': np.random.randint(10**6, 10**7, rows).astype(str)})
df['CHR'] = np.random.choice(chr_lst, rows)
df['BP'] = np.random.randint(10**6, 10**9, rows)
df['P'] = np.random.rand(rows)
df.SNP = 'rs' + df.SNP

"""
# NOTE: sometimes it gives me MemoryError !!!
# because of that i did it "column-by-column" before
df = pd.DataFrame({
    'SNP': np.random.randint(10**6, 10**7, rows).astype(str),
    'CHR': np.random.choice(chr_lst, rows),
    'BP':  np.random.randint(10**6, 10**9, rows),
    'P':   np.random.rand(rows)
}, columns=['SNP','CHR','BP','P'])

df.SNP = 'rs' + df.SNP
"""

# make 8 columns out of 4 ...
df = pd.concat([df]*2, axis=1)
df.columns = ['SNP', 'CHR', 'BP', 'P', 'SNP2', 'CHR2', 'BP2', 'P2']

##############################
# store DF as CSV file
##############################
csv_path = r'c:/tmp/file_8_cols.csv'
df.to_csv(csv_path, index=False)

##############################
# read CSV file (only needed cols) in chunks
##############################
csv_path = r'c:/tmp/file_8_cols.csv'
cols = ['SNP', 'CHR', 'BP', 'P']
chunksize = 10**6

df = pd.concat([x for x in pd.read_csv(csv_path, usecols=cols,
                                       chunksize=chunksize)],
               ignore_index=True )

##############################
# sort DF and save it as .h5 file
##############################
store_path = r'c:/tmp/file_sorted.h5'
store_key = 'test'
(df.sort_values(['CHR','BP'])
   .to_hdf(store_path, store_key, format='t', mode='w', data_columns=True)
)


##############################
# read HDF5 file in chunks
##############################
store_path = r'c:/tmp/file_sorted.h5'
store_key = 'test'
chunksize = 10**6

store = pd.HDFStore(store_path)

threshold = 0.001
store_condition = 'P < %s' % threshold

i = store.select(key=store_key, where=store_condition)

# select all rows between `min(SNP) - 500K` and `max(SNP) + 500K`
window_size = 5*10**5
start = max(0, i.index.min() - window_size)
stop = min(store.get_storer(store_key).nrows, i.index.max() + window_size)
df = pd.concat([x for x in store.select(store_key, chunksize=chunksize,
                                        start=start, stop=stop)])

# close the store before exiting...
store.close()

sample data:

In [39]: df.head(10)
Out[39]:
                SNP   CHR       BP         P
18552732  rs8899557  chr1  1000690  0.764227
3837818   rs1883864  chr1  1000916  0.145544
13055060  rs2403233  chr1  1001591  0.116835
9303493   rs5567473  chr1  1002297  0.409937
14370003  rs1661796  chr1  1002523  0.322398
9453465   rs8222028  chr1  1004318  0.269862
2611036   rs9514787  chr1  1004666  0.936439
10378043  rs3345160  chr1  1004930  0.271848
16149860  rs4245017  chr1  1005219  0.157732
667361    rs3270325  chr1  1005252  0.395261
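
As mentioned in step 6, the last selection above works on row positions rather than genomic coordinates. One possible refinement (a sketch only, not part of the original answer) is to query the store once per significant SNP and use the CHR and BP data columns (made queryable by data_columns=True above) to pull each ±500,000 bp region directly:

import pandas as pd

store_path = r'c:/tmp/file_sorted.h5'
store_key = 'test'
threshold = 0.001
window = 5*10**5

regions = []  # one DataFrame per significant SNP
with pd.HDFStore(store_path) as store:
    hits = store.select(store_key, where='P < %s' % threshold)
    for _, hit in hits.iterrows():
        lo, hi = hit['BP'] - window, hit['BP'] + window
        cond = 'CHR == %r & BP >= %s & BP <= %s' % (str(hit['CHR']), lo, hi)
        regions.append(store.select(store_key, where=cond))

This issues one select per significant SNP, so if there are many hits it may be worth grouping them by chromosome and running one query per CHR instead.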
  • sorry for not replying sooner but I was traveling. I can store .h5 files. I have tried your example and it worked quite well, I will apply it to all the files, thanks a lot!! – user2380782 Jun 06 '16 at 22:36