Merging large dataframe with smaller one (the large one is provided as buffer)

Question

I'm new to pandas and would appreciate your help.

I have two files that I need to merge on some columns; one of them is really big (100 GB+). Because I skip some lines at the start of the big file, I pass it to the read_csv method as an open file buffer.

First, I tried pandas. However, when I tried to read the file with pandas, the process was killed by the OS.

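with open(self.all_file, 'r') as f:
    # Track where each line starts so we can rewind to the first non-'##' line.
    pos = f.tell()  # set before the loop so it is defined even with no '##' lines
    line = f.readline()
    while line.startswith('##'):
        pos = f.tell()
        line = f.readline()
    f.seek(pos)
    return pd.read_csv(f, sep='\t')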

Afterwards, I tried dask instead of pandas. However, dask's read_csv method does not accept a buffer as input, so it fails.

    return dd.read_csv(f,sep='\t')

How can I open a large file as a buffer and merge the two dataframes?

Thank you!

Performing a merge is straightforward with dask. For example, [this SO post](https://stackoverflow.com/a/54467495/4057186) shows how to do this. — edesz, Apr 22 '19 at 00:36
Why does the large `DataFrame` have to be provided as a buffer? Is this a requirement? Or, can you just read the file using dask `.read_csv` directly? — edesz, Apr 22 '19 at 00:37
Could you please provide a sample (first 5 rows) of each file? — edesz, Apr 22 '19 at 00:39
@edesz I provide it as a buffer because I am skipping a couple of lines (VCF header lines) — Dana Blanc, Apr 22 '19 at 06:33

edesz · Answer 1 · 2019-04-22T17:31:31.330

IIUC:

- you know the line numbers that you want to skip
- since these are VCF header lines, they occur only at the start of the file

So, you can still use dd.read_csv, since it accepts keywords from pandas.read_csv such as skiprows.

See this SO post for a pandas example with skiprows:
- if skiprows is an integer (e.g. 2), then .read_csv will skip the first 2 lines of the file
- if skiprows is a list of integers (e.g. [2, 3]), then .read_csv will skip exactly those line numbers in the .csv file, counting from line number 0 (so [2, 3] skips the third and fourth lines)
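
To make the two forms of skiprows concrete, here is a minimal sketch using plain pandas on a small in-memory string. The four-line "file" and the column names a and b are made up for illustration; they are not taken from the question.

```python
import io

import pandas as pd

# Hypothetical four-line file: two junk lines, a header row, one data row.
text = "junk 0\njunk 1\na\tb\n1\t2\n"

# Integer form: skip the first 2 lines, so the third line ("a<TAB>b") becomes the header.
df_int = pd.read_csv(io.StringIO(text), sep="\t", skiprows=2)

# List form: skip exactly the 0-based line numbers 0 and 1 (same result for this input).
df_list = pd.read_csv(io.StringIO(text), sep="\t", skiprows=[0, 1])

print(df_int)
print(df_list)
```

The two calls only agree here because the skipped lines are the first two; a list such as [2, 3] would instead keep "junk 0" as the header row.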

So, you can read both files into dask DataFrames:

import dask.dataframe as dd

df_1 = dd.read_csv('file_1.csv', skiprows=2, sep='\t')  # skip the first 2 lines (0-based line numbers 0, 1)
df_2 = dd.read_csv('file_2.csv', skiprows=[10, 16])  # skip 0-based line numbers 10, 16 (the 11th and 17th lines)
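
If the number of header lines is not known in advance, one option is to count the leading '##' lines first and pass that count as skiprows. The following is only a sketch under that assumption: count_meta_lines and read_vcf_with_dask are hypothetical helper names, and it assumes all '##' lines sit at the top of the file, followed by the column header line.

```python
from itertools import takewhile


def count_meta_lines(path, prefix="##"):
    """Return how many consecutive lines at the top of `path` start with `prefix`."""
    with open(path, "r") as f:
        # takewhile stops at the first line that does not start with the prefix,
        # so only the header of the file is read, not the whole 100 GB.
        return sum(1 for _ in takewhile(lambda line: line.startswith(prefix), f))


def read_vcf_with_dask(path):
    """Lazily read a tab-separated file, skipping the leading '##' lines."""
    import dask.dataframe as dd  # imported here so the counter works without dask

    return dd.read_csv(path, sep="\t", skiprows=count_meta_lines(path))
```

With this approach the first line that does not start with '##' (in a VCF file, typically the '#CHROM ...' line) is used as the column header, which matches what the seek-based code in the question was doing.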

Then merge the 2 DataFrames with dask's .merge:

df_merged = dd.merge(df_1, df_2, left_on='abcd', right_on='abcde')
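
To show what left_on and right_on do when the key columns have different names, here is a small sketch with plain pandas, whose merge takes the same two keywords. The frames and the extra columns x and y are invented for illustration; only the key names abcd and abcde come from the line above.

```python
import pandas as pd

# Toy stand-ins for the two files; the values are made up.
left = pd.DataFrame({"abcd": [1, 2, 3], "x": ["a", "b", "c"]})
right = pd.DataFrame({"abcde": [2, 3, 4], "y": [20, 30, 40]})

# Default how="inner": only keys present in both frames (2 and 3) survive.
merged = pd.merge(left, right, left_on="abcd", right_on="abcde")

print(merged)
```

Because the key columns are named differently, both abcd and abcde appear in the result; one of them can be dropped afterwards if it is redundant.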

If this is what you are asking for, then you don't need to use a buffer.
