
I'm new to pandas and would appreciate your help.

I have two files, one of which is very large (100 GB+), that I need to merge on some columns. I need to skip some lines at the start of the big file, so I pass the file to the read_csv method as a buffer.

First, I tried pandas. However, when I tried to open the file with pandas, the process was killed by the OS.

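with open(self.all_file, 'r') as f:
    pos = f.tell()  # start of file, in case there are no '##' header lines
    line = f.readline()
    while line.startswith('##'):
        pos = f.tell()  # offset of the line that is about to be read
        line = f.readline()
    f.seek(pos)  # rewind to the first line that is not a '##' header
    return pd.read_csv(f, sep='\t')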

Afterwards, I tried dask instead of pandas. However, dask's read_csv method does not accept a buffer as input, so it fails.

    return dd.read_csv(f,sep='\t')

How can I open a large file as a buffer and merge the two dataframes?

Thank you!

Dana Blanc
  • Performing a merge is straightforward with dask. For example, [this SO post](https://stackoverflow.com/a/54467495/4057186) shows how to do this. – edesz Apr 22 '19 at 00:36
  • Why does the large `DataFrame` have to be provided as a buffer? Is this a requirement? Or, can you just read the file using dask `.read_csv` directly? – edesz Apr 22 '19 at 00:37
  • Could you please provide a sample (first 5 rows) of each file? – edesz Apr 22 '19 at 00:39
  • @edesz I provide it as a buffer because I am skipping a couple of lines (VCF header lines) – Dana Blanc Apr 22 '19 at 06:33

1 Answer


IIUC:

  • you know the line numbers that you want to skip
  • since these are VCF header lines, these lines occur only at the start of the file

So, you can still use dd.read_csv, since it accepts keyword arguments from pandas.read_csv, such as skiprows.

  • see this SO post for a pandas example with skiprows
    • if skiprows is an integer (e.g. 2), then .read_csv will skip the first 2 rows of the file
    • if skiprows is a list of integers (e.g. [2, 3]), then .read_csv will skip exactly those line numbers in the .csv file, counting from line number 0
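The two skiprows behaviours can be checked on a tiny in-memory example. This is only a sketch with plain pandas; the text, column names and values are made up for illustration and are not from the asker's files.

```python
import io

import pandas as pd

# Made-up tab-separated text: two leading '##' lines, a header, two data rows.
text = "##meta one\n##meta two\nid\tvalue\n1\t10\n2\t20\n"

# Integer: skip the first 2 lines, so "id<TAB>value" becomes the header.
df_int = pd.read_csv(io.StringIO(text), sep="\t", skiprows=2)

# List: skip the lines at 0-based positions 0, 1 and 3.
# Position 3 is the first data row, so only the row with id 2 remains.
df_list = pd.read_csv(io.StringIO(text), sep="\t", skiprows=[0, 1, 3])

print(list(df_int.columns))  # ['id', 'value']
print(len(df_int), len(df_list))  # 2 1
```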

So, you can read both files into dask DataFrames:

df_1 = dd.read_csv('file_1.csv', skiprows=2, sep='\t') # skips the first 2 lines of the file
df_2 = dd.read_csv('file_2.csv', skiprows=[10, 16]) # skips the 11th and 17th lines (0-based positions 10 and 16)

Then merge the two DataFrames with dask's .merge:

df_merged = dd.merge(df_1, df_2, left_on='abcd', right_on='abcde')

If this is what you are asking for, then you don't need to use a buffer.
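The approach above assumes the number of header lines is already known. If it is not, one option is to count the leading '##' lines first and pass that count as skiprows. The sketch below uses only the standard library; count_header_lines is a hypothetical helper name, and the sample file contents are invented for illustration.

```python
import os
import tempfile


def count_header_lines(path, prefix="##"):
    """Count the consecutive lines at the start of `path` that begin with `prefix`."""
    count = 0
    with open(path, "r") as f:
        for line in f:
            if not line.startswith(prefix):
                break  # stop at the first line that is not a header
            count += 1
    return count


# Throwaway sample file (made-up contents) to exercise the helper.
with tempfile.NamedTemporaryFile("w", suffix=".vcf", delete=False) as tmp:
    tmp.write("##fileformat=VCFv4.2\n##source=example\n#CHROM\tPOS\tID\n1\t100\trs1\n")
    sample_path = tmp.name

n_skip = count_header_lines(sample_path)
print(n_skip)  # 2
os.remove(sample_path)

# Assumed usage with dask (not run here):
# df = dd.read_csv(path_to_big_file, sep='\t', skiprows=n_skip)
```

Only the header lines are read, so this stays cheap even for a very large file.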

edesz