
I currently have a script that combines multiple CSV files into one. The script works fine, except that we run out of RAM very quickly once larger files start being used. This is a problem for one reason: the script runs on an AWS server, and running out of RAM means a server crash. Currently the file size limit is around 250 MB each, which limits us to 2 files. However, as the company I work for is in biotech and we're using genetic sequencing files, the files we use can range in size from 17 MB up to around 700 MB depending on the experiment. My idea was to load one dataframe into memory whole, then chunk the others and combine them iteratively, but this didn't work so well.

My dataframes are similar to this (they can vary in size, but some columns are always the same: "Mod", "Nuc" and "AA"):

+-----+-----+-----+-----+-----+-----+-----+-----+
| Mod | Nuc | AA  | 1_1 | 1_2 | 1_3 | 1_4 | 1_5 |
+-----+-----+-----+-----+-----+-----+-----+-----+
| 000 | ABC | ABC | 10  | 5   | 9   | 16  | 8   |
+-----+-----+-----+-----+-----+-----+-----+-----+
| 010 | CBA | CBA | 0   | 1   | 4   | 9   | 0   |
+-----+-----+-----+-----+-----+-----+-----+-----+

When combining the two frames I need them to merge on "Mod", "Nuc" and "AA", so that I end up with something similar to this:

+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| Mod | Nuc | AA  | 1_1 | 1_2 | 1_3 | 1_4 | 1_5 | 2_1 | 2_2 | 2_3 |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 000 | ABC | ABC | 10  | 5   | 9   | 16  | 8   | 5   | 29  | 0   |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 010 | CBA | CBA | 0   | 1   | 4   | 9   | 0   | 0   | 0   | 1   |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
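
For reference, here is a toy sketch of that three-column merge, using small made-up frames that mirror the sample tables above (the real files are of course much larger):

    import pandas as pd

    # Toy frames mirroring the two sample tables above (values are just the ones shown).
    df1 = pd.DataFrame({
        "Mod": ["000", "010"], "Nuc": ["ABC", "CBA"], "AA": ["ABC", "CBA"],
        "1_1": [10, 0], "1_2": [5, 1], "1_3": [9, 4], "1_4": [16, 9], "1_5": [8, 0],
    })
    df2 = pd.DataFrame({
        "Mod": ["000", "010"], "Nuc": ["ABC", "CBA"], "AA": ["ABC", "CBA"],
        "2_1": [5, 0], "2_2": [29, 0], "2_3": [0, 1],
    })

    # Merging on the three shared key columns gives the combined table shown above.
    merged = df1.merge(df2, on=["Mod", "Nuc", "AA"])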

I already have code to rename the headers, so I'm not worried about that. However, when I use chunks I end up with something closer to this:

+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| Mod | Nuc | AA  | 1_1 | 1_2 | 1_3 | 1_4 | 1_5 | 2_1 | 2_2 | 2_3 | 3_1 | 3_2 | 3_3 |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 000 | ABC | ABC | 10  | 5   | 9   | 16  | 8   | 5   | 29  | 0   | NA  | NA  | NA  |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 010 | CBA | CBA | 0   | 1   | 4   | 9   | 0   | NA  | NA  | NA  | 0   | 0   | 1   |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+

Basically, it treats each chunk as if it were a new file and not part of the same one.

I know why it's doing that, but I'm not sure how to fix it. Right now my code for chunking is really simple:

    file = "tableFile/123456.txt"
    initDF = pd.read_csv(file, sep="\t", header=0)
    file2 = "tableFile/7891011.txt"
    for chunks in pd.read_csv(file2, sep="\t", chunksize=50000, header=0):
        initDF = initDF.merge(chunks, how='right', on=['Mod', "Nuc", "AA"])

As you can see, it's pretty bare bones. As I said, I know why it's doing what it's doing, but I'm not experienced enough with pandas or dataframe joins to fix it, so any help would be much appreciated. I also couldn't find anything like this while searching Stack Overflow and Google.

Peter Waters
  • Could you try adding copy() on the last line, to tell pandas not to update the existing dataframe but to create a new one from the added chunk? – Florian Bernard Oct 17 '19 at 22:41
  • Sorry, but do I add copy() to initDF or to chunks? I'll try both, but I just want to confirm in case I'm completely off base with both. – Peter Waters Oct 17 '19 at 22:51
  • 1
    initDF = initDF.merge(chunks, how='outer', on=['Mod', "Nuc", "AA"]).copy() – Florian Bernard Oct 17 '19 at 22:52
  • 1
    Thanks for confirming. Just tried it and there wasn't any difference on the output. – Peter Waters Oct 17 '19 at 23:03
  • 1
    initDF = initDF.merge(chunks, how='outer', on=['Mod', "Nuc", "AA"], indicator="both")if that do not work try this as well. initDF = initDF.merge(chunks, how='outer', on=['Mod', "Nuc", "AA"], copy=False) – Florian Bernard Oct 17 '19 at 23:09
  • 1
    Thank you for your help Florian, I tried both methods but they didn't work. I figured it out though. Basically by checking the column names I can decide whether to use merge() or concat(). So if the columns from the chucks exist in initDF then I know I've already merged once and then concat() will make it work as intended – Peter Waters Oct 18 '19 at 00:10
  • Sorry I was not helpful, but it's good news that you have found the answer. Edit your post or add an answer; it may help someone else. – Florian Bernard Oct 18 '19 at 00:23
  • 1
    @PeterWaters try forcing dtypes of the columns while you load them into memory. Enforcing proper dtypes help you save a lot of memory. – Mahendra Singh Oct 18 '19 at 02:41
  • 1
    Hi Mahendra, in the live version of the code (the none chunked version) I force dtypes, it does save A LOT of memory. @FlorianBernard I spoke too soon, it looked like switching from merge() to concat() had fixed the problem at a glance, though once I checked the output thoroughly it didn't, it just gave me a new error where the concatted dataframe would wipe the shared columns from the initial dataframe.. so one step forward two steps back – Peter Waters Oct 18 '19 at 16:15
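
As an aside on the dtype suggestion in the comments above, a minimal sketch of forcing dtypes at read time might look like this (the specific dtypes and the numbered column names are assumptions based on the sample tables, not the real schema):

    import pandas as pd

    # Illustrative dtypes only: key columns as categories, count columns as 32-bit ints.
    dtypes = {
        "Mod": "category", "Nuc": "category", "AA": "category",
        "1_1": "int32", "1_2": "int32", "1_3": "int32", "1_4": "int32", "1_5": "int32",
    }
    df = pd.read_csv("tableFile/123456.txt", sep="\t", header=0, dtype=dtypes)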

2 Answers


The solution is to do it in chunks like you are, but to concat the merged output into a new DataFrame, like so:

file = "tableFile/123456.txt"
initDF = pd.read_csv(file, sep="\t", header=0)
file2 = "tableFile/7891011.txt"

amgPd = pd.DataFrame()

# merge each chunk against the full first file, then append the result to amgPd
for chunks in pd.read_csv(file2, sep="\t", chunksize=50000, header=0):
    amgPd = pd.concat([amgPd, initDF.merge(chunks, how='right', on=['Mod', "Nuc", "AA"])])
PhoenixCoder
  • Thanks @PhoenixCoder. I fixed the missing ")" and ran the code, but it's been nearly an hour and it's still running. Something is not right here. – Susheel Busi Feb 08 '21 at 11:05

PhoenixCoder's solution worked for the problem, but I want to suggest a little speedup. The function pd.concat is quite expensive in terms of running time, so call it just once at the end:

file = "tableFile/123456.txt"
initDF = pd.read_csv(file, sep="\t", header=0)
file2 = "tableFile/7891011.txt"

# collect the merged chunks in a list and concatenate them once at the end
amgResultList = []
for chunks in pd.read_csv(file2, sep="\t", chunksize=50000, header=0):
    amgResultList.append(initDF.merge(chunks, how='right', on=['Mod', "Nuc", "AA"]))

amgPd = pd.concat(amgResultList)

A good explanation of why repeated concatenation is slow can be found here.

t2solve