
I have a question about speed in Python. I have two lists of lists with data, which look like this:

GCA_NUMBER.VERSION name sth_else etc. (FILE A - 170k lines) 
GCF_NUMBER.VERSION name sth_else etc. (FILE B - 450k lines)

The goal is to eliminate duplicates from file A that occur in file B, e.g.:

GCA_0000025.1
GCF_0000025.5

I only care about the NUMBER part, but I cannot lose the other information, like the name.

I tried two approaches:

for i in FILE_A:
    for j in FILE_B:
        if i[0] == j[0]:
            ...  # then do something

which took about 17 minutes, and a second one:

tmp_lst = [i[0] for i in FILE_B]
for i in FILE_A:
    if i[0] not in tmp_lst:
        ...  # then do something

which took about 13 minutes. Is there a faster way?

Aron

1 Answer


Several good options can be found here: How can I compare two lists in Python and return matches. They can be adjusted to accomplish your goal.

The answer by Joshmaker dives into the performance of a few options with larger datasets.
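
For this particular problem, the biggest win in pure Python is usually replacing the list membership test with a set, which makes each lookup O(1) on average instead of scanning the 450k entries of FILE B for every row of FILE A. A minimal sketch, assuming rows are lists whose first field is the accession and that the NUMBER part sits between the underscore and the dot (the sample rows below are made up for illustration):

FILE_A = [["GCA_0000025.1", "name1"], ["GCA_0000077.2", "name2"]]  # sample rows
FILE_B = [["GCF_0000025.5", "name1"]]

def number_part(accession):
    # "GCA_0000025.1" -> "0000025": drop the GCA_/GCF_ prefix and the .VERSION suffix
    return accession.split("_", 1)[1].split(".", 1)[0]

numbers_in_b = {number_part(row[0]) for row in FILE_B}  # set: O(1) average membership tests
unique_to_a = [row for row in FILE_A if number_part(row[0]) not in numbers_in_b]

With a set, the whole pass is roughly O(len(FILE_A) + len(FILE_B)), so it should take seconds rather than minutes for files of this size.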

A solution I have used is to read the files as pandas DataFrames and concatenate them with an outer join while dropping duplicates. This was fairly efficient for datasets of ~2-10k lines.

import pandas as pd

def compareFile(newDataframe, oldDataframe):
    # Stack both frames, then drop rows that share the same key columns.
    combinedDataframe = pd.concat([newDataframe, oldDataframe], sort=True, axis=0,
                                  ignore_index=True, join="outer")
    return combinedDataframe.drop_duplicates(subset=["Date", "Facility", "Measure", "Procedure"]).reset_index()
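
For reference, a hypothetical invocation; the file names and the Date/Facility/Measure/Procedure column layout are assumptions about my own dataset, not the asker's files:

new_df = pd.read_csv("new_export.csv")  # hypothetical input files containing the
old_df = pd.read_csv("old_export.csv")  # key columns used in compareFile
combined = compareFile(new_df, old_df)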
jdpy19