
I am using pandas to read in two text files of 600K rows each, and I am trying to find the rows that have matching values between the two files. My current implementation is incredibly slow, and I was wondering if there is any way to use numpy to speed it up?


import pandas as pd

genes1 = pd.read_csv("gen1.txt", sep=" ", names=["id", "chromosome", "genotype"])
genes1 = genes1[~genes1.chromosome.isin(['X', 'Y'])]  # filter out unwanted data
genes2 = pd.read_csv("gen2.txt", sep=" ", names=["id", "chromosome", "genotype"])
genes2 = genes2[~genes2.chromosome.isin(['X', 'Y'])]  # filter out unwanted data


match = 0
for ind in genes1.index:  # loop through each row of the dataframe
    if genes1["id"][ind] in genes2["id"].values:  # if the given id of genes1 exists in genes2
        gen1_string = str(genes1["genotype"][ind])
        gen2_string = str(genes2["genotype"][genes2.index[genes2["id"] == genes1["id"][ind]].tolist()[0]])
        if gen1_string == gen2_string or gen1_string == "".join(reversed(gen2_string)):
            match += 1

Thanks in advance for any help.

I have also tried to split the data into the rows shared by both files, plus the rows that exist only in text 1 or only in text 2, using:

# genes1_and_2 = genes1[genes1["id"] == genes2["id"]]
# genes1_only = genes1[genes1["id"] != genes2["id"]]
# genes2_only = genes2[genes1["id"] != genes2["id"]]

However, this is not quite working, and I don't believe it would give much of a performance improvement anyway.

Here is an example of the text files, separated by " ":

rs12564807  1   AA
rs3131972   1   GG
rs148828841 1   CT
rs12124819  Y   AG
rs115093905 X   GG
Kevin Lin
  • Could you please describe in words what you want to perform? You don't want to change any values, right? You just want to count something via `match` increasing for every ... ? – Tarquinius Mar 01 '23 at 17:24
  • yes, my goal is just to count the number of matches. A match is defined as the same id existing in both text1 and text2; for a row with the same id, if the genotype matches, increment match by 1, and if it doesn't, increment mismatch by 1. I also have a similar metric tracking whether an id exists in only one file (not shown in the code, but it is essentially an else statement after the `if genes1["id"][ind] in genes2["id"].values`) – Kevin Lin Mar 01 '23 at 17:32
  • Let's say text1 has one entry with ID=3456. Is it possible that in text2 ID=3456 is found multiple times? Or are IDs unique? – Tarquinius Mar 01 '23 at 17:36
  • ids are unique. I added an example of what the file looks like in the post. – Kevin Lin Mar 01 '23 at 17:38

2 Answers


Numpy does not have a dataframe type, so it cannot directly replace pandas. Using another dataframe implementation may improve performance, though, and I recommend trying Polars; in my experience it gives a 2-5x speedup over pandas.
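
A minimal Polars sketch of the same pipeline (my own illustration, not tested against your files; it assumes a recent Polars release and two-character genotypes as in your sample):

import polars as pl

cols = ["id", "chromosome", "genotype"]
# infer_schema_length=0 reads every column as a string, since "chromosome" mixes digits with X/Y
genes1 = pl.read_csv("gen1.txt", separator=" ", has_header=False, new_columns=cols, infer_schema_length=0)
genes2 = pl.read_csv("gen2.txt", separator=" ", has_header=False, new_columns=cols, infer_schema_length=0)

genes1 = genes1.filter(~pl.col("chromosome").is_in(["X", "Y"]))
genes2 = genes2.filter(~pl.col("chromosome").is_in(["X", "Y"]))

# inner join keeps only the ids present in both files
genes = genes1.join(genes2, on="id", suffix="_2")

# a two-character genotype reversed is just its two slices swapped
reversed_g2 = pl.col("genotype_2").str.slice(1, 1) + pl.col("genotype_2").str.slice(0, 1)
match = genes.filter(
    (pl.col("genotype") == pl.col("genotype_2")) | (pl.col("genotype") == reversed_g2)
).height

print(match)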

However, in your case I think the performance problem is in the algorithm: evaluating `genes2["id"] == genes1["id"][ind]` builds a whole boolean Series on every iteration, so the algorithm effectively has quadratic complexity.
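
One cheap fix is to replace that per-row Series scan with a dictionary lookup, which is O(1) on average; a minimal sketch (this is essentially the approach the asker mentions in the comments below):

id_to_genotype = dict(zip(genes2["id"], genes2["genotype"].astype(str)))  # one O(n) pass over genes2

match = 0
for gen1_id, gen1_string in zip(genes1["id"], genes1["genotype"].astype(str)):
    gen2_string = id_to_genotype.get(gen1_id)  # O(1) average probe instead of scanning a Series
    if gen2_string is not None and gen1_string in (gen2_string, gen2_string[::-1]):
        match += 1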

Alternatively, you can join the genes1 and genes2 tables on the "id" column:

genes1.set_index("id").join(genes2.set_index("id"), lsuffix="_1", rsuffix="_2")

and then iterate over the resulting dataframe.

Complete example:

import pandas as pd

genes1 = pd.read_csv("gen1.txt", sep=" ", names=["id", "chromosome", "genotype"])
genes1 = genes1[~genes1.chromosome.isin(['X', 'Y'])]  # filter out unwanted data
genes2 = pd.read_csv("gen2.txt", sep=" ", names=["id", "chromosome", "genotype"])
genes2 = genes2[~genes2.chromosome.isin(['X', 'Y'])]  # filter out unwanted data

genes = genes1.set_index("id").join(genes2.set_index("id"), lsuffix="_1", rsuffix="_2")  # default join is a left join; ids missing from genes2 get NaN in the _2 columns

match = 0
for _id, row in genes.iterrows():  # loop through each row of the joined dataframe
    gen1_string = str(row["genotype_1"])
    gen2_string = str(row["genotype_2"])  # a NaN (id missing from genes2) becomes "nan" and never compares equal
    if gen1_string == gen2_string or gen1_string == "".join(reversed(gen2_string)):
        match += 1

print(match)
tla
  • Thank you for the suggestion, it works! I also found out that using a dictionary to map the id to the genotype also makes it a lot faster – Kevin Lin Mar 01 '23 at 21:59

I tried to suggest an edit to tla's very nice join-based answer, but it got rejected. I think you can further improve that answer by avoiding the for-loop entirely.

import pandas as pd

genes1 = pd.read_csv("gen1.txt", sep=" ", names=["id", "chromosome", "genotype"])
genes1 = genes1[~genes1.chromosome.isin(['X', 'Y'])]  # filter out unwanted data
genes2 = pd.read_csv("gen2.txt", sep=" ", names=["id", "chromosome", "genotype"])
genes2 = genes2[~genes2.chromosome.isin(['X', 'Y'])]  # filter out unwanted data

genes = genes1.set_index("id").join(genes2.set_index("id"), lsuffix="_1", rsuffix="_2")
# credit to tla for above

filtering = (
    (genes["genotype_1"] == genes["genotype_2"]) |
    (genes["genotype_1"] == genes["genotype_2"].str.slice(step=-1))  # vectorized reverse; NaN from the left join stays NaN and never matches
)
match = filtering.values.sum()

print(match)

Looping through dataframes is considered slow, so I would apply vectorized filters instead. `filtering` is a boolean Series: True where the row is a match, False where it is not. `filtering.values.sum()` then just counts the number of True values.
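
The comments under the question also ask for mismatch and one-file-only counts. The same vectorized idea extends to all four numbers via an outer merge with pandas' `indicator` flag; a minimal sketch, assuming genes1/genes2 are loaded and filtered as above:

merged = genes1.merge(genes2, on="id", how="outer", suffixes=("_1", "_2"), indicator=True)

genes1_only = (merged["_merge"] == "left_only").sum()   # ids present only in gen1.txt
genes2_only = (merged["_merge"] == "right_only").sum()  # ids present only in gen2.txt

both = merged[merged["_merge"] == "both"]               # ids present in both files
is_match = (
    (both["genotype_1"] == both["genotype_2"]) |
    (both["genotype_1"] == both["genotype_2"].str.slice(step=-1))
)
match = is_match.sum()
mismatch = (~is_match).sum()

print(match, mismatch, genes1_only, genes2_only)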

Tarquinius
  • I rejected the edit because I tried it and `or` returns a single boolean instead of operating on the elements of Series. See related question: https://stackoverflow.com/questions/21415661/logical-operators-for-boolean-indexing-in-pandas – tla Mar 02 '23 at 10:16
  • You are correct. I adjusted my answer. Do you then approve? – Tarquinius Mar 02 '23 at 10:47