I am using a pandas DataFrame to read in two text files that have 600K rows each, and I am trying to find the rows that have matching values between the two files. My current implementation is incredibly slow, and I was wondering if there is any way to use numpy to speed this up?
import pandas as pd

genes1 = pd.read_csv("gen1.txt", sep=" ", names=["id", "chromosome", "genotype"])
genes1 = genes1[~genes1.chromosome.isin(['X', 'Y'])]  # filter out unwanted chromosomes
genes2 = pd.read_csv("gen2.txt", sep=" ", names=["id", "chromosome", "genotype"])
genes2 = genes2[~genes2.chromosome.isin(['X', 'Y'])]  # filter out unwanted chromosomes

match = 0
for ind in genes1.index:  # loop through each row of the dataframe
    if genes1["id"][ind] in genes2["id"].values:  # if the given id of genes1 exists in genes2
        gen1_string = str(genes1["genotype"][ind])
        # genotype of the first genes2 row that has the same id
        gen2_string = str(genes2["genotype"][genes2.index[genes2["id"] == genes1["id"][ind]].tolist()[0]])
        # count a match if the genotypes are equal, directly or reversed
        if gen1_string == gen2_string or gen1_string == "".join(reversed(gen2_string)):
            match += 1
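One partial improvement I tried on my own (a sketch, not yet tested on the full files; it reuses genes1 and genes2 as loaded above) was to index genes2 by id so the inner lookup becomes a hash lookup instead of scanning all 600K rows on every iteration, but it is still a Python-level loop:

# Keep only the first genes2 row per id (matching the .tolist()[0] above),
# then index by id so each lookup is O(1) instead of a full column scan
genes2_by_id = genes2.drop_duplicates("id").set_index("id")

match = 0
for _, row in genes1.iterrows():
    if row["id"] in genes2_by_id.index:
        gen1_string = str(row["genotype"])
        gen2_string = str(genes2_by_id.loc[row["id"], "genotype"])
        if gen1_string == gen2_string or gen1_string == gen2_string[::-1]:
            match += 1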
Thanks for any help in advance.
I have also tried to split the data into the rows that text 1 and text 2 share, and the rows that exist only in text 1 or only in text 2, using:
genes1_and_2 = genes1[genes1["id"] == genes2["id"]]
genes1_only = genes1[genes1["id"] != genes2["id"]]
genes2_only = genes2[genes1["id"] != genes2["id"]]
However, this is not quite working (as far as I can tell, pandas aligns the two columns row by row by index rather than looking ids up), and it did not seem to give much of a performance improvement either.
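For reference, this is roughly the vectorized direction I was hoping for (a sketch only, assuming ids are unique within each file so the inner join produces one row per shared id; genes1 and genes2 are the DataFrames loaded above):

# Inner join on the shared id column keeps only ids present in both files
merged = genes1.merge(genes2, on="id", suffixes=("_1", "_2"))

gen1 = merged["genotype_1"].astype(str)
gen2 = merged["genotype_2"].astype(str)

# A match is an equal genotype, either directly or with the string reversed
matches = (gen1 == gen2) | (gen1 == gen2.str[::-1])
match = int(matches.sum())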
Here is an example of the text file, separated by " ":
rs12564807 1 AA
rs3131972 1 GG
rs148828841 1 CT
rs12124819 Y AG
rs115093905 X GG
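In case it helps to reproduce, this is how the sample above loads and filters (a self-contained check that substitutes io.StringIO for the real file, just for illustration):

import io
import pandas as pd

sample = """rs12564807 1 AA
rs3131972 1 GG
rs148828841 1 CT
rs12124819 Y AG
rs115093905 X GG"""

genes = pd.read_csv(io.StringIO(sample), sep=" ", names=["id", "chromosome", "genotype"])
genes = genes[~genes.chromosome.isin(["X", "Y"])]  # the last two rows are dropped
print(genes)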