Join big dataframes based on partial string-match between columns

Question

Two DataFrames have gene and isoform names that are not formatted the same way. I'd like to do a join and add the df2 columns name, isoform for all partial string matches between the isoform (df2) and the name (df1) in both DataFrames. df2 is a key for the isoforms/genes, where a gene can have many isoforms. In df1, basically an output from a gene-quantification software (SALMON) the name field has both, the gene and isoform in it. I cant use regex since isoforms have variable suffixs, such as ".","_", "-", and many others. Another important piece of information is that each df1["Name"] cell has a unique isoform.

Piece of dfs to merge:

import pandas as pd

df1 = pd.DataFrame({'Name': {0: 'AT1G01010;AT1G01010.1;Isoseq::Chr1:3616-5846', 1: 'AT1G01010;AT1G01010_2;Isoseq::Chr1:3630-5894', 2: 'AT1G01010;AT1G01010.3;Isoseq::Chr1:3635-5849', 3: 'AT1G01020;AT1G01020.11;Isoseq::Chr1:6803-8713', 4: 'AT1G01020;AT1G01020.13;Isoseq::Chr1:6811-8713'}, 'Length': {0: 2230, 1: 2264, 2: 2214, 3: 1910, 4: 1902}, 'EffectiveLength': {0: 1980.0, 1: 2014.0, 2: 1964.0, 3: 1660.0, 4: 1652.0}, 'TPM': {0: 2.997776, 1: 1.58178, 2: 0.0, 3: 4.317311, 4: 0.0}, 'NumReads': {0: 154.876, 1: 83.124, 2: 0.0, 3: 187.0, 4: 0.0}})
df2 = pd.DataFrame({'gene': {0: 'AT1G01010', 14: 'AT1G01010', 30: 'AT1G01010', 46: 'AT1G01020', 62: 'AT1G01020', 80: 'AT1G01020', 100: 'AT1G01020', 116: 'AT1G01020', 138: 'AT1G01020', 156: 'AT1G01020'}, 'isoform': {0: 'AT1G01010.1', 14: 'AT1G01010_2', 30: 'AT1G01010.3', 46: 'AT1G01020.1', 62: 'AT1G01020.10', 80: 'AT1G01020.11', 100: 'AT1G01020.12', 116: 'AT1G01020.13', 138: 'AT1G01020.14', 156: 'AT1G01020.15'}})
display(df1)
display(df2)

Desired output:

df3 = pd.DataFrame({'gene': {0: 'AT1G01010', 1:"AT1G01010", 2:"AT1G01010", 3:"AT1G01020", 4:"AT1G01020"},'isoform': {0: 'AT1G01010.1',1:"AT1G01010_2", 2:"AT1G01010.3", 3:"AT1G01020.11", 4:"AT1G01020.13"}, 'Length': {0: 2230, 1: 2264, 2: 2214, 3: 1910, 4: 1902}, 'EffectiveLength': {0: 1980.0, 1: 2014.0, 2: 1964.0, 3: 1660.0, 4: 1652.0}, 'TPM': {0: 2.997776, 1: 1.58178, 2: 0.0, 3: 4.317311, 4: 0.0}, 'NumReads': {0: 154.876, 1: 83.124, 2: 0.0, 3: 187.0, 4: 0.0}})
#"Name" column from df1 is not necessary anymore. (the idea is to replace it for gene and isoform)
display(df3)

Real dfs size:

df1 = 143646 rows × 5 columns

df2 = 169499 rows × 2 columns

(since df1 may not have all the isoforms detected, it's always smaller than df2)

I tried some answers i found online, but since this dfs have a huge size, many need 50gb of RAM or so...

Already checked: Merge Dataframes Based on Partial Substrings Match, Join to Dataframes based on partial string matches in python, Join dataframes based on partial string-match between columns

Thanks for the help!

Is it expected that `AT1G01020.1` matches `AT1G01020.11`? and `AT1G01020.10` matches `AT1G01020.13`? — mozway, Mar 02 '22 at 13:08
No, sorry (edited): **AT1G01020.11** and **AT1G01020.1** are different isoforms. I added more rows to df2 to clarify this example. Thanks! — Lucas Servi, Mar 02 '22 at 13:12
Yes. Partial is requested since df1 has a large field that may not always be separated by ";". Isoform suffixes may also contain leters such as "_ID2". — Lucas Servi, Mar 02 '22 at 13:16
How else can it be separated? You need to come a with a logic at some point ;) — mozway, Mar 02 '22 at 13:17
For sure it's a special character (such as ";", ",", "-"), sometimes (the easy ones), the output just gives you the isoform. This depends on the Annotation used, this may vary from species to species, my idea behind this is to be as general as possible. *Special character* as separators is a good way to go i guess. (I've been struggling many days with this...). Thanks! — Lucas Servi, Mar 02 '22 at 13:22

Join big dataframes based on partial string-match between columns

0 Answers0