Pandas left join with wildcard string match

Question

I'm new to using pandas. I'm trying to search for a substring in one dataframe using a string from a different dataframe.

screenshot of dataframes

Then, I want to merge those two dataframes based upon this match. When merging, for the rows in one dataframe that don't match up with a row in the other dataframe, I want those rows to display Null values.

Basically, I want to take the data from the parcel column and use it to search the the strings within the parcel_id column of the other dataframe. If a match is found, I want to merge these 2 rows into 1 row. For those in the first dataframe that don't have a match in the other dataframe, I want Null values used in those rows when merging. How do I do this?

Add the dataframe as text rather than image. Create a [simple, reproducible example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). — Vishnudev Krishnadas, Nov 29 '20 at 12:26
Merging on a wildcard column is not possible as such i think, I would rather modify the columns so you have exact matches — FloLie, Nov 29 '20 at 12:27
From the images I suspect you try to machen parcel from parcels and parcel_id from sales, which is a compound of parcel address and city. If the parcel field is always equally defined (2 integers - white space - 12 characters) you could just do x[:15] — FloLie, Nov 29 '20 at 12:31

FloLie · Accepted Answer · 2020-11-29T12:52:23.317

As you are new, here are some hints for asking such a question:

As pointed out in the comments, you should add a minimal example such as below, that everybode can import, look at and play around with.

Also you should provide the ways you already tried to do so.

First of, wildcard merges to my knowledge are not possible, therefore data cleaning is required. I did so with the sales['parcel_id_new'] column that is just the ìd part of the string.

Then you use a left join merge, as it takes all entries from the first left table and matches all from right leaving NAN in all others

parcels = pd.DataFrame([
    {'parcel': '01 01234566789AB', 'other column': 'Package A'},
    {'parcel': '02 01234566789AB', 'other column': 'Package B'},
    ])

print(parcels)

sales = pd.DataFrame([
    {'parcel_id': '01 01234566789AB OTHER UNIMORTANT TEXT', 'other sales column': 'Sales A'},
    ])

sales['parcel_id_new'] = sales['parcel_id'].apply(lambda x: x[:16]) 
print(sales)

full_table = pd.merge(how='left',
         left= parcels, 
         right=sales, 
         left_on='parcel', right_on='parcel_id_new')
print(full_table)

Output:

PARCELS:
   parcel              other column
0  01 01234566789AB    Package A
1  02 01234566789AB    Package B

SALES:
   parcel_id other                         sales column     parcel_id_new
0  01 01234566789AB OTHER UNIMORTANT TEXT  Sales A          01 01234566789AB

RESULT:
             parcel    other column  parcel_id                               other sales column     parcel_id_new  
0  01 01234566789AB    Package A     01 01234566789AB OTHER UNIMORTANT TEXT  Sales A                01 01234566789AB 
1  02 01234566789AB    Package B     NaN                                     NaN                    NaN

This helped out a lot. It's not exactly what I needed, but enough to get me there. Thanks! — NoSoyTuMadre, Nov 29 '20 at 13:01
Welcome! If you tell me what assumption is wrong, I am happy to adjust the answer, otherwise I am happy for an upvote — FloLie, Nov 29 '20 at 13:07
Oh...the answer is great! I just had to adjust it to an outer join. What do the suffixes do? — NoSoyTuMadre, Nov 29 '20 at 13:20
Perfect, the suffixes are just in case you have equally named columns in both tables (such as 'id', 'name', etc.) if those occur, they are suffixed with these values --> id_right, id_left for example — FloLie, Nov 29 '20 at 13:25

Pandas left join with wildcard string match

1 Answers1