Pandas extract rows from df where df['col'] values match df2['col'] values

Question

Finding this one difficult and would like to see if you can help, pls:

I have two df's, dfA and dfB. In each there is a telnum, file, and datetime column, but only dfA contains a name column and only dfA has complete telephone number string values unlike dfB which sometimes has partially complete telephone numbers up to x of n digits, e.g. sometimes the int'l calling code is missing, other times a '0' is present instead of the int'l calling code, and other times neither int'l calling code or the leading '0' is present.

dfA also contains more data than dfB (<10 rows), but between the two there are complete timestamp, file, and telnum columns always with a date-time and string value respectively (even though dfB['telnum'] is not fully complete as stated above).

What I want to do is extract the rows from dfA where dfB['telnum'] matches dfA['telnum'] but because dfB['telnum'] is not always complete then I need to check for matches as a substring of dfA['telnum'] too.

I would like the result to be dfResult but where the returned results have dfA values on the left and dfB on the right so that I can see the different file and datetime values.

Any ideas?

EDIT:

I think I need an inner merge, e.g.

pandas.merge(dfA, dfB, on='telnum', how='inner')

However, because dfB['telnum'] isn't always complete telnum string, the results are not complete. How can I get those that match between the two by checking if dfB['telnum'] is a substring of dfA['telnum'] as well?

Could you share what code you've already tried and why I didn't work? — rajah9, Jul 23 '19 at 10:17

score 1 · Accepted Answer · answered Jul 23 '19 at 10:13

Before doing any complicated joins you need to clean your data. I'm not sure how it is in your country, but in my country, if you remove the international prefix, or the leading zeroes, a phone number becomes 9 digits long.

Here's what I suggest you do:

Clean your 'telnum' columns:

def remove_non_numbers(string):
    result = ""
    for char in string:
        if char.isnumeric():
            result += char
    return result

# Do this for both dataframes just in case
df["telnum"] = df["telnum"].apply(remove_non_numbers)

Remove any digits before the last 9:

def limit9(string):
    if len(string) > 9:
        return string[-9:]
    return string

# Do this for both dataframes
df["telnum"] = df["telnum"].apply(limit9)

Now you can try to merge your cleaned DFs:

dfResult = pd.merge(dfA, dfB, on='telnum', how='inner')

Good luck! Let me know if it works.

Nice approach. And I completely agree the solutions will largely depend on how the phone numbers look like in the data. Nevertheless, good job with the approach :) — Ankur Sinha, Jul 23 '19 at 10:16
Thanks. I was working on my code in the interim since my post, and I came to the same conclusion - remove the leading calling code or zero to leave behind the remaining digits then do the match on that. — uncle-junky, Jul 23 '19 at 10:25

score 1 · Answer 2 · answered Jul 23 '19 at 10:15

Since you have not provided any sample input or output, based on the description, I am making an attempt here. This is a multi fold problem and there can be many approaches depending on how your phone numbers look like.

Let me have two dataframes here (skipping your other columns): dfA and dfB.

dfA:

           telnum
0   0049123456789
1  00919444454555
2   0092789742893

dfB:

        telnum
0    123456789
1  09444454555

First, let's clean up the numbers in dfB. dfB can have numbers which are:

number with ISD code - like 00918888888888
number with 0 in front of it 08888888888
number alone - 8888888888

I am stripping off every number that starts with 0. So anything that has an international code 0091 becomes 091, or has 091 and becomes 91, and has 0888888888 becomes 888888888. What this does is, since your dfA has the complete number with ISD code which I assume is off the form 00xxNNNNNNNN - format, a substring match should be found easily.

dfB['telnum'] = dfB['telnum'].apply(lambda x : x[1:] if x.startswith("0") else x)

dfB now looks like (notice the first occuring 0 has been stripped):

       telnum  
0   123456789 
1  9444454555

Take all the numbers from dfB and convert them to a list:

tempList = dfB['telnum'].tolist()

Now look for the match in the other dataframe:

dfA[dfA.telnum.str.contains('|'.join(tempList))]

You should now get:

           telnum
0   0049123456789
1  00919444454555

So you have found those numbers for which you got a match. You can now proceed with these rows and perform the necessary join with other columns or anything you want.

Pandas extract rows from df where df['col'] values match df2['col'] values

2 Answers2

Linked