Python Pandas compare two dataframes with a similar (string) column, where 1 df's values are substrings of the other df's values

Question

I have a large CSV with phone number prefixes, and corresponding country names. Then I have a smaller dataframe with prefixes, and corresponding country names that are spelled wrong. I need to find a way to replace the smaller df's country names with the ones from the CSV. Unfortunately, I can't simply compare csv_df['prefix'] == small_df['prefix'].

The smaller dataframe contains very long, specific prefixes ("3167777"), while the CSV has less specific prefixes ("31"). As you can imagine, a prefix like 3167777 is a subcategory of 31, and therefore belongs to the CSV's 31-group. However, if the CSV contains a more specific prefix ("316"), then the longer prefix should only belong to that group.

Dummy example data:

# CSV-df with correct country names
prefix      country
93,         Portugal
937,        Portugal Special Numbers
31,         Netherlands
316,        Netherlands Special Numbers

# Smaller df with wrong country names
prefix      country
93123,      PT
9377777,    PT
3121,       NL
31612345,   NL
31677777,   NL
31688888,   NL

Ideal result:

# Smaller df, but with correct country names
prefix      country
9312345,    Portugal
9377777,    Portugal Special Numbers
3121,       Netherlands
31612345,   Netherlands Special Numbers
31677777,   Netherlands Special Numbers
31688888,   Netherlands Special Numbers

So far, I have a very slow, non-elegant solution: I've put the smaller df's prefixes in a list, and loop through the list, constantly .

# small_df_prefixes = ['9312345', '3167', '31612345', ...]
for prefix in small_df_prefixes:
    # See if the prefix occurs in the CSV
    if prefix in [p[:len(prefix)] for p in csv_df['prefix']]:
        new_dest = csv_df.loc[csv_df['prefix'] == prefix]['destination'].values[0]
    else:
        new_dest = 'UNKNOWN'

This runs slowly and does not give me the solution I need. I've looked into panda's startswith and contains functions but can't quite figure out how to use them correctly here. I hope someone can help me out with a better solution.

Is is guaranteed that the prefixed are unique? Moreover, if they are unique, why would you have `Portugal` two times, both starting with `93`? Wouldn't it be easier if you had a key mapping like `{'PT': 'Portugal', 'NL': 'Netherlands'}`? — Felipe Whitaker, Oct 06 '21 at 20:46
Ah I made the dummy data a bit too dummy. Basically, there's some large prefix-categories in the CSV, every category is basically a phone provider in the country. There are so many providers that mapping them manually would be impossible... A prefix never occurs more than once in a df. However, the more provider-specific, long prefixes (316777) do have overlap with the basic country prefixes (316). — Sinceris, Oct 06 '21 at 20:51
But do you need to map `93` to `Portugal` and `937` to `Portugal Special Numbers`? — Felipe Whitaker, Oct 06 '21 at 20:53
Yes I do. So if I have new phone number, for example: 93777, it would be long to Portugal Special Numbers. But 93111 would belong to Portugal. — Sinceris, Oct 06 '21 at 20:56

Felipe Whitaker · Answer 1 · 2021-10-07T01:28:48.627

0

I believe that you should take a look into pandas.DataFrame.apply. It enable you to apply a function either column or row wise on a DataFrame.

The idea here is to order df_right (which has the right names) by the lenght of the prefix (the longer the more specific) and then tries to exactly match the prefix in df_wrong to a prefix in df_right.

import pandas as pd

df_wrong = pd.read_csv('path/to/wrong.csv')
# df_wrong
# prefix      country
# 9312345     PT
# 3167        NL
# ...         ...

df_right = pd.read_csv('path/to/right.csv')
# df_right
# prefix      country
# 93          Portugal
# 937         Portugal Special Numbers
# 31          Netherlands
# ...         ...

# because prefixes 93 != 937, we will sort values by their len, as 93 would contain 937
df_right.sort_values(by = 'prefix', key = lambda prefix: prefix.len(), inplace = True, ascending = False)

def correct_country(prefix):
    global df_right
    equal_match = df_right[df_right.prefix == prefix].country
    if equal_math.shape[0] == 0: # no matches
         return 'UNKNOWN'
    return equal_match[0]

df_wrong['corrected_country'] = df_wrong['prefix'].apply(correct_prefix)

Resources:

Sort Dataframe by String Lenght

edited Oct 07 '21 at 01:28

answered Oct 06 '21 at 21:07

Felipe Whitaker

470
3
9

I've tried running this but I'm getting some errors... Especially on the line: `df_right.sort_values(...` the lambda function is giving me a keyerror: 'prefix' – Sinceris Oct 06 '21 at 22:22
Ok after some tweaking i got it to work... However, the correct_country() function only looks at exact matches (example: 3167777 == 317777). However, I need it to look at the first n digits of the prefix: if 3167777 doesnt exist but 316 does, then I need 316777 to get 316's country label. – Sinceris Oct 06 '21 at 22:35
The line `equal_match = df_right[df_right.prefix == prefix].country` is the only one that needs changing. Instead of `==`, I need something that says "the longest prefix in df_right that matches the first digits of the current prefix". // Ive thought of using 'startswith', but I don't know how to apply it to get the result I want (the country). – Sinceris Oct 06 '21 at 22:43
I'm sorry about the `prefix` thing: I was not sure if pandas would pass the entire DataFrame or just the column. I believe that, instead of using the boolean part (inside the brackets), use `df_right.prefix.apply(lambda p: p.startswith(prefix))` which will return a boolean series where that column starts with the prefix you want. Then, you just need to get the lowest match (index `-1`) or get the first one if `ascending = True`. Hope this helps! Good coding! – Felipe Whitaker Oct 07 '21 at 01:27

Python Pandas compare two dataframes with a similar (string) column, where 1 df's values are substrings of the other df's values

1 Answers1