I have a large CSV with phone number prefixes, and corresponding country names.
Then I have a smaller dataframe with prefixes, and corresponding country names that are spelled wrong. I need to find a way to replace the smaller df's country names with the ones from the CSV. Unfortunately, I can't simply compare csv_df['prefix'] == small_df['prefix']
.
The smaller dataframe contains very long, specific prefixes ("3167777"), while the CSV has less specific prefixes ("31"). As you can imagine, a prefix like 3167777 is a subcategory of 31, and therefore belongs to the CSV's 31-group. However, if the CSV contains a more specific prefix ("316"), then the longer prefix should only belong to that group.
Dummy example data:
# CSV-df with correct country names
prefix country
93, Portugal
937, Portugal Special Numbers
31, Netherlands
316, Netherlands Special Numbers
# Smaller df with wrong country names
prefix country
93123, PT
9377777, PT
3121, NL
31612345, NL
31677777, NL
31688888, NL
Ideal result:
# Smaller df, but with correct country names
prefix country
9312345, Portugal
9377777, Portugal Special Numbers
3121, Netherlands
31612345, Netherlands Special Numbers
31677777, Netherlands Special Numbers
31688888, Netherlands Special Numbers
So far, I have a very slow, non-elegant solution: I've put the smaller df's prefixes in a list, and loop through the list, constantly .
# small_df_prefixes = ['9312345', '3167', '31612345', ...]
for prefix in small_df_prefixes:
# See if the prefix occurs in the CSV
if prefix in [p[:len(prefix)] for p in csv_df['prefix']]:
new_dest = csv_df.loc[csv_df['prefix'] == prefix]['destination'].values[0]
else:
new_dest = 'UNKNOWN'
This runs slowly and does not give me the solution I need. I've looked into panda's startswith
and contains
functions but can't quite figure out how to use them correctly here. I hope someone can help me out with a better solution.