-2

I have a column with text. This text can contain the name of countries. I want to have a list of all countries mentioned in a column in the same row as the text. I already have a series with the countries I want to extract.

    SomeText                          | ... | .... | CountryInText
    Something Canada                  |     |      |   
    RUSSIAAreACountry                 |     |      |   
    Mexicoand Brazil is South of USA




    SomeText                          | ... | .... | CountryInText
    Something Canada                  |     |      |  Canada 
    RUSSIAAreACountry                 |     |      |  Russia
    Mexicoand Brazil is South of USA  |     |      |  Mexico, Brazil, USA

I've tried with

pd.Series(df['SomeText'].str.findall(f"({'|'.join(countryname['CommonName'])})"))

However, this gives me a list of object that I can't match back to the original dataframe. The countryname['CommonName'] is a series of country names.

Can anyone help me ?

Thanks in advance

bjornasm
  • 2,211
  • 7
  • 37
  • 62
  • Is [this](https://stackoverflow.com/questions/59522281/how-to-loop-through-pandas-df-column-finding-if-string-contains-any-string-from/59522472?noredirect=1#comment107292510_59522472) what you're looking for? – Erfan Mar 12 '20 at 22:41
  • Why would you use `findall`? What happens if you have two country names in `SomeText`? – Quang Hoang Mar 12 '20 at 23:05
  • It looks like what you actually want may be different from how you phrased it. Based on your example, it appears that what you want is the rightmost column of a particular row to consist of all countries that appear in the leftmost column of that row. Is that correct? – Acccumulation Mar 12 '20 at 23:10
  • @Acccumulation Yes that is correct, sorry - I am updating the question now. – bjornasm Mar 12 '20 at 23:26

2 Answers2

2

An solution (with a small testing example) using re package (for more flexibility):

import pandas as pd
import re

df = pd.DataFrame({"SomeText": ["Something Canada", "RUSSIAAreACountry"]})
countryname = pd.Series({"CommonName": ["Canada", "Russia"]})
df["CountryInText"] = df["SomeText"].str.title().map(lambda x: 
                                         re.findall('|'.join(countryname['CommonName']), x, re.I))

UPDATE (based on Erfan's feedback in the comment):

import pandas as pd
import re

df = pd.DataFrame({"SomeText": ["Something Canada", "RUSSIAAreACountry"]})
countryname = pd.Series({"CommonName": ["Canada", "Russia"]})
df["CountryInText"] = df["SomeText"].str.title().str.findall('|'.join(countryname['CommonName']), re.I)

UPDATE 2 (based on the useful additional test cases posted by OP):

The approaches above would return Usa instead of USA. The one below takes care of that:

import pandas as pd

df = pd.DataFrame({"SomeText": ["Something Canada",
                                "RUSSIAAreACountry", 
                                "Mexicoand Brazil is South of USA"]})
countryname = pd.Series({"CommonName": ["Canada", "Russia", "Mexico", "Brazil", "USA"]})
df["CountryInText"] = df["SomeText"].map(lambda x: [c for c in countryname['CommonName'] 
                                                    if c.lower() in x.lower()])
datapug
  • 2,261
  • 1
  • 17
  • 33
  • 1
    Better use a native pandas method: `Series.str.findall` – Erfan Mar 12 '20 at 22:51
  • Yes besides that, but it was general comment, it does not make sense to use `re.findall` when we have a native pandas method @QuangHoang – Erfan Mar 12 '20 at 23:16
  • Why the `.title()` and `.lower()` ? – AMC Mar 13 '20 at 00:27
  • @AMC the `title()` was for returning a country name with only the initials as capital letters (e.g. the OP example RUSSIA -> Russia). But that approach was not addressing properly the scenario added later (USA -> USA). The last approach takes care of that test case too and it uses `lower()` to make the match case insensitive. – datapug Mar 13 '20 at 11:40
1

A bit too late and a dupe but I wrote the code, so I may as well:)

import pandas as pd
import re
countryname = pd.DataFrame(
    data={
        "Name": ["Rep. of Congo", "Russia Long", "Canada Long"],
        "CommonName": ["Congo", "Russia", "Canada"]})
df = pd.DataFrame(
    data={
        "SomeText": ["Something Canada", "RUSSIAAreACountry", "Rep ofIreland", "Unrelated"],
        "CountryInText": ["","","",""]})
names = "|".join(list(countryname["CommonName"]))

Would give you:

countryname:

            Name CommonName
0  Rep. of Congo      Congo
1    Russia Long     Russia
2    Canada Long     Canada

df:

            SomeText CountryInText
0   Something Canada              
1  RUSSIAAreACountry              
2      Rep ofIreland              
3          Unrelated 

names:

Congo|Russia|Canada

Then using findall and a simple function you can find all instances of strings in the common names and if anything is found pick the first one and make it a title case, or return an empty string if nothing is found. This approach ignores the All cap options and changes everything to Title Case. I also saw the rightmost name addition after I wrote the answer, so that is out as well.

# re.I is there to do case insensitive matching
df["CountryInText"] = df["SomeText"].str.findall(names, flags = re.I)
def cleanup(country_list):
    if len(country_list) > 0:
        return str(country_list[0])
    return ""
df["CountryInText"] = df["CountryInText"].apply(cleanup).apply(str.title)

Now df:

            SomeText CountryInText
0   Something Canada        Canada
1  RUSSIAAreACountry        Russia
2      Rep ofIreland              
3          Unrelated              
Sinan Kurmus
  • 585
  • 3
  • 11