I have two DataFrames. The first DataFrame, df_globe, looks like this:
Country Region
0 Andorra Europe
1 Andorran Europe
2 Andorrano Europe
3 United Arab Emirates MENA
4 Saudi Arabia MENA
5 u.a.e MENA
6 Democratic People's Republic of Korea East Asia
7 Puerto Rico Americas
.. ... ...
539 Americas Americas
540 MENA MENA
541 South Asia South Asia
542 Sub-Saharan Africa Sub-Saharan Africa
543 Pacific Pacific
The second DataFrame, df_tweets, looks like this:
sentiment id date text
0 0 1598814115664994307 2022-12-02 22:59:24+00:00 I think it is in South Asia
1 0 1598814115664994307 2022-12-02 22:59:24+00:00 Say hello to my friend in Puerto Rico
import pandas as pd
from nltk.tokenize import word_tokenize
from fuzzywuzzy import fuzz  # or: from thefuzz import fuzz

df_globe = pd.read_csv('fuzzyCountriesAndRegions.csv')
df_tweets = pd.read_csv('tweets.csv')
df_result = pd.DataFrame(columns=['sentiment', 'id', 'date', 'text'])
print(df_globe)
print(df_tweets)

word_list = []
for tweetIndex in range(len(df_tweets['text'])):
    word_tokens = word_tokenize(df_tweets['text'][tweetIndex])
    for token in word_tokens:
        for name in df_globe['Country']:
            ratio = fuzz.token_set_ratio(token, name)
            if ratio >= 90:
                # DataFrame.append is removed in pandas 2.x; concat is equivalent
                df_result = pd.concat([df_result, df_tweets.iloc[[tweetIndex]]])
                word_list.append(token)
df_result['word'] = word_list
print(df_result)
df_result looks like this when I run my code:
sentiment id date text word
0 0 1598814115664994307 2022-12-02 22:59:24+00:00 I think it is in South Asia South
0 0 1598814115664994307 2022-12-02 22:59:24+00:00 I think it is in South Asia South
0 0 1598814115664994307 2022-12-02 22:59:24+00:00 I think it is in South Asia South
0 0 1598814115664994307 2022-12-02 22:59:24+00:00 I think it is in South Asia South
0 0 1598814115664994307 2022-12-02 22:59:24+00:00 I think it is in South Asia South
0 0 1598814115664994307 2022-12-02 22:59:24+00:00 I think it is in South Asia South
0 0 1598814115664994307 2022-12-02 22:59:24+00:00 I think it is in South Asia South
0 0 1598814115664994307 2022-12-02 22:59:24+00:00 I think it is in South Asia South
0 0 1598814115664994307 2022-12-02 22:59:24+00:00 I think it is in South Asia Asia
0 0 1598814115664994307 2022-12-02 22:59:24+00:00 I think it is in South Asia Asia
0 0 1598814115664994307 2022-12-02 22:59:24+00:00 I think it is in South Asia Asia
1 0 1598814115664994307 2022-12-02 22:59:24+00:00 Say hello to my friend in Puerto Rico Puerto
1 0 1598814115664994307 2022-12-02 22:59:24+00:00 Say hello to my friend in Puerto Rico Puerto
1 0 1598814115664994307 2022-12-02 22:59:24+00:00 Say hello to my friend in Puerto Rico Puerto
1 0 1598814115664994307 2022-12-02 22:59:24+00:00 Say hello to my friend in Puerto Rico Rico
1 0 1598814115664994307 2022-12-02 22:59:24+00:00 Say hello to my friend in Puerto Rico Rico
But df_result should look like this (example of desired output):
sentiment id date text word
0 0 1598814115664994307 2022-12-02 22:59:24+00:00 I think it is in South Asia South Asia
1 0 1598814115664994307 2022-12-02 22:59:24+00:00 Say hello to my friend in Puerto Rico Puerto Rico
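The repeated rows happen because each single-word token scores a full match against every country name that contains it under token-set scoring. A minimal, self-contained sketch of that effect, using a simplified stand-in for fuzz.token_set_ratio and a hypothetical country list (not the real contents of fuzzyCountriesAndRegions.csv):

```python
# Simplified stand-in for fuzz.token_set_ratio: score 100 when one string's
# tokens are a subset of the other's, which is what makes "South" a 100%
# match for "South Asia".
def token_set_score(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return 100 if ta <= tb or tb <= ta else 0

# Hypothetical country list, standing in for df_globe['Country'].
countries = ["South Asia", "South Africa", "South Sudan", "Puerto Rico"]

# Matching token by token, as in the loop above: every single-word token that
# appears in a multi-word name scores a full match, so one row gets appended
# per (token, country) pair -- hence the repeated rows.
hits = [(tok, name)
        for tok in "I think it is in South Asia".split()
        for name in countries
        if token_set_score(tok, name) >= 90]
print(hits)
```

With this toy list, "South" alone produces three hits and "Asia" a fourth, which mirrors the duplication in the actual output.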
I've thought of removing all the spaces, performing the matching, and then retokenizing the tweets, but I haven't found a way to retokenize them once the spaces are gone, so I don't think that approach is feasible.
Another thought would be to make a copy of the tweet called tweet_copy, strip the spaces and punctuation out of it, search all of its substrings for matches against df_globe['Country'], and then, if a match is found, put the original tweet in df_result along with the matched word.
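For reference, here is a rough sketch of that second idea, simplified to an exact substring check after squashing (a fuzzy variant would slide a scorer over substrings instead), with a hypothetical country list standing in for df_globe['Country']:

```python
import string

def squash(s):
    """Lowercase and drop all whitespace and punctuation."""
    return ''.join(ch for ch in s.lower()
                   if ch not in string.punctuation and not ch.isspace())

# Hypothetical stand-in for df_globe['Country'].
countries = ["South Asia", "Puerto Rico", "Saudi Arabia"]

def find_matches(tweet):
    # e.g. "Say hello to my friend in Puerto Rico" -> "sayhellotomyfriendinpuertorico"
    tweet_copy = squash(tweet)
    # De-space each country name too, then do a plain substring search.
    return [name for name in countries if squash(name) in tweet_copy]

print(find_matches("Say hello to my friend in Puerto Rico"))  # prints ['Puerto Rico']
```

Since the original tweet string is never modified, the untouched text and the matched full country name could both be written into df_result.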
These are some links I have looked at, but none of them have been helpful: ignore spaces in a substring, tokenize without whitespace, tokenize continuous words with no whitespace