I have two DataFrames. The first DataFrame, df_globe, looks like this:
Country Region
0 Andorra Europe
1 Andorran Europe
2 Andorrano Europe
3 United Arab Emirates MENA
4 Saudi Arabia MENA
5 u.a.e MENA
6 Democratic People's Republic of Korea East Asia
7 Puerto Rico Americas
.. ... ...
539 Americas Americas
540 MENA MENA
541 South Asia South Asia
542 Sub-Saharan Africa Sub-Saharan Africa
543 Pacific Pacific
The second DataFrame, df_tweets, looks like this:
sentiment id date text
0 0 1598814115664994307 2022-12-02 22:59:24+00:00 I think it is in South Asia
1 0 1598814115664994307 2022-12-02 22:59:24+00:00 Say hello to my friend in Puerto Rico
import pandas as pd
from nltk.tokenize import word_tokenize
from fuzzywuzzy import fuzz  # or: from thefuzz import fuzz

df_globe = pd.read_csv('fuzzyCountriesAndRegions.csv')
df_tweets = pd.read_csv('tweets.csv')
df_result = pd.DataFrame(columns=['sentiment', 'id', 'date', 'text'])
print(df_globe)
print(df_tweets)

word_list = []
for tweetIndex in range(len(df_tweets['text'])):
    word_tokens = word_tokenize(df_tweets['text'][tweetIndex])
    for token in word_tokens:
        for name in df_globe['Country']:
            ratio = fuzz.token_set_ratio(token, name)
            if ratio >= 90:
                # DataFrame.append is removed in pandas 2.x; concat is equivalent
                df_result = pd.concat([df_result, df_tweets.iloc[[tweetIndex]]])
                word_list.append(token)
df_result['word'] = word_list
print(df_result)
df_result looks like this when I run my code:
sentiment id date text word
0 0 1598814115664994307 2022-12-02 22:59:24+00:00 I think it is in South Asia South
0 0 1598814115664994307 2022-12-02 22:59:24+00:00 I think it is in South Asia South
0 0 1598814115664994307 2022-12-02 22:59:24+00:00 I think it is in South Asia South
0 0 1598814115664994307 2022-12-02 22:59:24+00:00 I think it is in South Asia South
0 0 1598814115664994307 2022-12-02 22:59:24+00:00 I think it is in South Asia South
0 0 1598814115664994307 2022-12-02 22:59:24+00:00 I think it is in South Asia South
0 0 1598814115664994307 2022-12-02 22:59:24+00:00 I think it is in South Asia South
0 0 1598814115664994307 2022-12-02 22:59:24+00:00 I think it is in South Asia South
0 0 1598814115664994307 2022-12-02 22:59:24+00:00 I think it is in South Asia Asia
0 0 1598814115664994307 2022-12-02 22:59:24+00:00 I think it is in South Asia Asia
0 0 1598814115664994307 2022-12-02 22:59:24+00:00 I think it is in South Asia Asia
1 0 1598814115664994307 2022-12-02 22:59:24+00:00 Say hello to my friend in Puerto Rico Puerto
1 0 1598814115664994307 2022-12-02 22:59:24+00:00 Say hello to my friend in Puerto Rico Puerto
1 0 1598814115664994307 2022-12-02 22:59:24+00:00 Say hello to my friend in Puerto Rico Puerto
1 0 1598814115664994307 2022-12-02 22:59:24+00:00 Say hello to my friend in Puerto Rico Rico
1 0 1598814115664994307 2022-12-02 22:59:24+00:00 Say hello to my friend in Puerto Rico Rico
But df_result should look like this (example of desired output):
sentiment id date text word
0 0 1598814115664994307 2022-12-02 22:59:24+00:00 I think it is in South Asia South Asia
1 0 1598814115664994307 2022-12-02 22:59:24+00:00 Say hello to my friend in Puerto Rico Puerto Rico
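The repeated rows happen because each single-word token scores a full match against every country name that contains it under token-set scoring. A minimal, self-contained sketch of that effect, using a simplified stand-in for fuzz.token_set_ratio and a hypothetical country list (not the real contents of fuzzyCountriesAndRegions.csv):

```python
# Simplified stand-in for fuzz.token_set_ratio: score 100 when one string's
# tokens are a subset of the other's, which is what makes "South" a 100%
# match for "South Asia".
def token_set_score(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return 100 if ta <= tb or tb <= ta else 0

# Hypothetical country list, standing in for df_globe['Country'].
countries = ["South Asia", "South Africa", "South Sudan", "Puerto Rico"]

# Matching token by token, as in the loop above: every single-word token that
# appears in a multi-word name scores a full match, so one row gets appended
# per (token, country) pair -- hence the repeated rows.
hits = [(tok, name)
        for tok in "I think it is in South Asia".split()
        for name in countries
        if token_set_score(tok, name) >= 90]
print(hits)
```

With this toy list, "South" alone produces three hits and "Asia" a fourth, which mirrors the duplication in the actual output.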
I've thought of removing all the spaces, performing the matching, and then retokenizing the tweets, but I haven't found a way to retokenize them once the spaces are gone, so I don't think that approach is feasible.
Another thought would be to make a copy of the tweet called tweet_copy, strip the spaces and punctuation out of it, search all of its substrings for matches against df_globe['Country'], and then, if a match is found, put the original tweet in df_result along with the matched word.
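For reference, here is a rough sketch of that second idea, simplified to an exact substring check after squashing (a fuzzy variant would slide a scorer over substrings instead), with a hypothetical country list standing in for df_globe['Country']:

```python
import string

def squash(s):
    """Lowercase and drop all whitespace and punctuation."""
    return ''.join(ch for ch in s.lower()
                   if ch not in string.punctuation and not ch.isspace())

# Hypothetical stand-in for df_globe['Country'].
countries = ["South Asia", "Puerto Rico", "Saudi Arabia"]

def find_matches(tweet):
    # e.g. "Say hello to my friend in Puerto Rico" -> "sayhellotomyfriendinpuertorico"
    tweet_copy = squash(tweet)
    # De-space each country name too, then do a plain substring search.
    return [name for name in countries if squash(name) in tweet_copy]

print(find_matches("Say hello to my friend in Puerto Rico"))  # prints ['Puerto Rico']
```

Since the original tweet string is never modified, the untouched text and the matched full country name could both be written into df_result.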
These are some links I have looked at, but none of them have been helpful: ignore spaces in a substring, tokenize without whitespace, tokenize continuous words with no whitespace