I want to create lists of synonyms and have Python search for co-locations, but I don't need every word in a list to show up. I just need one word from each list to be present.

I have conversation threads from an online forum in a CSV file. I have been looking for words that co-locate in these threads. For example, if the words "fraud" and "flag" show up in the same post (see code below), then I can grab those posts.

def searchWord(word, string):
    # Return 1 if `word` occurs anywhere in `string` (substring match), else 0.
    if word in string:
        return 1
    else:
        return 0

# Two equivalent ways to build the 'flag' indicator column
df['flag'] = df['Post_Text_lowercase'].apply(lambda x: 1 if 'flag' in x else 0)
df['flag'] = df['Post_Text_lowercase'].apply(lambda x: searchWord('flag', x))

df['flag'].value_counts()

df['fraud'] = df['Post_Text_lowercase'].apply(lambda x: 1 if 'fraud' in x else 0)

# 1 only when both words appear in the same post
df['flag_fraud'] = df['flag'] & df['fraud']

df_flag_fraud = df[df['flag_fraud'] == 1].copy()

df_flag_fraud['Post_Text'].values

However, now I want to search for, say, ['fraud', 'scammer', 'sham', 'user'] and ['flag'], but I only need one of the words in the first list to show up, not all of them. How do I do this? (Mind you, I realize that I can stem or lemmatize, but that doesn't grab exactly what I am looking for.) Since I am a newbie, I realize that my code might not be the best approach here.

Thanks to all. I am learning so much from you.

glongo

1 Answer

So you would like to provide one or more words and return all posts which contain either the word(s), or synonyms of the word(s), you provided.

Finding synonyms of a word automatically is tricky; some have suggested using WordNet. I'll assume the synonyms you need to consider are more limited, and that you're fine with every synonym within a given group being treated as interchangeable, with equal weight.

Then you can approach it with a graph data structure, building one connected group for each set of words with the same meaning. I'm using the implementation of Graph found in this post below. It is probably overkill, but it helps get the point across quickly.
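
The `Graph` class itself is not reproduced in this answer. A minimal stand-in, assuming only the interface the code below relies on (a constructor that takes a list of pairs, an `add_connections` method, and a `_graph` dict mapping each word to a set of neighbours), might look like the following sketch. It is a hypothetical substitute, not the implementation from the linked post.

```python
from collections import defaultdict


class Graph(object):
    """Hypothetical minimal undirected graph; a stand-in, not the linked implementation."""

    def __init__(self, connections=()):
        # Maps each node to the set of nodes it is connected to.
        self._graph = defaultdict(set)
        self.add_connections(connections)

    def add_connections(self, connections):
        """Add an undirected edge for every (node1, node2) pair in `connections`."""
        for node1, node2 in connections:
            self._graph[node1].add(node2)
            self._graph[node2].add(node1)

    def __repr__(self):
        return "{}({})".format(self.__class__.__name__, dict(self._graph))


# Small demonstration with two illustrative synonym groups
scam_words = ["scam", "fraud", "sham"]
flag_words = ["flag", "issue"]
g = Graph([(x, y) for x in scam_words for y in scam_words])
g.add_connections([(x, y) for x in flag_words for y in flag_words])
```

Because every word in a group is paired with every other word (and with itself), looking up any member in `_graph` returns the whole group.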

# Produce a set of synonyms for each idea you're interested in searching for
synonyms = ["scam", "fraud", "sham", "swindle", "hustle", "racket"]

# Create a complete graph which connects each word in
# a group of synonyms to every other with equal weight
synonyms = [(x, y) for x in synonyms for y in synonyms]
print(synonyms)
'''
[('scam', 'scam'),
 ('scam', 'fraud'),
 ('scam', 'sham'),
 ('scam', 'swindle'),
 ('scam', 'hustle'),
 ('scam', 'racket'),
 ('fraud', 'scam'),
 ('fraud', 'fraud'),
 ('fraud', 'sham'),
 ('fraud', 'swindle'),
 ('fraud', 'hustle'),
 ('fraud', 'racket'),
 ('sham', 'scam'),
 ('sham', 'fraud'),
 ('sham', 'sham'),
 ('sham', 'swindle'),
 ('sham', 'hustle'),
 ('sham', 'racket'),
 ('swindle', 'scam'),
 ('swindle', 'fraud'),
 ('swindle', 'sham'),
 ('swindle', 'swindle'),
 ('swindle', 'hustle'),
 ('swindle', 'racket'),
 ('hustle', 'scam'),
 ('hustle', 'fraud'),
 ('hustle', 'sham'),
 ('hustle', 'swindle'),
 ('hustle', 'hustle'),
 ('hustle', 'racket'),
 ('racket', 'scam'),
 ('racket', 'fraud'),
 ('racket', 'sham'),
 ('racket', 'swindle'),
 ('racket', 'hustle'),
 ('racket', 'racket')]
'''
synonyms_graph = Graph(synonyms)
print(synonyms_graph)
'''
Graph({'fraud': set(['fraud', 'scam', 'sham', 'racket', 'hustle', 'swindle']), 'scam': set(['fraud', 'scam', 'sham', 'racket', 'hustle', 'swindle']), 'racket': set(['fraud', 'scam', 'sham', 'racket', 'hustle', 'swindle']), 'swindle': set(['fraud', 'scam', 'sham', 'racket', 'hustle', 'swindle']), 'hustle': set(['fraud', 'scam', 'sham', 'racket', 'hustle', 'swindle']), 'sham': set(['fraud', 'scam', 'sham', 'racket', 'hustle', 'swindle'])})
'''

# Add another network of synonyms for a different topic
synonyms = ["flag", "problem", "issue", "worry", "concern"]
synonyms_graph.add_connections([(x, y) for x in synonyms for y in synonyms])

# Now you can provide any word in a synonym group and get all the other synonyms
print(synonyms_graph._graph["scam"])
'''
set(['fraud', 'scam', 'sham', 'racket', 'hustle', 'swindle'])
'''
print(synonyms_graph._graph["flag"])
'''
set(['flag', 'issue', 'problem', 'worry', 'concern'])
'''

# Apply this to your dataframe
df["Post_Word_Set"] = df["Post_Text_lowercase"].apply(lambda x: set(x.split()))
df["scam"] = df.apply(lambda x: 1 if x.Post_Word_Set.intersection(synonyms_graph._graph["scam"]) else 0, axis=1)
df["flag"] = df.apply(lambda x: 1 if x.Post_Word_Set.intersection(synonyms_graph._graph["flag"]) else 0, axis=1)
posts_of_interest = df[(df.scam.values == 1) | (df.flag.values == 1)].Post_Text_lowercase

print(posts_of_interest)
'''
This makes me worry.  Maybe it's a scam
Big red flag here
So sick of all this fraud
Name: Post_Text_lowercase, dtype: object
'''
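
The filter above keeps a post when either group matches. If the goal is co-location as described in the question (at least one word from every list in the same post), a plain-set version could look like the sketch below. The names `SYNONYM_GROUPS` and `has_one_from_each` are made up for illustration, and the sketch assumes whitespace tokenization on text that is already lowercased.

```python
# Hypothetical synonym groups; replace with your own word lists.
SYNONYM_GROUPS = [
    {"scam", "fraud", "sham", "swindle", "hustle", "racket"},
    {"flag", "problem", "issue", "worry", "concern"},
]


def has_one_from_each(text, groups=SYNONYM_GROUPS):
    """Return 1 if `text` contains at least one word from every group, else 0."""
    words = set(text.split())
    # `words & group` is non-empty (truthy) when the post shares a word with the group.
    return int(all(words & group for group in groups))


posts = [
    "this makes me worry maybe it's a scam",
    "big red flag here",
    "so sick of all this fraud",
]
flags = [has_one_from_each(p) for p in posts]  # [1, 0, 0]
```

On the data frame, the same function could be applied with `df["Post_Text_lowercase"].apply(has_one_from_each)`; rows where the result is 1 are the co-located posts.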

There is definitely some potential to optimize here. You mentioned stemming and lemmatization; those are probably worth incorporating. I'd also consider removing punctuation so you don't miss things like "this is a scam.".
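
One way to handle the punctuation issue is to tokenize with a regular expression instead of `str.split`. The sketch below uses a hypothetical helper, `tokenize`; its pattern keeps runs of letters, digits, and apostrophes, which is an assumption about what should count as a word in your posts.

```python
import re

# Runs of lowercase letters, digits, or apostrophes count as one token.
TOKEN_PATTERN = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase `text` and return its set of tokens with punctuation stripped."""
    return set(TOKEN_PATTERN.findall(text.lower()))


tokens = tokenize("This is a scam.")           # contains "scam"
split_tokens = set("this is a scam.".split())  # contains "scam." instead
```

It could replace the `set(x.split())` call when building the `Post_Word_Set` column.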

conner.xyz
  • Actually, I am trying to figure out how to create a list or set of words such that if any of the words is present in the text, then it is marked as a one. For example, [scam OR fraud OR sham OR stunt]. Does this make sense? Thanks – glongo Jul 12 '16 at 21:40
  • @GinaBoBina I don't understand. I was confused by "I just need one of the words from each list to be present" initially. Now not sure what you mean by "marked as one"... Are you looking for a `set`? `list(set(df_flag_fraud['Post_Text'].values[0]))` or `np.unique(df_flag_fraud['Post_Text'].values)` – conner.xyz Jul 12 '16 at 21:47
  • I am sorry that I was unclear. Okay, so, I have about 2.2 million conversation posts. I want to sample a set of these posts based on how certain words co-locate. For example, post one: "I think that his age is a red flag, and that this is a scam." Based on the words "flag" and "scam", this would be one I want to sample. However, there are several different words for scam, like fraud, sham, etc. So I want to give Python a set of words and say: look for any of these words in my posts. How do I get Python to look for any of the synonyms? Does this make more sense? Thanks – glongo Jul 13 '16 at 15:10
  • Yeah that makes a lot more sense. See the edit above. – conner.xyz Jul 14 '16 at 18:18
  • @GinaBoBina what do you think? – conner.xyz Jul 15 '16 at 01:28
  • I am working on this now, and this seems like it is going to be exactly what I need. Just one question: is the `posts_of_interest` variable a post or a thread? A thread is an initial post and all of its replies, whereas a post is just one reply in a thread. This is great code by the way, so clear! Thanks! – glongo Jul 15 '16 at 14:37
  • `posts_of_interest` is a subset of `df`, a data frame that has a column of strings called "Post_Text_lowercase" (I overlooked the lowercase part). Apply that to your problem as you see fit. You could consider flattening a thread to `df` for example. – conner.xyz Jul 15 '16 at 14:52