
I want to check whether two words both appear in the same list entry.

For example, I have a word list like this:

word_list = ["I have a dream", "I am a dreamer"]

and a DataFrame named df like this:

df

# word1       word2
# have        dream
# basketball  player

I wrote the following code to do this:

for index, row in df.iterrows():
    text = []
    # Note: .split() only works if word_list is a single string, not a list.
    tokenized = word_list.split()
    for tokenized_word in tokenized:
        if row["word1"] == tokenized_word:
            for tokenized_word in tokenized:
                if row["word2"] == tokenized_word:
                    text.append(word_list)

If the list has many elements and the DataFrame has many rows, this code takes a long time to run. Is there any way to make it faster?

Z.L
  • You should take a look here: https://stackoverflow.com/questions/53979403/search-for-a-value-anywhere-in-a-pandas-dataframe – Panda50 Oct 05 '20 at 15:25
  • Is `if row["word1"] in tokenized and row["word2"] in tokenized: text.append(word_list)` what you are looking for? Pandas has some tools for you too, if you need a faster, more complicated solution. – Niklas Mertsch Oct 05 '20 at 15:31
  • (1) Build a word list. (2) Count how many times each word appears. (3) Flag any "interesting" word that appears more than once. – Prune Oct 05 '20 at 15:40

1 Answer


I would do it like this:

tokens = set(word_list.split())
text = [
    word_list for _, row in df.iterrows() 
    if row["word1"] in tokens and row["word2"] in tokens
]

Since word_list never changes, you only have to build a set out of it once. Every `word in tokens` check after that is constant-time on average, instead of requiring an iteration over the entire list.
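For reference, here is a self-contained sketch of the same idea that you can run as-is. The sample data is an assumption: word_list is taken to be a single string (because the original loop calls word_list.split() on it), and the DataFrame is rebuilt from the two rows in the question. It also swaps iterrows() for zip over the two columns, which is usually cheaper, but that is a substitution on my part rather than something the snippet above requires.

```python
import pandas as pd

# Assumed sample data: word_list is treated as one string here,
# because the original loop calls word_list.split().
word_list = "I have a dream"
df = pd.DataFrame({
    "word1": ["have", "basketball"],
    "word2": ["dream", "player"],
})

# Build the set once; each membership test is then O(1) on average.
tokens = set(word_list.split())

# zip over the two columns avoids the per-row overhead of iterrows().
text = [
    word_list
    for w1, w2 in zip(df["word1"], df["word2"])
    if w1 in tokens and w2 in tokens
]

print(text)
```

Only the first row (have, dream) has both words in the string, so text ends up holding a single copy of word_list.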

Note that I'm not sure if this is actually the list you want to build (the same copy of word_list repeated over and over) but it's what your original loop does. :)
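If what you actually want is, for each word pair, the sentences that contain both words, something along these lines might be closer. This is only a sketch under that assumption: word_list is taken to be a list of sentences, sentences_with_both is a hypothetical helper name, and the pairs are written out by hand here (with the DataFrame from the question you could get them from zip(df["word1"], df["word2"])).

```python
def sentences_with_both(sentences, pairs):
    """Map each (word1, word2) pair to the sentences containing both words."""
    # Tokenize every sentence once, up front, so no sentence is split again
    # for each pair.
    token_sets = [(s, set(s.split())) for s in sentences]
    return {
        (w1, w2): [s for s, toks in token_sets if w1 in toks and w2 in toks]
        for w1, w2 in pairs
    }


# Assumed sample data, taken from the question.
word_list = ["I have a dream", "I am a dreamer"]
pairs = [("have", "dream"), ("basketball", "player")]

matches = sentences_with_both(word_list, pairs)
print(matches)
```

Matching is on whole whitespace-separated tokens, so "dream" does not match "dreamer". Each sentence is split only once, and every lookup after that is a set membership test.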

Samwise