
I have two CSV files, news.csv and dictionary.csv, and I want to match words between them. If a word in news.csv exists in dictionary.csv, the output should be 1. But because dictionary.csv has many terms and each row contains more than one word, I have been unable to match the words correctly.

For example, news.csv contains this text:

     STORY
According to the 2011 National Health and Nutritional Status Survey, 12.4 per cent of the country's citizens have diabetes.

And dictionary.csv contains these terms:

      Terms
Diabetes Mellitus
Diabetes Inspidus

I should get 1 because the word "diabetes" exists in both CSV files, but I don't.

I tried joining all the terms in dictionary.csv with this code:

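import numpy as np
import pandas as pd

news = pd.read_csv("news.csv")
dictionary = pd.read_csv("dictionary.csv")

# Join the dictionary rows into one regex alternation,
# e.g. "Diabetes Mellitus|Diabetes Inspidus"
pattern = '|'.join(dictionary['Terms'])

# 1 if STORY contains any full term, otherwise 0
news["contain diseases1"] = np.where(
    news['STORY'].str.contains(pattern, na=False),
    1, 0
)

news.to_csv("news1.csv")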

But because the code joins the terms in dictionary.csv row by row rather than word by word, I couldn't get the output I wanted. I appreciate any help, thank you.

  • What did you try? Show us some code. And why do you consider those files CSV when they are not? (CSV means comma-separated values.) – Adirio Jan 10 '20 at 14:22
  • `combine = '|'.join(dictionary['Terms'].str.split('\s+', expand=True).stack().unique())` ..? – Chris Adams Jan 10 '20 at 14:28
  • `Diabetes Mellitus` will match exactly that. Do you want to split each term into words on whitespace and search by those? Check out [this](https://stackoverflow.com/a/59681512/9375102) – Umar.H Jan 10 '20 at 14:34
  • @Datanovice yes, I tried using `pattern=' '.join(dictionary['Terms'])`, but was unable to get the output I wanted – strawberrylatte Jan 10 '20 at 14:46
  • @ChrisA I got this error: `re.error: missing ), unterminated subpattern at position 36563` – strawberrylatte Jan 10 '20 at 14:47
  • @ChrisA Solution works for me - do : `words = df2['Terms'].str.split('\s+',expand=True).stack().unique().tolist()` then `df['STORY'].str.contains('|'.join(words),regex=True,case=False)` for your where statement. – Umar.H Jan 10 '20 at 15:10
  • @Datanovice the code works with a smaller CSV file, but with a large CSV file I keep getting this error: `re.error: missing ), unterminated subpattern at position 36563` – strawberrylatte Jan 12 '20 at 08:45
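
The `re.error: missing ), unterminated subpattern` reported in the comments means the joined pattern contains an unbalanced `(`. That would happen if some term in the larger dictionary.csv includes a parenthesis, which is an assumption here since the full file is not shown. Below is a minimal sketch of one possible workaround, assuming the goal is to flag a story when it contains any single word from any term: extract the words with `\w+`, escape them with `re.escape`, and join them between word boundaries. The two sample frames and the `(central` term are hypothetical stand-ins for the real files.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical stand-ins for news.csv and dictionary.csv
news = pd.DataFrame({"STORY": [
    "According to the survey, 12.4 per cent of the country's citizens have diabetes.",
    "The weather was sunny all week.",
]})
dictionary = pd.DataFrame({"Terms": ["Diabetes Mellitus", "Diabetes Insipidus (central"]})

# Break every term into individual words; \w+ drops punctuation such as "("
words = (
    dictionary["Terms"]
    .dropna()
    .str.findall(r"\w+")   # list of words per row
    .explode()             # one word per row
    .dropna()
    .str.lower()
    .unique()
)

# Escape each word defensively and require whole-word matches
pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"

# 1 if STORY contains any dictionary word (case-insensitive), otherwise 0
news["contain diseases1"] = np.where(
    news["STORY"].str.contains(pattern, case=False, na=False, regex=True),
    1, 0
)
```

One caveat with matching single words: if the terms contain common words such as "of" or "the", nearly every story will be flagged, so filtering those words out of `words` first may be necessary.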
