1

Having a list of fruits, I want to check if and which if them existing in the data frame (regardless which columns), and indicate them.

import pandas as pd

Fruits = ["Avocado", "Blackberry", "Black Sapote", "Fingered Citron", "Crab Apples", "Custard Apple", "Chico Fruit", "Coconut", "Damson", "Elderberry", "Goji Berry", "Grape", "Guava", "Huckleberry"]

data = {'ID': ["488", "14805", "23591", "470995", "56251", "85964", "5268", "322624", "342225", "380689", "480562", "5623"], 
'Content' : ["Kalo Beruin", "this is Blackberry", "Khara Beruin", "guava and coconut", "Lapha", "Loha Sura", "Matichak", "Miniket Rice", "Mou Beruin", "Moulata", "oh Goji Berry", "purple Grape"],
'Content_1' : ["Jook-sing noodles", "grape", "Lai fun", "Damson", "Liangpi", "Custard Apple and Crab apples", "Misua", "nana Coconut Berry", "Damson", "Paomo", "Ramen", "Rice vermicelli"]}

df = pd.DataFrame(data)
df = df[['ID', 'Content', 'Content_1']]

s = pd.Series(data['Content'])
s_1 = pd.Series(data['Content_1'])

df["found_content"] = s[s.str.contains('|'.join(Fruits))]
df["found_content_1"] = s_1[s_1.str.contains('|'.join(Fruits))]

writer = pd.ExcelWriter('C:\\TEM\\22522.xlsx')
df.to_excel(writer,'Sheet1', index = False)
writer.save()

the problems of the codes are:

  1. Instead of indicating the Fruit, it shows the whole content. for example the row of 14805, it shall be "Blackberry" only instead of the whole original content.
  2. It's case sensitive so missing some findings, like the row of 14805.
  3. I want to use ";" to separate the finding, like the row of 85964.

How can I achieve it? Thank you.

Here's the screenshot of the current output and wanted output.

enter image description here

Mark K
  • 8,767
  • 14
  • 58
  • 118
  • 2
    This is a bit windy, can you please simplify this example if possible? – cs95 Jul 30 '18 at 06:59
  • @coldspeed, thank you for the comment. It was to provide more sample for the testing. I will pay attention to it next time. – Mark K Jul 30 '18 at 08:00

2 Answers2

3

Use str.findall with re.I for ignore uper/lowercase and then join lists by str.join:

import re
#\b for word boundary - general use
pat = r'(\b{}\b)'.format('|'.join(Fruits))
df["found_content"] = df['Content'].str.findall(pat, re.I).str.join(';')
df["found_content_1"] = df['Content_1'].str.findall(pat, re.I).str.join(';')
print (df)
        ID             Content                      Content_1  found_content  \
0      488         Kalo Beruin              Jook-sing noodles                  
1    14805  this is Blackberry                          grape     Blackberry   
2    23591        Khara Beruin                        Lai fun                  
3   470995   guava and coconut                         Damson  guava;coconut   
4    56251               Lapha                        Liangpi                  
5    85964           Loha Sura  Custard Apple and Crab apples                  
6     5268            Matichak                          Misua                  
7   322624        Miniket Rice             nana Coconut Berry                  
8   342225          Mou Beruin                         Damson                  
9   380689             Moulata                          Paomo                  
10  480562       oh Goji Berry                          Ramen     Goji Berry   
11    5623        purple Grape                Rice vermicelli          Grape   

              found_content_1  
0                              
1                       grape  
2                              
3                      Damson  
4                              
5   Custard Apple;Crab apples  
6                              
7                     Coconut  
8                      Damson  
9                              
10                             
11         

Another solution is use title instead re.I:

pat = r'(\b{}\b)'.format('|'.join(Fruits))
df["found_content"] = df['Content'].str.title().str.findall(pat).str.join(';')
df["found_content_1"] = df['Content_1'].str.title().str.findall(pat).str.join(';')
print (df)
        ID             Content                      Content_1  found_content  \
0      488         Kalo Beruin              Jook-sing noodles                  
1    14805  this is Blackberry                          grape     Blackberry   
2    23591        Khara Beruin                        Lai fun                  
3   470995   guava and coconut                         Damson  Guava;Coconut   
4    56251               Lapha                        Liangpi                  
5    85964           Loha Sura  Custard Apple and Crab apples                  
6     5268            Matichak                          Misua                  
7   322624        Miniket Rice             nana Coconut Berry                  
8   342225          Mou Beruin                         Damson                  
9   380689             Moulata                          Paomo                  
10  480562       oh Goji Berry                          Ramen     Goji Berry   
11    5623        purple Grape                Rice vermicelli          Grape   

              found_content_1  
0                              
1                       Grape  
2                              
3                      Damson  
4                              
5   Custard Apple;Crab Apples  
6                              
7                     Coconut  
8                      Damson  
9                              
10                             
11                 
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • thank you sir! however when I applied the codes to another file/data frame, the findings are shown as NaN. I am checking. – Mark K Jul 30 '18 at 07:26
  • sir, could you please advise when it has a match, but shows "NaN"? (I applied the code to an workbook. it has findings but all shows NaN) – Mark K Jul 30 '18 at 07:42
  • 1
    Hard question, it seems data problem. Real data are confidental? – jezrael Jul 30 '18 at 07:44
  • the problem lies in the Excel file itself - when I copy and paste the content of the Excel columns into a data frame, and run the code. It works fine. but when I use "work_file = "C:\\folder\\file.xls", df = pd.read_excel(work_file, sheetname = "Sheet 1"). The exact content shows "NaN". – Mark K Jul 30 '18 at 07:58
  • sir, the problem is, there are Non-ASCII character in the columns of Content and Content_1. Would you be so kind to advise if there's a solution to this problem? Thank you. – Mark K Jul 30 '18 at 08:28
  • 1
    @MarkK - It is python2 ? – jezrael Jul 30 '18 at 08:29
  • 1
    @MarkK - So problem with encoding, really not easy help, because it depends of data. But one idea - how working [this](https://stackoverflow.com/a/30470630) or [this](https://stackoverflow.com/q/42421967) – jezrael Jul 30 '18 at 08:32
  • thank you so much again for pointing the direction. you are marvelous! – Mark K Jul 30 '18 at 08:34
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/177061/discussion-between-mark-k-and-jezrael). – Mark K Jul 31 '18 at 06:13
0

Maybe this:

import pandas as pd

Fruits = ["Avocado", "Blackberry", "Black Sapote", "Fingered Citron", "Crab Apples", "Custard Apple", "Chico Fruit", "Coconut", "Damson", "Elderberry", "Goji Berry", "Grape", "Guava", "Huckleberry"]

data = {'ID': ["488", "14805", "23591", "470995", "56251", "85964", "5268", "322624", "342225", "380689", "480562", "5623"], 
'Content' : ["Kalo Beruin", "this is Blackberry", "Khara Beruin", "guava and coconut", "Lapha", "Loha Sura", "Matichak", "Miniket Rice", "Mou Beruin", "Moulata", "oh Goji Berry", "purple Grape"],
'Content_1' : ["Jook-sing noodles", "grape", "Lai fun", "Damson", "Liangpi", "Custard Apple and Crab apples", "Misua", "nana Coconut Berry", "Damson", "Paomo", "Ramen", "Rice vermicelli"]}

df = pd.DataFrame(data)
df["found_content"] = df['Content'].str.extract('(?P<Fruits>{})'.format("|".join(Fruits)), expand=True).fillna('')
df["found_content_1"] = df['Content_1'].str.extract('(?P<Fruits>{})'.format("|".join(Fruits)), expand=True).fillna('')

writer = pd.ExcelWriter('filename.xlsx')
df.to_excel(writer,'Sheet1', index = False)
writer.save()
U13-Forward
  • 69,221
  • 14
  • 89
  • 114
  • thank you for your help. when the codes applied to the sample, it only picks 1 finding instead of all the findings – Mark K Jul 30 '18 at 07:41