1

I have a customer care call log in an excel sheet. Below is the format of the data i have

So#   Comments
1   sjhsh QUOTE 234566
1   sdsds customer call QUote 239876 Call back
2   adsdfh unknown call from customer QUOTE 189067 sdkjsd woieweio 
3   QUOTE 657894 customer called for service

I am reading this data from excel and need to get 6 digits afetr the text "QUOTE" in each row and then add the extracted digits as a new column

1.The rows might have multiple "QUOTE" mentions 2.The rows might not have "QUOTE"at all

Can someone please help me out with this substring search using python

import pandas as pd
import re
file=pd.read_excel("C:/Users/rkatta/Desktop/Book1.xlsx")
file.set_index('Index', inplace=True, drop=True)
comments=file['InternalComments']
quotenum=[]

keyword= 'QUOTE'
for i in comments:
    try:
        befor_keyowrd, keyword, after_keyword = comments[i].partition(keyword)
        num=after_keyword[:6]
        quotenum.append(num)
    except AttributeError:
        befor_keyowrd, keyword, after_keyword =''
        quotenum.append(after_keyword)
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
reddy
  • 151
  • 3
  • 11
  • 2
    So, if there can be no "Quote" at all, what is the pattern? Just digits? `df['your_cal'].str.extractall(r"(\d+)")`? Or, `df['your_col'].str.findall(r'\d+').apply(','.join)`? – Wiktor Stribiżew Jul 03 '18 at 17:17
  • What's the result you are getting? Will `Quote` always be in all uppercase? Are all the quotes 6 digits or longer? – Sunny Patel Jul 03 '18 at 17:17
  • Can you elaborate on @WiktorStribiżew's question? My answer won't work for you if the rows contain a number that you're trying to match, but no "quote" – emsimpson92 Jul 03 '18 at 17:24
  • if there is no "Quote" at all , it is just a random conversation with the customer and we need NULL values. All the digits extracted and null values(incase of no QUOTE in the text) must be a new column to the data frame – reddy Jul 03 '18 at 17:42
  • Reddy, try `file['newcol'] = file['InternalComments'].str.findall(r'(?i)quote\s+(\d+)').apply(','.join)` (not sure, but probably you may add `.fillna('')`). I doubt you need the whole code part starting with `comments=file['InternalComments']` – Wiktor Stribiżew Jul 03 '18 at 17:59
  • @WiktorStribiżew: Thanks you very much. This things works for me. Pheeww!! But, i have another small issue. If there is space between text and symbol, the numbers aren't extracted. Ex: QUOTE# 234567 (Worked fine) QUOTE # 234567 ( did not work). Request you to please address this issue. this is human entered data and cleaning this stuff. I am new to python too. Thanks in advance – reddy Jul 03 '18 at 19:06

2 Answers2

2

(?i)(?<=QUOTE )\d+ will capture the numbers you're looking for.

(?i) means the rest of the pattern is case insensitive, so it will match "QUote" and any variation of the word.

(?<=QUOTE ) means the numbers will be preceded by the word quote and a space

\d+ is your number

Demo

emsimpson92
  • 1,779
  • 1
  • 9
  • 24
  • OP says *2.The rows might not have "QUOTE"at all*. You can't rely on the `(?<=QUOTE )` lookbehind. – Wiktor Stribiżew Jul 03 '18 at 17:22
  • I assumed that meant if the row didn't contain *QUOTE*, then the row wouldn't contain a number for the quote either. If I am wrong, then let me know and I'll see if I can find a solution. – emsimpson92 Jul 03 '18 at 17:22
  • That is probably a pandas related comment. When extracting text, the rows with no match miss a value and the data frame cannot get updated... Let OP clarify. – Wiktor Stribiżew Jul 03 '18 at 17:25
  • Yes, some rows might not have "QUOTE" at all. In that case, a null value must be pushed and all these extracted digits and null values must be added as a new column to the dataframe "file" – reddy Jul 03 '18 at 17:39
  • Then rather than comparing it to the whole thing, compare it row by row. If there's no match, just add a null value, otherwise add the number. Hope this helps. – emsimpson92 Jul 03 '18 at 17:43
  • my code works for comments[1].partition(keyword). Which means, it works for first row. It dont work when I try to loop it over all the rows in the dataframe column – reddy Jul 03 '18 at 17:44
  • @emsimpson92: Can you please help me with the loop row by row over the pandas dataframe column? – reddy Jul 03 '18 at 17:46
  • I'm unfamiliar with pandas myself, but [it seems like that question has been asked before](https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas) – emsimpson92 Jul 03 '18 at 17:48
1

You need to replace your column manipulation part with the following line:

file['InternalComments'] = file['Comments'].str.findall(r'(?i)quote\s+(\d+)').apply(','.join)

See the regex demo.

The regex matches:

  • (?i) - case insensitive mode
  • quote - a quote substring
  • \s* - 0+ whitespaces
  • (\d+) - Capturing group 1 (what is returned by findall): 1+ digits.

See the Python code demo:

from pandas import DataFrame
import pandas as pd
l = ['sjhsh QUOTE 234566', 'sdsds customer call QUote 239876 Call back', 'adsdfh unknown call from customer QUOTE 189067 sdkjsd woieweio', 'QUOTE 657894 customer called for service', 'QUOTE 657894 customer called for service QUOTE 657894 customer called for service', 'No qte']
file = pd.DataFrame(l, columns=['Comments'])
file['InternalComments'] = file['Comments'].str.findall(r'(?i)quote\s*(\d+)').apply(','.join)
file
                                            Comments InternalComments
0                                 sjhsh QUOTE 234566           234566
1         sdsds customer call QUote 239876 Call back           239876
2  adsdfh unknown call from customer QUOTE 189067...           189067
3           QUOTE 657894 customer called for service           657894
4  QUOTE 657894 customer called for service QUOTE...    657894,657894
5                                             No qte                 
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563