0

I have around 1.3M strings (representing user requirements when they mail the IT Helpdesk) in a Pandas df. I also have a series of 29,813 names that I want to remove from these strings so that I am only left with words that describe the problem. Here is a mini-example of the data - it works, but it takes way too long. I am looking for a more efficient way to achieve this result:

Input:

List1 = ["George Lucas has a problem logging in", 
         "George Clooney is trying to download data into a spreadsheet", 
         "Bart Graham needs to logon to CRM urgently", 
         "Lucy Anne George needs to pull management reports"]
List2 = ["Access Team", "Microsoft Team", "Access Team", "Reporting Team"]

df = pd.DataFrame({"Team":List2, "Text":List1})

xwords = pd.Series(["George", "Lucas", "Clooney", "Lucy", "Anne", "Bart", "Graham"])

for word in range(len(xwords)):
    df["Text"] = df["Text"].str.replace(xwords[word], "! ")

# Just using ! in the example so one can clearly see the result

Output:

Team                Text
0   Access Team     ! ! has a problem logging in
1   Microsoft Team  ! ! is trying to download data into a spreadsheet
2   Access Team     ! ! needs to logon to CRM urgently
3   Reporting Team  ! ! ! needs to pull management reports

I have tried to find the answer for quite some time: if I missed it somewhere due to lack of experience please just be gentle and let me know!

Many thanks :)

LisaM
  • 21
  • 1
  • 4
  • [this answer has given timeit for replace using forloop, map etc.](https://stackoverflow.com/a/30461672/7053679) – Nihal Jul 09 '18 at 14:24
  • Have you checked out the `replace` method for Pandas dataframes? https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html – Joel Jul 09 '18 at 14:31
  • @LisaM does the linked question for speeding up help you? I can craft a Pandas-specific solution if you want – Ciprian Tomoiagă Jul 09 '18 at 15:42
  • Hi @ciprian-tomoiag - am busy digesting the recommended "speed up millions" info - will let you know if I can find a way to make one of the suggestions work for me, it looks entirely on-topic, thanks! – LisaM Jul 09 '18 at 18:12

3 Answers3

2

Thanks Ciprian Tomoiagă for pointing me to the post Speed up millions of regex replacements in Python 3. The option provided by Eric Duminil, see "Use this method (with set lookup) if you want the fastest solution", works equally well in the Pandas environment with series instead of lists - example code for this problem repeated below, and on my large dataset the entire thing completed in 2.54 seconds!

Input:

import re

banned_words = set(word.strip().lower() for word in xwords)

def delete_banned_words(matchobj):
    word = matchobj.group(0)
    if word.lower() in banned_words:
        return ""
    else:
        return word

sentences = df["Text"]

word_pattern = re.compile('\w+')

df["Text"] = [word_pattern.sub(delete_banned_words, sentence) for sentence in sentences]
print(df)

Output:

Team              Text
Access Team       has a problem logging in
Microsoft Team    is trying to download data into a spreadsheet
Access Team       needs to logon to CRM urgently
Reporting Team    needs to pull management reports
LisaM
  • 21
  • 1
  • 4
0

I'd recommend tokenizing the texts and using a set for the names:

xwords = set(["George", "Lucas", ...])
df["Text"] = ' '.join(filter(lambda x: x not in xwords, df["Text"].str.split(' ')))

Depending on your strings the tokenization would need to be more elaborate than just splitting on spaces.

There might be a pandas specific ways to do this but I have little experience with that ;)

  • I couldn't find a split() method on a df :) – LisaM Jul 09 '18 at 14:58
  • yeah, I forgot to add the `.str`. I will edit it. It probably has to be converted back into a DataFrame afterwards. Again, pandas not my expertise, sorry. But the underlying principle still holds. – Sebastian Bickerle Jul 09 '18 at 16:05
0

pandas.Series.str.replace can take a compiled regex as a pattern

import re
patt = re.compile(r'|'.join(xwords))
df["Text"] = df["Text"].str.replace(patt, "! ")

Maybe this could help? I don't have experience with such a long regular expressions, though.

karla
  • 369
  • 2
  • 8
  • Apologies, I was cautiously optimistic, but gave up after 10 minutes! It does the job (and thank you for showing me how to compile a long regex!) but still quite slow... – LisaM Jul 09 '18 at 15:17