I have around 1.3M strings (user requests mailed to the IT Helpdesk) in a pandas DataFrame. I also have a Series of 29,813 names that I want to remove from these strings, so that I am left only with the words that describe the problem. Below is a mini-example of the data. It works, but looping over every name takes far too long on the full dataset. I am looking for a more efficient way to achieve this result:
Input:
import pandas as pd

List1 = ["George Lucas has a problem logging in",
         "George Clooney is trying to download data into a spreadsheet",
         "Bart Graham needs to logon to CRM urgently",
         "Lucy Anne George needs to pull management reports"]
List2 = ["Access Team", "Microsoft Team", "Access Team", "Reporting Team"]
df = pd.DataFrame({"Team": List2, "Text": List1})
xwords = pd.Series(["George", "Lucas", "Clooney", "Lucy", "Anne", "Bart", "Graham"])
# One full pass over the whole column per name: this is the slow part
for word in xwords:
    # Just using ! in the example so one can clearly see the result
    df["Text"] = df["Text"].str.replace(word, "!", regex=False)
Output:
   Team            Text
0  Access Team     ! ! has a problem logging in
1  Microsoft Team  ! ! is trying to download data into a spreadsheet
2  Access Team     ! ! needs to logon to CRM urgently
3  Reporting Team  ! ! ! needs to pull management reports
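For reference, one direction that might avoid a separate pass over the column for each of the 29,813 names is a single pass with a set lookup per word. This is only a sketch: `strip_names` and `placeholder` are names made up for illustration, it assumes each name appears as its own whitespace-separated token (so "George," with attached punctuation would be missed), it collapses runs of whitespace, and it has not been timed on the full data.

```python
import pandas as pd

df = pd.DataFrame({
    "Team": ["Access Team", "Reporting Team"],
    "Text": ["George Lucas has a problem logging in",
             "Lucy Anne George needs to pull management reports"],
})
xwords = pd.Series(["George", "Lucas", "Clooney", "Lucy", "Anne", "Bart", "Graham"])

# A set gives constant-time membership tests, so the cost per string
# no longer grows with the number of names.
names = set(xwords)

def strip_names(text, placeholder="!"):
    """Replace every whitespace-separated token found in `names` with `placeholder`."""
    return " ".join(placeholder if token in names else token
                    for token in text.split())

# One pass over the column instead of one pass per name.
df["Text"] = df["Text"].map(strip_names)
print(df["Text"].tolist())
```

To drop the names entirely rather than mark them, the generator could instead keep only tokens with `if token not in names`.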
I have been searching for an answer for quite some time. If I missed an existing one through lack of experience, please be gentle and let me know!
Many thanks :)