I have a pandas dataframe column of strings and a list of substrings (phrases). For each string, I want to build a new column containing only those substrings from the list that actually occur in that string. I can't find an efficient way to do this; my current approach takes far too long on the full dataframe.
Here is the function I wrote, which runs on a single string:
def myfunc(text, skills):
    res = []
    for skill in skills:
        skill2 = " " + str(skill) + " "
        if skill2 in text:
            res.append(skill)
    return res
k=myfunc("This is a test text containing .niet network as well as 2008 r2 to find out f the f# skills",['.niet','2008 r2','net','f','f#'])
print(k)
The output here should be:
['.niet', '2008 r2', 'f', 'f#']
I wrote the function this way so that I can call it through pandas.Series.apply(), which runs it on every string entry of the dataframe's "description" column.
Example code:
myskillslist = ['.niet', '2008 r2', 'net', 'f', 'f#']
dev['sample'] = dev['description'].apply(lambda x: myfunc(x, myskillslist))
Here x represents each document/string, while myskillslist is the list of substrings (a predefined list that doesn't change).
Any ideas? Is there a better way of doing this? I have searched a lot and wasn't able to come up with a faster solution.
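To make this reproducible, here is a minimal self-contained sketch of the setup. The two rows in `dev` are made-up sample data (my real dataframe is much larger), and the second description is only there to show a case where 'net' matches:

```python
import pandas as pd

def myfunc(text, skills):
    # Keep each skill that occurs in text surrounded by spaces on both sides.
    res = []
    for skill in skills:
        skill2 = " " + str(skill) + " "
        if skill2 in text:
            res.append(skill)
    return res

myskillslist = ['.niet', '2008 r2', 'net', 'f', 'f#']

# Hypothetical sample data standing in for the real "description" column.
dev = pd.DataFrame({
    'description': [
        "This is a test text containing .niet network as well as 2008 r2 to find out f the f# skills",
        "we use the net framework and f# daily",
    ]
})

# One Python-level call per row, each looping over the whole skills list.
dev['sample'] = dev['description'].apply(lambda x: myfunc(x, myskillslist))
print(dev['sample'].tolist())
```

Note that, because of the surrounding spaces, a skill sitting at the very start or end of a string is not matched; that is how the function currently behaves.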