How to get a list of substrings from a dataframe column of strings, based on a list of substrings, in an optimal way?

Question

I have a pandas dataframe column of strings and a list of substrings (phrases). For each string, I want to build a new column that contains only the phrases from the list that occur in that particular string. I can't find an efficient way to do this, and my current approach takes ages.

Here is the function I wrote, which runs on a single string:

def myfunc(text,skills):
    res=[]
    for skill in skills:
        skill2=" "+str(skill)+" "
        if skill2 in text:
            res.append(skill)
    return res

k=myfunc("This is a test text containing .niet network as well as 2008 r2 to find out  f the  f# skills",['.niet','2008 r2','net','f','f#'])
print(k)

The output here should be:

['.niet', '2008 r2', 'f', 'f#']

I created the function above so that I can call it inside pandas.Series.apply(), in order to process every string entry of the dataframe's "description" column.

Example code:

myskillslist=['.niet','2008 r2','net','f','f#']
dev['sample'] = dev['description'].apply(lambda x: myfunc(x,myskillslist))

x represents each document/string, while myskillslist is the list of substrings (a predefined list that doesn't change).
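To make the setup above reproducible, here is a minimal sketch on a small frame. The three rows are made up for illustration (the real "description" data is not shown), and find_skills and padded are hypothetical names. It applies the same space-padded test as myfunc but pads each skill once up front rather than on every call, which may save a little time without changing the result.

```python
import pandas as pd

# Hypothetical sample data; the real "description" column is not shown in the question.
dev = pd.DataFrame({
    "description": [
        "This is a test text containing .niet network as well as 2008 r2 to find out  f the  f# skills",
        "nothing relevant in this one",
        "we use f# and .niet every day",
    ]
})

myskillslist = ['.niet', '2008 r2', 'net', 'f', 'f#']

# Pad every skill with spaces once, instead of on every call.
padded = [(skill, " " + str(skill) + " ") for skill in myskillslist]

def find_skills(text):
    # Same test as myfunc: the skill must have a space on both sides.
    return [skill for skill, needle in padded if needle in text]

dev["sample"] = dev["description"].apply(find_skills)
print(dev["sample"].tolist())
# [['.niet', '2008 r2', 'f', 'f#'], [], ['.niet', 'f#']]
```

Like myfunc, this does not match a skill that sits at the very start or end of a string, because there is no space on that side.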

Any ideas? Is there a better way of doing this? I have searched a lot and wasn't able to come up with a faster solution.

Welcome to Stack Overflow. Kindly share a sample input dataframe and your expected output. Use this as a guide: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples — sammywemmy, Feb 12 '20 at 00:37
Doing something similar to this post to try to let pandas do the heavy lifting might help, but I wouldn't be too surprised if your current solution is about as good as it gets. If you're lucky, pandas might have some native code to quickly finish the search, but you would need to try it to find out. https://davidhamann.de/2017/06/26/pandas-select-elements-by-string/ — Locke, Feb 12 '20 at 00:46
Thank you for your answers. I am going to add things as soon as I find the time today. However, I have already provided a reproducible example in the function call with the desired output. Doing a minor edit now, to show that the second argument to myfunc is actually a constant and immutable list. — aggelos spyratos, Feb 12 '20 at 08:51
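In the spirit of handing the search to compiled code, one direction that could be tried is a single precompiled regular expression built from the whole list, so each string is scanned once rather than once per skill. This is only a sketch: build_matcher and match_skills are hypothetical names, and it assumes no phrase followed by a space is a prefix of another phrase (for example "2008" next to "2008 r2"), since only one phrase is reported per starting position. Whether it is actually faster than the loop depends on the data and would need timing.

```python
import re

def build_matcher(skills):
    # Longest phrases first, so that e.g. "f#" is tried before "f".
    ordered = sorted((str(s) for s in skills), key=len, reverse=True)
    alternation = "|".join(re.escape(s) for s in ordered)
    # The lookahead consumes no characters, so neighbouring matches
    # can share the space between them, as in the substring test.
    pattern = re.compile("(?= (" + alternation + ") )")

    def match_skills(text):
        found = set(pattern.findall(text))
        # Report hits in the order of the skills list, as myfunc does.
        return [s for s in skills if str(s) in found]

    return match_skills

skills = ['.niet', '2008 r2', 'net', 'f', 'f#']
match_skills = build_matcher(skills)

text = "This is a test text containing .niet network as well as 2008 r2 to find out  f the  f# skills"
print(match_skills(text))
# ['.niet', '2008 r2', 'f', 'f#']
```

The returned function takes a single string, so it could be passed directly to apply on the "description" column.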

0 Answers