Removing rows from a Data Frame column which contains lists if a specific string is within the list

Question

Suppose I have a DataFrame pd with a column called 'elements' which contains a list of a list of objects as shown below:

print(df2['elements'])

0       [Element B, Element Cr, Element Re]
1       [Element B, Element Rh, Element Sc]
2       [Element B, Element Mo, Element Y]
3       [Element Al, Element B, Element Lu]
4       [Element B, Element Dy, Element Os]

I would like to search through the column and if, for example, Element Mo is in that row delete the whole row to look like this:

print(df2['elements'])

0       [Element B, Element Cr, Element Re]
1       [Element B, Element Rh, Element Sc]
2       [Element Al, Element B, Element Lu]
3       [Element B, Element Dy, Element Os]

I'm currently trying to do it with a for loop and if statements like this:

for entry in df2['elements']:
    if 'Element Mo' in entry:
        df2.drop(index=[entry],axis=0, inplace=True)
    else:
        continue

But it is not working and giving me a KeyError: [] not found in axis.

Update:

I just realized that the if and in statement route I showed does not search for exact string matches, but also strings that contain target string, so for example with the updated df below:

print(df2['elements'])

0       [Element B, Element Cr, Element Re]
1       [Element B, Element Rh, Element Sc]
2       [Element B, Element Mo, Element Y]
3       [Element Al, Element B, Element Lu]
4       [Element Mop, Element B, Element Lu]      
5       [Element B, Element Dy, Element Os]

If I run a for loop with if/in statements like this:

for ind in df2.index.values:
    entry = df2.loc[ind, 'elements']
    if 'Element Mo' in entry:
        df2.drop(index=ind ,axis=0, inplace=True)

Both row 2 and 5 will be dropped from the df because the string 'Element Mop' contains the string 'Element Mo', but I don't want this to happen. I tried updating the code above with regex like the one below, but it doesn't work.

for ind in df2.index.values:
        entry = df2.loc[ind, 'elements']
        if '\bElement Mo\b' in entry:
            df2.drop(index=ind ,axis=0, inplace=True)

Edit #2: Here is the dictionary of the first 25 items of the column:

df2_dict = df2['elements'].head(25).to_dict()

{0: '[Element B, Element Cr, Element Re]', 1: '[Element B, Element Rh, Element Sc]', 2: '[Element B, Element Mo, Element Y]', 3: '[Element Al, Element B, Element Lu]', 4: '[Element B, Element Dy, Element Os]', 5: '[Element B, Element Fe, Element Sc]', 6: '[Element B, Element Cr, Element W]', 7: '[Element B, Element Ni]', 9: '[Element B, Element Pr, Element Re]', 10: '[Element B, Element Cr, Element V]', 11: '[Element B, Element Co, Element Si]', 12: '[Element B, Element Co, Element Yb]', 13: '[Element B, Element Lu, Element Yb]', 14: '[Element B, Element Ru, Element Yb]', 15: '[Element B, Element Mn, Element Pd]', 16: '[Element B, Element Co, Element Tm]', 17: '[Element B, Element Fe, Element W]', 19: '[Element B, Element Ru, Element Y]', 20: '[Element B, Element Ga, Element Ta]', 21: '[Element B, Element Ho, Element Re]', 22: '[Element B, Element Si]', 23: '[Element B, Element Ni, Element Te]', 24: '[Element B, Element Nd, Element S]', 25: '[Element B, Element Ga, Element Rh, Element Sc]', 26: '[Element B, Element Co, Element La]'}

The actual issue here is that if I try to drop rows that contain the string 'Element S' (in row 25) all entries with elements like 'Element Sc' or 'Element Si' are also removed.

it would be helpful if you could provide `df.to_dict()` so we could re-create your data exactly. are the elements of your Series lists or strings with literal brackets and commas in the strings? e.g. are we trying to match on `", Element Mo, "` or is the string literal `"Element Mo"` in a list? — Michael Delgado, Sep 08 '22 at 20:42
I added the dictionary of that data frame column, the issue is with 'Element S'. I think it is a string object and not a list actually. — J0chin, Sep 08 '22 at 23:05
ok - @J0chin I updated my answer. It's really helpful if you can try to fully describe the data, including providing a [mre] when asking a question. here's a guide to creating a minimal reproducible example in pandas specifically: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples — Michael Delgado, Sep 08 '22 at 23:19
also, the goal is to have questions and answers be concise and helpful for future readers who come across them, so it's preferred if you can simply edit your question to be more clear/precise rather than appending lots of "Edit" blocks. Thanks! — Michael Delgado, Sep 08 '22 at 23:21

Michael Delgado · Accepted Answer · 2022-09-10T01:50:37.580

A pandas Series is sort of like a dictionary, where the keys are the index and the values are the series values.

So, entry isn't in the index. You could loop over the index, use the index to reference the values, e.g.:

for ind in df2.index.values:
    entry = df2.loc[ind, "elements"]
    if 'Element Mo' in entry:
        df2.drop(index=ind, axis=0, inplace=True)

However, it would be far better to use a vectorized solution. This isn't really possible with a series of lists (this really breaks the pandas data model), but you could at least subset your series once instead of iteratively reshaping. For example:

in_values = df2["elements"].apply(lambda x: "Element Mo" in x)
dropped = df2.loc[~in_values]

Update

After your edits, it looks like we're actually dealing with strings which look like lists! In that case, you're probably looking for a regular expression to make sure you match a complete "Element", bounded by either whitespace, a comma, or a bracket character. Pandas has a number of string methods, and regular expressions may be passed to pd.Series.str.contains with the flag regex=True.

I'll use the following regular expression to match strings preceeded by a [ or ,, as well as any amount of whitespace, then matching on Element Mo, followed by any amount of whitespace and either of the characters ] or ,:

r"(?<=[\[,])\s*Element Mo\s*(?=[,\]])"

Pandas uses the same syntax as the builtin python re module - see that module's documentation for the full mini-language reference.

Applying this as a filter allows us to see the exact matches:

In [12]: df2[df2.str.contains(r"(?<=[\[,])\s*Element Mo\s*(?=[,\]])", regex=True)]
Out[12]:
2    [Element B, Element Mo, Element Y]
dtype: object

Similarly, we can invert the match and exclude any rows matching our filter:

In [13]: df2[~df2.elements.str.contains(r"(?<=[\[,])\s*Element Mo\s*(?=[,\]])", regex=True)]
Out[13]:
0                 [Element B, Element Cr, Element Re]
1                 [Element B, Element Rh, Element Sc]
3                 [Element Al, Element B, Element Lu]
4                 [Element B, Element Dy, Element Os]
5                 [Element B, Element Fe, Element Sc]
6                  [Element B, Element Cr, Element W]
7                             [Element B, Element Ni]
9                 [Element B, Element Pr, Element Re]
10                 [Element B, Element Cr, Element V]
11                [Element B, Element Co, Element Si]
12                [Element B, Element Co, Element Yb]
13                [Element B, Element Lu, Element Yb]
14                [Element B, Element Ru, Element Yb]
15                [Element B, Element Mn, Element Pd]
16                [Element B, Element Co, Element Tm]
17                 [Element B, Element Fe, Element W]
19                 [Element B, Element Ru, Element Y]
20                [Element B, Element Ga, Element Ta]
21                [Element B, Element Ho, Element Re]
22                            [Element B, Element Si]
23                [Element B, Element Ni, Element Te]
24                 [Element B, Element Nd, Element S]
25    [Element B, Element Ga, Element Rh, Element Sc]
26                [Element B, Element Co, Element La]
dtype: object

I tried your first suggestion and it works, only issue I've realized is that it's not searching for exact strings in the list since I'm sing and if in statement. For example in my df2 data Frame above if I look for the string 'Element C' it would drop row 0 since the string "Element Cr" which has all those same characters instead of not dropping any columns. I've tried using \b regex like '\bElement C\b' but this is not working do you have any suggestions? — J0chin, Sep 07 '22 at 22:20
sorry - I really am having a hard time following what you're asking for. Do you want to drop everything with the substring "Element C" in it, including a wildcard match after C? or "Element Mo"? could you update your question to include what you actually want so we can answer it fully? — Michael Delgado, Sep 07 '22 at 23:39
I updated my question, hopefully that clears up the confusion. — J0chin, Sep 08 '22 at 20:20
Okay thanks for the help. I am having trouble implementing your latest suggestion here, would I be able implement the Series.str.contains method with the for loop or should I use the second vectorized solution? — J0chin, Sep 09 '22 at 21:55
you should be able to run my code straight through (no loop). the vectorized method is definitely the one I'd recommend. — Michael Delgado, Sep 09 '22 at 22:26
Hmm okay, when I try to run the two lines I am getting the "NameError: name 's' is not defined" — J0chin, Sep 09 '22 at 23:09
I called my series "s" - I think you call it df2. edited to use your name (though a series isn't usually called df by convention) — Michael Delgado, Sep 09 '22 at 23:22
Okay sorry, but I am still confused then so how do we then drop the rows? because when I run it df2, it is giving me the "AttributeError: 'DataFrame' object has no attribute 'str'". and I tried running it on the column df2.elements, which runs without errors, but none of the rows are dropped. — J0chin, Sep 10 '22 at 00:43
updated to use `df2.elements`. you should be able to run the exact line I ran in `In [13]:`. if you want to permanently drop those rows from the dataframe, just assign the result back to the dataframe, e.g. `df2 = df2[~df2.elements.str.contains(...)]` — Michael Delgado, Sep 10 '22 at 01:52
Oh right that last part did the trick, thanks again so much for your help! — J0chin, Sep 10 '22 at 02:31

Naveed · Answer 2 · 2022-09-11T18:56:12.760

1

here is one way to do it

string='Element Mo'

df[df['col1'].apply(lambda x: string not in x)]

col1
0   [Element B, Element Cr, Element Re]
1   [Element B, Element Rh, Element Sc]
3   [Element Al, Element B, Element Lu]
4   [Element B, Element Dy, Element Os]

I tried with the revised data sample you shared and this solution worked as expected

edited Sep 11 '22 at 18:56

answered Sep 07 '22 at 03:06

Naveed

11,495
2
14
21

Hey I updated my question, but I just figured the string that it searches for in the column to be an exact match. – J0chin Sep 08 '22 at 20:35
@J0chin, I'm not following you, do you want **Element Mop** to be filtered out? I just tried with your new data and it is an exact match. I also added the screenshot of my notebook. – Naveed Sep 09 '22 at 00:43
Hi @Naveed - [please don't upload images of data, code, errors, or outputs](//meta.stackoverflow.com/q/285551/). You can print a dataframe and paste the string outputs as a text block. – Michael Delgado Sep 09 '22 at 22:28
@MichaelDelgado, my solution was all text, the revision to the question by OP was confusing , so I added the screenshot to proof that solution provided above as text did worked even for his revision scenario. I’m not in disagreement with you though, and appreciate you taking time to point it out. Regards – Naveed Sep 09 '22 at 22:52
yep - I see what you were doing. but it's still the case that a big part of your answer is not searchable or readable by people with visual impairments. – Michael Delgado Sep 09 '22 at 23:25

Removing rows from a Data Frame column which contains lists if a specific string is within the list

2 Answers2

Update