3

How can I loop through an entire dataframe to remove data in cells that contain a particular string when the column names are unknown?

Here’s what I have so far:

for (i in colnames(df)){
   df2 = df[~df[i].str.contains('found')]

My data:

  Getting links from: https://www.bar.com/ Getting links from: https://www.boo.com/ Getting links from: https://www.foo.com/
0           ├───OK─── http://www.this.com/           ├───OK─── http://www.this.com/           ├───OK─── http://www.this.com/
1             ├───OK─── http://www.is.com/             ├───OK─── http://www.is.com/             ├───OK─── http://www.is.com/
2         ├─BROKEN─ http://www.broken.com/     2 links found. 0 excluded. 0 broken.         ├─BROKEN─ http://www.broken.com/
3                                      NaN                                      NaN            ├───OK─── http://www.set.com/
4                                      NaN                                      NaN            ├───OK─── http://www.one.com/

How can I delete the entire contents of a cell if it contains a string, for instance, 'found'? I want to delete everything in the cell (including before and after the string.)

Oleg
  • 560
  • 8
  • 19
lane
  • 659
  • 8
  • 24
  • Can you be more specific regarding _delete the entire contents_ , do you want it to contain a None NaN or perhaps an empty string ? – Oleg Dec 31 '18 at 11:19
  • An empty string is the intended result. – lane Dec 31 '18 at 11:20

3 Answers3

4

You can use applymap here:

import re    

# just make sure all values are string type
df = df.astype(str)    

# remove the text which has word found
df.applymap(lambda x: re.sub('.*found.*','',x ))

                                          0
0  Getting links from: https://www.boo.com/
1            ├───OK─── http://www.this.com/
2              ├───OK─── http://www.is.com/
3                                          
4                                       nan
5                                       nan
YOLO
  • 20,181
  • 5
  • 20
  • 40
  • this is crisp. :) +1 – anky Dec 31 '18 at 11:16
  • @YOLO thank you for the suggestion! I'm getting an error: "TypeError: ('expected string or buffer', u'occured at index Getting links from: https://www.bar.com/') Any insight? – lane Dec 31 '18 at 11:32
  • @lane how does the header look like? is that the header ? – YOLO Dec 31 '18 at 11:34
  • @lane when you read the data, make sure you do `header=None`, i edited a bit in the answer. – YOLO Dec 31 '18 at 11:36
1

As you are looking for a string/value to check and take the action against the whole DataFrame hence DataFrame.replace method fits the caveats here ..

Example DataFrame:

>>> df
      a
0  foo1
1  foo2
2   bar
3   bar
4   bar

Replacing bar with blank or you can replace it with NaN Value if desired:

>>> df.replace("bar", "", regex=True)
      a
0  foo1
1  foo2
2
3
4

Or replace bar with NaN

>>> df.replace("bar", np.nan, regex=True)
 # df.replace("bar", np.nan, regex=True, inplace=True)
      a
0  foo1
1  foo2
2   NaN
3   NaN
4   NaN

You can choose to have inplace=True if you like to replace to take place into the actual dataframe:

Simulated the example as given:

>>> df
                                                                                                                         col1
0  Getting links from: https://www.bar.com/ Getting links from: https://www.boo.com/ Getting links from: https://www.foo.com/
1            ├───OK─── http://www.this.com/           ├───OK─── http://www.this.com/           ├───OK─── http://www.this.com/
2              ├───OK─── http://www.is.com/             ├───OK─── http://www.is.com/             ├───OK─── http://www.is.com/
3          ├─BROKEN─ http://www.broken.com/     2 links found. 0 excluded. 0 broken.         ├─BROKEN─ http://www.broken.com/
4                                       NaN                                      NaN            ├───OK─── http://www.set.com/
5                                       NaN                                      NaN            ├───OK─── http://www.one.com/

Result with str.contains:

>>> df[~df["col1"].str.contains("found")]
                                                                                                                         col1
0  Getting links from: https://www.bar.com/ Getting links from: https://www.boo.com/ Getting links from: https://www.foo.com/
1            ├───OK─── http://www.this.com/           ├───OK─── http://www.this.com/           ├───OK─── http://www.this.com/
2              ├───OK─── http://www.is.com/             ├───OK─── http://www.is.com/             ├───OK─── http://www.is.com/
4                                       NaN                                      NaN            ├───OK─── http://www.set.com/
5                                       NaN                                      NaN            ├───OK─── http://www.one.com/

OR as Stated make sure values are string type to apply the operation

>>> df[~df["col1"].astype(str).str.contains("found")]
Karn Kumar
  • 8,518
  • 3
  • 27
  • 53
  • thanks for the suggestion! After running that line, the data is unchanged. I added my data to my post for you to look at. – lane Dec 31 '18 at 11:03
  • As the documentation states it's using `re.sub` under the hood, yet in a more convenient interface in my opinion. @lane have you added `inplace=True` ? or `df = df.replace("bar", "", regex=True)` – Oleg Dec 31 '18 at 11:35
  • @Oleg yes, I've tried both. No errors, but also no change. – lane Dec 31 '18 at 11:37
  • Also, I think the string should be replaced with the regex as mentioned in @YOLO's answer `.*found.*` in order to replace the entire str and not just the exact part that matched – Oleg Dec 31 '18 at 11:40
  • Sorry, was awy from the system, @Oleg rightly said replace method i placed in my answer before the post was re-updated, However, Yolo's answer should do the Job for the updated post. – Karn Kumar Dec 31 '18 at 11:49
  • Even `df[~df["col1"].str.contains("found")]` is working fine for me . – Karn Kumar Dec 31 '18 at 11:50
0

You can replace with nan with df.replace({'test':np.nan}) and that should replace all instances

RyanL
  • 156
  • 1
  • 5