
I import data from a CSV where I am replacing the empty fields with an 'EMPTYFIELD' value.

df = pd.read_csv('myFile.csv', usecols=['AAA', 'BBB', 'CCC'])
df = df.fillna('EMPTYFIELD')

I am trying to create a dataframe that will have all the rows that contain an 'EMPTYFIELD' value. That implies that at least one column contains this value. I used the following, and of course it works:

error = df[df.AAA.str.contains('EMPTYFIELD')]
error = error[error.BBB.str.contains('EMPTYFIELD')]
error = error[error.CCC.str.contains('EMPTYFIELD')] 

Now I am trying to reduce the lines in my code, so I was thinking of using a lambda instead, without referencing the columns (ideal):

error2 = df.apply(lambda x: 'EMPTYFIELD' if 'EMPTYFIELD' in x else x)

#error2 = df.apply(lambda x : any([ isinstance(e, 'EMPTYFIELD') for e in x ]), axis=1) 

and then I tried referencing the columns too:

error2 = df[usecols].apply(lambda x: 'EMPTYFIELD' if 'EMPTYFIELD' in x else x)

and

error2 = df[df[usecols].isin(['EMPTYFIELD'])]

None of the above works. I print the results to a new CSV file, and I can still see all the rows, even the ones that contain the 'EMPTYFIELD' value.
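For reference, here is a minimal sketch of the column-agnostic filter I am after, assuming the fillna above has already replaced every empty cell with the literal string 'EMPTYFIELD' (an element-wise comparison plus any over the rows, so no column names are needed; untested against the full data):

error2 = df[(df == 'EMPTYFIELD').any(axis=1)]  # keep rows where at least one cell equals 'EMPTYFIELD'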

UPD: This is my extended code. Some of the answers return an error, possibly because of the lines below:

varA      = 'AAA'
dfGrouped = df.groupby(varA, as_index=False).agg({'Start Date': 'min', 'End Date': 'max'}).copy()

varsToKeep = ['AAA', 'BBB', 'CCC', 'Start Date_grp', 'End Date_grp' ]
dfTemp = pd.merge(df, dfGrouped, how='inner', on='AAA', suffixes=(' ', '_grp'), copy=True)[varsToKeep]

errors = dfTemp[~np.logical_or.reduce([dfTemp[col].str.contains('EMPTYFIELD') for col in dfTemp])]

Datacrawler
  • "Now, I am trying to **reduct** the rows" What does this mean? – jpp Feb 15 '18 at 16:56
  • Please provide a [**Minimal, Complete, Verifiable** Example](https://stackoverflow.com/help/mcve). – Alex Feb 15 '18 at 16:56
  • If you want to delete the rows with the bad field, why not just do `df[~df.AAA.str.contains('EMPTYFIELD')]`? – pault Feb 15 '18 at 16:56
  • @pault I do not want to reference the column names. AAA is one of the columns. – Datacrawler Feb 15 '18 at 16:57
  • @jp_data_analysis Reduce the code lines. Fixed ;) – Datacrawler Feb 15 '18 at 16:57
  • @pault I could use a for loop over the usecols to do that, but I prefer the code not to refer to the column names. – Datacrawler Feb 15 '18 at 17:03
  • @ApoloRadomer, I'm not sure if there's another way. See this [post](https://stackoverflow.com/a/26641085/5858851). However, if you do not replace the empty values with `"EMPTYFIELD"`, you may be able to use [`dropna()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html). – pault Feb 15 '18 at 17:05
  • @pault `dropna()` is useful, but I also need the opposite of `dropna()`, as I want to present both in two different files. – Datacrawler Feb 15 '18 at 17:08
  • You can use `dropna()` to create a second DataFrame. Then filter the first one using the index of the second. – pault Feb 15 '18 at 17:09

3 Answers


One way is to use np.logical_or.reduce. Here is an example:

import pandas as pd, numpy as np

df = pd.DataFrame([['A', 'B', 'C', 'D'],
                   ['E', 'F', 'G', 'H'],
                   ['G', 'A', 'D', 'I'],
                   ['L', 'K', 'A', 'J'],
                   ['S', 'T', 'U', 'V']],
                  columns=['COL1', 'COL2', 'COL3' ,'COL4'])

df[~np.logical_or.reduce([df[col].astype(str).str.contains('A') for col in df])]

#   COL1 COL2 COL3 COL4
# 1    E    F    G    H
# 4    S    T    U    V
jpp

Here's an illustration of how to use dropna() as I mentioned in the comments:

df = pd.DataFrame(
    {'A': [5,3,5,6], 
     'B': [None, "foo", "bar", "foobar"], 
     'C': ["foo","bar",None, "bat"]
    }
)
no_errors = df.dropna()
errors = df[~(df.index.isin(no_errors.index))]

This results in the following two DataFrames:

print(no_errors)
#   A       B    C
#1  3     foo  bar
#3  6  foobar  bat

print(errors)
#   A     B     C
#0  5  None   foo
#2  5   bar  None

Now if you want, you can call fillna() on the error DataFrame.
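For example (a one-line sketch, reusing the question's 'EMPTYFIELD' convention for the fill value):

errors = errors.fillna('EMPTYFIELD')  # fill the missing cells in the error rows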

pault
  • That works if I first use `dropna()` and then group by before printing. Based on one column (A), let's say we have an ID 11111. We have 5 rows for this ID without empty fields, and then 1 row in which one column (B) has an empty value. I want all 6 rows for this ID to count as errors. – Datacrawler Feb 16 '18 at 11:59
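A minimal sketch of that group-level requirement, assuming the ID lives in column A and the missing values have not yet been filled (names follow the example data in this answer; this is an illustration, not the answerer's code):

has_na = df.isna().any(axis=1)                           # rows with at least one missing value
group_has_na = has_na.groupby(df['A']).transform('any')  # True for every row of an ID that has any such row
errors = df[group_has_na]                                # all rows of a flagged ID count as errors
no_errors = df[~group_has_na]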

As mentioned, using apply (data from jpp's answer):

df[~df.apply(lambda x: x.str.contains('A')).any(1)]
Out[491]: 
  COL1 COL2 COL3 COL4
1    E    F    G    H
4    S    T    U    V
BENY
  • That returned an error: `('Can only use .str accessor with string values, which use np.object_ dtype in pandas', 'occurred at index AAA')`. Before using your line, I group by and then use the following: `pd.merge(df, dfGrouped, how='inner', on='AAA', suffixes=(' ', '_grp'), copy=True)[varsToKeep]` – Datacrawler Feb 16 '18 at 12:13
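A hedged sketch of a fix for that dtype error, borrowing the astype(str) cast from the first answer so the .str accessor also works on non-string columns such as the merged date columns (dfTemp and the 'EMPTYFIELD' search string follow the question's UPD; this is an assumption, not the answerer's code):

mask = dfTemp.apply(lambda x: x.astype(str).str.contains('EMPTYFIELD')).any(axis=1)
errors    = dfTemp[mask]   # rows with at least one 'EMPTYFIELD'
no_errors = dfTemp[~mask]  # rows without any 'EMPTYFIELD'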