How do I filter out multiple columns with a certain string in Python

Question

I'm new to Python and especially to pandas, so I don't really know what I'm doing. I have 10 columns with 100000 rows of 4-letter strings. I need to filter out the rows that don't contain 'DDD' in all of the columns/rows.

I tried to do it with iloc and loc, but it doesn't work:

import pandas as pd
df = pd.read_csv("data_3.csv", delimiter = '!')
df.iloc[:,10:20].str.contains('DDD', regex= False, na = False)
df.head()

It returns me an error: 'DataFrame' object has no attribute 'str'

Possible duplicate of [Search for String in all Pandas DataFrame columns and filter](https://stackoverflow.com/questions/26640129/search-for-string-in-all-pandas-dataframe-columns-and-filter) — Ari Cooper-Davis, Nov 10 '19 at 17:21

Christian Sloper · Answer 1 · 2019-11-10T17:32:46.603

4

I suggest doing it without a for loop like this:

df[df.apply(lambda x: x.str.contains('DDD')).all(axis=1)]

To apply it to the string columns only:

df[df.select_dtypes(include='object').apply(lambda x: x.str.contains('DDD')).all(axis=1)]

To apply it to only some of the string columns:

selected_cols = ['A','B']
df[df[selected_cols].apply(lambda x: x.str.contains('DDD')).all(axis=1)]
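
For a frame like the one in the question, where float columns sit next to the string columns, a minimal sketch might look like the following. The column names and values are made up for illustration, and `regex=False` and `na=False` are carried over from the question's own attempt so that missing values count as non-matches rather than slipping through.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two float columns and three string columns.
df = pd.DataFrame({
    "num_1": [0.1, 0.2, 0.3, 0.4],
    "num_2": [1.0, 2.0, 3.0, 4.0],
    "col_a": ["DDDA", "DDDE", "XDDD", np.nan],
    "col_b": ["ADDD", "DHDD", "DDDJ", "DDDN"],
    "col_c": ["DDDC", "DDDG", "DDDK", "DDDO"],
})

# Only the string columns are tested; the float columns are left alone.
selected_cols = ["col_a", "col_b", "col_c"]

# One boolean column per string column: does the cell contain 'DDD'?
contains = df[selected_cols].apply(
    lambda col: col.str.contains("DDD", regex=False, na=False)
)

# Keep the rows where every selected column matched.
# The result is a new frame; df itself is not modified.
filtered = df[contains.all(axis=1)]
print(filtered)
```

The float columns stay in the result because the boolean mask only decides which rows to keep; it does not restrict which columns are returned.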

edited Nov 10 '19 at 17:32

answered Nov 10 '19 at 17:17

Christian Sloper


I also have another 10 columns which contain various values, and I thought by removing them with df.drop they wouldn't cause any issues, but that's not the case. Can I somehow apply to only those 10 columns with strings? – Goldust34 Nov 10 '19 at 17:30
It returns me only the column names, no values, and if add `df` at the end, it returns me the original, unmodified dataframe – Goldust34 Nov 10 '19 at 17:41
you can maybe check dtypes? check that they are object? – Christian Sloper Nov 10 '19 at 17:42
it tests fine on @Aris example frame below. – Christian Sloper Nov 10 '19 at 17:43
also, this does not modify your original df, if you want that you have to add df = – Christian Sloper Nov 10 '19 at 17:44
columns 11 to 20 are indeed object dtype while columns 1 to 10 are float64 – Goldust34 Nov 10 '19 at 17:48
If they are object this should work, if only column names show there are no row with DDD in all string columns. Maybe there is a string column that shouldn’t have DDD ? – Christian Sloper Nov 10 '19 at 17:51
https://www.dropbox.com/s/e3cftt4z08bkmph/data_3.csv?dl=0 this is the file that I'm using. – Goldust34 Nov 10 '19 at 17:57
i cant find a row with DDD in every column? – Christian Sloper Nov 10 '19 at 18:01
can you give an example row number with that property? – Christian Sloper Nov 10 '19 at 18:03
It doesn't have to be in the same row. It has to be the same as if you would do `df.loc[df['col_20'].str.contains('DDD', na = False)]`, but for every column that has strings. Indexes don't matter. Idk if that was clear enough, ask if you need any more info :D – Goldust34 Nov 10 '19 at 20:37
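
The last comment can be read in more than one way, so the following is only a sketch of two possible readings, using a made-up two-column frame (`col_19` is a hypothetical name; only `col_20` appears in the thread). The first keeps a row if at least one string column contains 'DDD'. The second filters every column on its own, like the single-column `df.loc[...]` call, and ignores row alignment.

```python
import pandas as pd

# Hypothetical data standing in for two of the string columns.
df = pd.DataFrame({
    "col_19": ["DDDA", "ABCD", "EFGH", "DDDX"],
    "col_20": ["IJKL", "MDDD", "NOPQ", "DDDY"],
})

# Boolean frame: True where a cell contains 'DDD'.
contains = df.apply(lambda col: col.str.contains("DDD", regex=False, na=False))

# Reading 1: keep a row if any column contains 'DDD'.
any_match = df[contains.any(axis=1)]

# Reading 2: filter each column independently and collect the
# matching values per column in a dict keyed by column name.
per_column = {name: df.loc[contains[name], name] for name in df.columns}

print(any_match)
print(per_column["col_20"])
```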

score 2 · Answer 2 · answered Nov 10 '19 at 17:28


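You can do this, but only if all of your columns are string columns:

for column in df.columns:
    # ~ negates the mask, so this drops the rows that contain 'DDD';
    # remove the ~ to keep only the rows that contain it
    df = df[~df[column].str.contains('DDD')]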

answered Nov 10 '19 at 17:28

yasi


Ari Cooper-Davis · Answer 3 · 2019-11-10T17:20:51.453

You can use str.contains, but only on a Series, not on a DataFrame. To use it, we look at each column (which is a Series) one at a time by looping over the columns:

>>> import pandas as pd
>>> df = pd.DataFrame([['DDDA', 'DDDB', 'DDDC', 'DDDD'],
...                    ['DDDE', 'DDDF', 'DDDG', 'DHDD'],
...                    ['DDDI', 'DDDJ', 'DDDK', 'DDDL'],
...                    ['DMDD', 'DNDN', 'DDOD', 'DDDP']],
...                   columns=['A', 'B', 'C', 'D'])

>>> for column in df.columns:
...     df = df[df[column].str.contains('DDD')]
...

In our for loop we're overwriting the DataFrame df with df where the column contains 'DDD'. By looping over each column we cut out rows that don't contain 'DDD' in that column until we've looked in all of our columns, leaving only rows that contain 'DDD' in every column.

This gives you:

>>> print(df)
      A     B     C     D
0  DDDA  DDDB  DDDC  DDDD
2  DDDI  DDDJ  DDDK  DDDL

As you're only looping over 10 columns this shouldn't be too slow.

Edit: You should probably do it without a for loop as explained by Christian Sloper as it's likely to be faster, but I'll leave this up as it's slightly easier to understand without knowledge of lambda functions.
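
As a quick check that the two approaches agree, here is a small sketch that runs both on the example frame from this answer and compares the results. The variable names `looped` and `vectorized` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame([['DDDA', 'DDDB', 'DDDC', 'DDDD'],
                   ['DDDE', 'DDDF', 'DDDG', 'DHDD'],
                   ['DDDI', 'DDDJ', 'DDDK', 'DDDL'],
                   ['DMDD', 'DNDN', 'DDOD', 'DDDP']],
                  columns=['A', 'B', 'C', 'D'])

# For-loop version: shrink a copy of the frame one column at a time.
looped = df.copy()
for column in looped.columns:
    looped = looped[looped[column].str.contains('DDD')]

# apply/all version: build the whole boolean mask first, then filter once.
vectorized = df[df.apply(lambda col: col.str.contains('DDD')).all(axis=1)]

# Both keep the rows with index 0 and 2.
print(looped.equals(vectorized))
```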
