3

I'm new to python and especially to pandas so I don't really know what I'm doing. I have 10 columns with 100000 rows and 4 letter strings. I need to filter out rows which don't contain 'DDD' in all of the columns/rows.

I tried to do it with iloc and loc, but it doesn't work:

import pandas as pd
df = pd.read_csv("data_3.csv", delimiter = '!')
df.iloc[:,10:20].str.contains('DDD', regex= False, na = False)
df.head()

It returns me an error: 'DataFrame' object has no attribute 'str'

David Buck
  • 3,752
  • 35
  • 31
  • 35
Goldust34
  • 31
  • 1
  • 2
  • Possible duplicate of [Search for String in all Pandas DataFrame columns and filter](https://stackoverflow.com/questions/26640129/search-for-string-in-all-pandas-dataframe-columns-and-filter) – Ari Cooper-Davis Nov 10 '19 at 17:21

3 Answers3

4

I suggest doing it without a for loop like this:

df[df.apply(lambda x: x.str.contains('DDD')).all(axis=1)]

To select only string columns

df[df.select_dtypes(include='object').apply(lambda x: x.str.contains('DDD')).all(axis=1)]

To select only some string columns

selected_cols = ['A','B']
df[df[selected_cols].apply(lambda x: x.str.contains('DDD')).all(axis=1)]
Christian Sloper
  • 7,440
  • 3
  • 15
  • 28
2

You can do this but if your all column type is StringType:

for column in foo.columns:
    df = df[~df[c].str.contains('DDD')]
yasi
  • 397
  • 1
  • 4
  • 14
1

You can use str.contains, but only on Series not on DataFrames. So to use it we look at each column (which is a series) one by one by for looping over them:

>>> import pandas as pd
>>> df = pd.DataFrame([['DDDA', 'DDDB', 'DDDC', 'DDDD'],
                       ['DDDE', 'DDDF', 'DDDG', 'DHDD'],
                       ['DDDI', 'DDDJ', 'DDDK', 'DDDL'],
                       ['DMDD', 'DNDN', 'DDOD', 'DDDP']],
                       columns=['A', 'B', 'C', 'D'])

>>> for column in df.columns:
        df = df[df[column].str.contains('DDD')]

In our for loop we're overwriting the DataFrame df with df where the column contains 'DDD'. By looping over each column we cut out rows that don't contain 'DDD' in that column until we've looked in all of our columns, leaving only rows that contain 'DDD' in every column.

This gives you:

>>> print(df)
      A     B     C     D
0  DDDA  DDDB  DDDC  DDDD
2  DDDI  DDDJ  DDDK  DDDL

As you're only looping over 10 columns this shouldn't be too slow.


Edit: You should probably do it without a for loop as explained by Christian Sloper as it's likely to be faster, but I'll leave this up as it's slightly easier to understand without knowledge of lambda functions.

Ari Cooper-Davis
  • 3,374
  • 3
  • 26
  • 43