
I have a dataframe with two columns, both of type str. These columns generally contain numbers, but occasionally hold a word or other non-numeric characters.

Example df

A    B    Name
25   X    R
600  243  B
Z@   650  Y
633  18   G
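
For reference, here is a minimal sketch that reconstructs this example frame (the values are taken from the table above):

import pandas as pd

# reconstruction of the example frame; every column holds strings
df = pd.DataFrame({
    'A':    ['25', '600', 'Z@', '633'],
    'B':    ['X', '243', '650', '18'],
    'Name': ['R', 'B', 'Y', 'G'],
})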

I am trying to filter the dataframe so that I only keep rows in which the value in A OR B is between 600 and 699. I can't cast the columns to int because of the occasional non-number.
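
For example, a direct cast fails as soon as it hits a non-numeric entry (using the df built above):

df['A'].astype(int)
# raises ValueError: invalid literal for int() with base 10: 'Z@'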

I have tried the following code to iterate row by row, convert the number strings to ints, and then filter.

for index, row in df.iterrows():
    try:
        # convert this row's A value in place if it is numeric
        df.at[index, 'A'] = int(row['A'])
    except ValueError:
        pass
    try:
        df.at[index, 'B'] = int(row['B'])
    except ValueError:
        pass

# keep rows whose (now int) A or B value falls in the 600-699 range
a_ok = df['A'].apply(lambda v: isinstance(v, int) and 600 <= v <= 699)
b_ok = df['B'].apply(lambda v: isinstance(v, int) and 600 <= v <= 699)
df = df[a_ok | b_ok]

Result df

A    B    Name
600  243  B
Z@   650  Y
633  18   G

Iterating through each row is very slow, especially when dealing with 100k+ rows in a dataframe. Does anyone have advice on how to do this more efficiently?

MaxB
  • This question has some good information regarding performance when working with pandas dataframes: [link](https://stackoverflow.com/questions/24870953/does-pandas-iterrows-have-performance-issues) – stahamtan Sep 04 '19 at 17:12

1 Answer


Even though some values can't be converted to numeric, it is going to be most efficient to operate on the columns as whole numeric Series, rather than row by row.

Here is a way to convert each desired column to a numeric Series using pd.to_numeric, and then OR the resulting masks together with np.logical_or.reduce. Passing errors='coerce' turns anything non-numeric into NaN, and NaN never satisfies between(), so those rows simply fall out of the mask. This also lets you scale the solution to any number of columns efficiently.


import numpy as np
import pandas as pd

def fn(s, mn, mx):
    # coerce non-numeric entries to NaN, then test the inclusive range;
    # NaN never passes between(), so junk values are excluded automatically
    return pd.to_numeric(s, errors='coerce').between(mn, mx)

# build one boolean mask per column and OR them together in a single step
m = np.logical_or.reduce([fn(df[col], 600, 699) for col in ('A', 'B')])

df.loc[m]

     A    B Name
1  600  243    B
2   Z@  650    Y
3  633   18    G
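
To get a feel for the speedup at the scale mentioned in the question, here is a rough, hypothetical benchmark sketch (the 100k-row frame, the junk rate, and the timing harness are all invented for illustration; actual numbers depend on your machine):

import numpy as np
import pandas as pd
from timeit import timeit

# hypothetical 100k-row frame of numeric strings with occasional junk
rng = np.random.default_rng(0)
n = 100_000
big = pd.DataFrame({
    'A': rng.integers(0, 1000, n).astype(str),
    'B': rng.integers(0, 1000, n).astype(str),
})
big.loc[big.sample(frac=0.01, random_state=0).index, 'A'] = 'Z@'

# time the vectorized mask over the whole frame (fn as defined above)
print(timeit(lambda: np.logical_or.reduce(
    [fn(big[c], 600, 699) for c in ('A', 'B')]), number=10))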
user3483203