Filtering pandas dataframe when column contains strings

Question

I have a dataframe that preexists in this structure:

import pandas as pd
d={'colA':['1','2','3','3','3'],'colB':['NaN','4','2','this','that']}
mydata=pd.DataFrame(data=d)

colA holds integers saved as strings. colB holds only strings, but they are a mix of integers, NaN and real strings.

I want to create a new column (colC) that checks if the integers in colB are greater than the integers in colA. But I can't figure out how to deal with the strings and NaNs.

The final dataframe should look like this:

d={'colA':[1,2,3,3,3],'colB':['NaN',4,2,'this','that'],'colC':['NaN','Yes','No','NaN','NaN']}
mydata_new=pd.DataFrame(data=d)

Thanks

jezrael · Answer 1 · 2022-10-04T08:38:16.340

1

Use to_numeric with errors='coerce' for numeric and compare by Series.gt and Series.le in numpy.select:

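import numpy as np

# Anything that is not a number (including the 'NaN' strings) becomes NaN
s1 = pd.to_numeric(mydata.colA, errors='coerce')
s2 = pd.to_numeric(mydata.colB, errors='coerce')

# Rows that match neither condition (NaN on either side) get the default, None
mydata['colC'] = np.select([s2.gt(s1), s2.le(s1)], ['Yes', 'No'], None)
print(mydata)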
  colA  colB  colC
0    1   NaN  None
1    2     4   Yes
2    3     2    No
3    3  this  None
4    3  that  None
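
To see why the text rows end up without a label, it can help to look at the intermediate values. The sketch below rebuilds the question's frame and only illustrates the same `to_numeric` / `np.select` approach; the names `b_gt_a` and `b_le_a` are introduced here for illustration and are not part of the answer.

```python
import numpy as np
import pandas as pd

mydata = pd.DataFrame({
    "colA": ["1", "2", "3", "3", "3"],
    "colB": ["NaN", "4", "2", "this", "that"],
})

# errors="coerce" turns every value that cannot be parsed as a number into NaN.
s1 = pd.to_numeric(mydata["colA"], errors="coerce")
s2 = pd.to_numeric(mydata["colB"], errors="coerce")
print(s2.tolist())      # [nan, 4.0, 2.0, nan, nan]

# Any comparison involving NaN is False, so NaN rows match neither condition.
b_gt_a = s2.gt(s1)      # hypothetical name: colB > colA
b_le_a = s2.le(s1)      # hypothetical name: colB <= colA
print(b_gt_a.tolist())  # [False, True, False, False, False]
print(b_le_a.tolist())  # [False, False, True, False, False]

# Rows where both conditions are False fall through to the default.
mydata["colC"] = np.select([b_gt_a, b_le_a], ["Yes", "No"], default=None)
print(mydata)
```

Because the default is a real missing value rather than the string "NaN", `mydata["colC"].isna()` can be used afterwards to find the rows that could not be compared.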

edited Oct 04 '22 at 08:38

answered Oct 04 '22 at 08:31


Amazing. Thanks for accounting for my edits. – person Oct 04 '22 at 08:41

score 1 · Answer 2 · answered Oct 04 '22 at 09:02

First, convert the whole DataFrame to strings so that values of different types can be compared consistently, then compare them row by row. One possible solution (note that it tests whether the two values are equal):

import numpy as np

mydata = mydata.astype(str)  # real missing values (np.nan) become the string "nan"
colC = []
i = 0
while i < len(mydata):
    row = mydata.loc[i].values
    if "nan" in row:
        # At least one value in this row is missing
        colC.append(np.nan)
    # Two equivalent tests are possible here. The set-based one is more abstract
    # but faster than comparing the strings directly, i.e.
    # mydata["colA"].values[i] == mydata["colB"].values[i]
    elif len(set(row)) == 1:
        colC.append("Yes")
    else:
        colC.append("No")
    i = i + 1
mydata["colC"] = colC

Result (I changed the value of colB in the second row in order to get one "Yes"):

  colA  colB colC
0    1   nan  NaN
1    2     2  Yes
2    3     2   No
3    3  this   No
4    3  that   No

Zamani Maxime · Accepted Answer · 2022-10-05T12:10:14.120

1

Using apply is a good way of handling this kind of computation:

def compareIntAndStrings(x, y):
    try:
        x = int(x)
        y = int(y)
    except ValueError:
        return "NaN"
    return "Yes" if x < y else "No"

mydata['colC'] = mydata.apply(lambda x: compareIntAndStrings(x['colA'], x['colB']), axis=1)

While this solution is easier to understand and reuse, its performance cost is not negligible. It is also better practice to replace the "NaN" strings with real missing values.
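
One way to act on that last point is to return a real missing value instead of the string "NaN", so that `isna()` works on the result. This is a sketch rather than the answer's own code: the helper name `compare_as_ints` and the extra `TypeError` catch (for inputs such as `None`) are assumptions added here.

```python
import numpy as np
import pandas as pd

def compare_as_ints(a, b):
    """Return "Yes" if int(b) > int(a), "No" if not, NaN if either is not an integer."""
    try:
        a, b = int(a), int(b)
    except (ValueError, TypeError):
        # "NaN", "this", None, float("nan") and similar cannot be converted.
        return np.nan
    return "Yes" if b > a else "No"

mydata = pd.DataFrame({
    "colA": ["1", "2", "3", "3", "3"],
    "colB": ["NaN", "4", "2", "this", "that"],
})

# axis=1 calls the function once per row, which is readable but slow on large frames.
mydata["colC"] = mydata.apply(
    lambda row: compare_as_ints(row["colA"], row["colB"]), axis=1
)

print(mydata["colC"].isna().tolist())  # [True, False, False, True, True]
```

The row-by-row call keeps the logic easy to adapt to other tasks, at the performance cost already mentioned above.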

edited Oct 05 '22 at 12:10

answered Oct 04 '22 at 09:21


Thanks, the other suggestions work too, but I've accepted this as it is the easiest for me to modify for other tasks – person Oct 04 '22 at 10:54

@person - It is a really slow solution, check [this](https://stackoverflow.com/questions/54432583/when-should-i-not-want-to-use-pandas-apply-in-my-code/54432584#54432584) - it is always necessary to avoid loops, and `apply` is a loop under the hood. – jezrael Oct 04 '22 at 13:31
@person Another problem: why `NaN` strings? Testing with functions such as `isna()` is not possible. – jezrael Oct 04 '22 at 13:33
@jezrael Thanks, I edited to take this into account. – Zamani Maxime Oct 05 '22 at 12:10
