Filter rows based on contained strings then compare two columns in Python

Question

Given a toy dataset as follows:

   id room_type company_name
0   1    office      ABC ltd
1   2    office       retail
2   3    office      xyz ltd
3   4    retail       retail
4   5   parking    toy store
5   6      hall          NaN

If the room_type or company_name column contains retail, parking or hall, compare the two columns; if they are not the same, return a new column check with the string Invalid company name or room type.

I would like to use code as follows since there are many other columns to check:

a = np.where(df['room_type'].str.contains('retail|parking|hall', na = False), 'Invalid company name or room type', None)

# b = np.where(df.area.str.contains('^\d+$', na = True), None,
#                                  'area is not a numbers')  
f = (lambda x: ';'.join(y for y in x if pd.notna(y)) 
                if any(pd.notna(np.array(x))) else np.nan )
df['check'] = [f(x) for x in zip(a)]

The expected result looks like this:

   id room_type company_name                              check
0   1    office      ABC ltd                                NaN
1   2    office       retail  Invalid company name or room type
2   3    office      xyz ltd                                NaN
3   4    retail       retail                                NaN
4   5   parking    toy store  Invalid company name or room type
5   6      hall          NaN  Invalid company name or room type

How could I modify the code for condition a? Thanks in advance for your help.

score 1 · Accepted Answer · edited Nov 10 '20 at 09:26

Use Series.str.cat to join both columns and test the result for substrings; for the second condition, compare the columns with Series.ne (not equal); finally, chain both conditions with & for bitwise AND:

m1 = (df['room_type'].str.cat(df['company_name'], sep=' ', na_rep='')
                     .str.contains('retail|parking|hall', na = False))
m2 = df['room_type'].ne(df['company_name'])

df['check'] = np.where(m1 & m2, 'Invalid company name or room type', None)
print(df)
   id room_type company_name                              check
0   1    office      ABC ltd                               None
1   2    office       retail  Invalid company name or room type
2   3    office      xyz ltd                               None
3   4    retail       retail                               None
4   5   parking    toy store  Invalid company name or room type
5   6      hall          NaN  Invalid company name or room type
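For reference, here is a self-contained sketch that rebuilds the toy data from the question and feeds the mask into the join step the question uses for combining several checks. The helper name join_messages and the variable pattern are made up for this illustration; they are not part of pandas or of the original code. Because the joined result uses np.nan for rows without any message, the check column shows NaN as in the expected output rather than None.

```python
import numpy as np
import pandas as pd

# Toy data rebuilt from the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "room_type": ["office", "office", "office", "retail", "parking", "hall"],
    "company_name": ["ABC ltd", "retail", "xyz ltd", "retail", "toy store", np.nan],
})

pattern = "retail|parking|hall"  # hypothetical name for the keyword regex

# m1: either column contains one of the keywords (missing values become '')
m1 = (df["room_type"].str.cat(df["company_name"], sep=" ", na_rep="")
                     .str.contains(pattern, na=False))
# m2: the two columns differ (a missing value counts as different)
m2 = df["room_type"].ne(df["company_name"])

a = np.where(m1 & m2, "Invalid company name or room type", None)


def join_messages(messages):
    """Join the non-missing messages of one row with ';' (hypothetical helper)."""
    kept = [m for m in messages if pd.notna(m)]
    return ";".join(kept) if kept else np.nan


# More condition arrays (b, c, ...) could be passed to zip alongside a
df["check"] = [join_messages(row) for row in zip(a)]
print(df)
```

Rows 1, 4 and 5 get the message, and the remaining rows stay missing.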

edited Nov 10 '20 at 09:26

ah bon

answered Nov 10 '20 at 06:46

jezrael

Thanks. btw, is there a way to highlight the problematic cells and save the dataframe as excel file? – ah bon Nov 10 '20 at 09:13

@ahbon - no problem. E.g. use [this](https://stackoverflow.com/a/51175719/2901002) and change `m = x['config_size_x'] != x['config_size_y']` to `m = x['check'].isna()` and then `df1.loc[m, ['config_size_x', 'config_size_y']] = c1` to `df1.loc[m, :] = c1` (maybe `:` should be omitted) – jezrael Nov 10 '20 at 09:15

It would be more appropriate for me to post a new question; maybe this link will be helpful: https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html – ah bon Nov 10 '20 at 09:28

@ahbon - It is up to you ;) – jezrael Nov 10 '20 at 09:30
I posted here: https://stackoverflow.com/questions/64766324/highlight-dataframe-cells-based-on-multiple-conditions-in-python – ah bon Nov 10 '20 at 09:41
