1

I have a df:

df =
     c1  c2   c3   c4  c5
  0  K   6    nan  Y   V
  1  H   nan  g    5   nan
  2  U   B    g    Y   L

And a string

s = 'HKg5'

I want to return rows where s[0]=value of c1, s[1]=value of c2, ..... + in some cases where s[i]=nan.

For example, row 1 in df above matches with the string

    row 1=
           c1  c2   c3   c4  c5
        1  H   nan  g    5   nan
                                                match=True,   regardless of s[1,4]=nan
     s   = H   K    g    5

And also the string length is dynamic, so my df cols go above c10

I am using df.apply but I can't figure it out clearly. I want to write a function to pass to df.apply, passing the string at the same time.

Thanks for any help!

Output from Chris's answer

  df=  
        c1  c2  c3  c4  c5 
     0  K   6  NaN  Y   V
     1  H  NaN  g   5  NaN
     2  U   B   g   Y   L

  s = 'HKg5'
  s1 = pd.Series(list(s), index=[f'c{x+1}' for x in range(len(s))])
  df.loc[((df == s1) | (df.isna())).all(1)]

Output

  `c1  c2  c3  c4  c5`
pnna
  • 113
  • 1
  • 6

1 Answers1

2

Create a helper Series from your string and use boolean logic to filter:

s1 = pd.Series(list(s), index=[f'c{x+1}' for x in range(len(s))])

# print(s1)    
# c1    H
# c2    K
# c3    g
# c4    5
# dtype: object

Logic is df equals (==) this value OR (|) is nan (isna)
Use all along axis 1 to return rows where all values are True

df.loc[((df == s1) | (df.isna())).all(1)]

[out]

  c1   c2 c3 c4   c5
1  H  NaN  g  5  NaN

So, as a function, you could do:

def df_match_string(frame, string):
    s1 = pd.Series(list(string), index=[f'c{x+1}' for x in range(len(string))])
    return ((frame == s1) | (frame.isna())).all(1)

df_match_string(df, s)

[out]

0    False
1     True
2    False
dtype: bool

Update

I can't reproduce your issue with the example provided. My guess is that some of the values in your DataFrame may have leading/trailing whitespace?

Before trying the above solution, try this preprocessing step:

for col in df:
    df[col] = df[col].str.strip()
Chris Adams
  • 18,389
  • 4
  • 22
  • 39
  • Hello, I have a bit issue here with a warning: -----> FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison result = method(y) . Return is not correct – pnna Mar 04 '20 at 06:02
  • Looks like just a warning, caused by a bug in `numpy` check this answer [here](https://stackoverflow.com/a/46721064/10201580). If return is not correct, can you provide a reproducible example of a line that is not working as expected - with expected outcome? – Chris Adams Mar 04 '20 at 08:32
  • Please check I edited my original post and included your output, it returns an empty df @Chris A – pnna Mar 05 '20 at 01:54
  • I can't replicate the issue, code is working for me with this example. Only thing I can think is maybe some of you columns have leading or trailing whitespace..? for eg the value in row `1` at `c1` is actually `"H "` (notice the space after H) – Chris Adams Mar 05 '20 at 07:19