0

I have a data frame that looks something like this:

NUM   A      B        C      D        E        F
p1    NaN    -1.183   NaN    NaN      NaN      1.829711
p5    NaN    NaN      NaN    NaN      1.267   -1.552721
p9    1.138  NaN      NaN    -1.179   NaN      1.227306

There is always a non-NaN value in: column F and at least one other column A-E.

I want to create a sub-table containing only those rows which contain certain combinations of non-NaN values in columns. There are a number of these desired combinations including doublets and triplets. Here are examples of three such combinations I want to pull:

  1. Rows which contain non-NaN values in columns A & B
  2. Rows which contain non-NaN values in C & D
  3. Rows which contain non-NaN values in A & B & C

I already know about the np.isfinite and pd.notnull commands from this question but I do not know how to apply them to combinations of columns.

Also, once I have a list of commands for removing rows that do not match one of my desired combinations, I do not know how to tell Pandas to remove rows ONLY if they do not match any of the desired combinations.

Community
  • 1
  • 1
Slavatron
  • 2,278
  • 5
  • 29
  • 40
  • It would be nice if you add code to reproduce the data frame for others. That way, we can directly work on answering the question instead of spending time trying to setup an example data frame. – Phani Nov 11 '14 at 20:00
  • I think Taha's answer has the part that I was looking for. Cool. – Phani Nov 11 '14 at 20:03

3 Answers3

5

Many times, we will need to do logical operations on Boolean arrays (either numpy arrays or pandas series) as part of selecting a subset of a dataframe. Using 'and', 'or', 'not' operators for this will not work.

In [79]: df[pd.notnull(df['A']) and pd.notnull(df['F'])]

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In Python, when using 'and', 'or' and 'not' operators, non-boolean variables are usually considered to be True unless they represent "empty" objects like [], int(0), float(0), None etc. So, it would be confusing to use these same operators for doing array-wise Boolean operations in Pandas. Some people would expect them to simply evaluate to True

Instead, we should use &, | and ~for this.

In [69]: df[pd.notnull(df['A']) & pd.notnull(df['F'])]
Out[69]:
  NUM      A   B   C      D   E         F
2  p9  1.138 NaN NaN -1.179 NaN  1.227306

An alternative shorter, but less flexible way to do this is to use any(), all() or empty.

In [78]: df[pd.notnull(df[['A', 'F']]).all(axis=1)]
Out[78]:
  NUM      A   B   C      D   E         F
2  p9  1.138 NaN NaN -1.179 NaN  1.227306

You can read more on this here

Phani
  • 3,267
  • 4
  • 25
  • 50
  • This is very helpful. I can use this to make a new dataframe for each combination-of-interest and then join them all together. – Slavatron Nov 11 '14 at 20:23
2

You can use apply and lambda function where you choose non-Nan value. You can verify if it's Nan value using Numpy.isNan(..).

data="""NUM   A      B        C      D        E        F
p1    NaN    -1.183   NaN    NaN      NaN      1.829711
p5    NaN    NaN      NaN    NaN      1.267   -1.552721
p9    1.138  NaN      NaN    -1.179   NaN      1.227306"""

import pandas as pd
from io import StringIO

df= pd.read_csv(StringIO(data.decode('UTF-8')),delim_whitespace=True )
print df



# Rows which contain non-NaN values in columns A & B
df["A_B"]= df.apply(lambda x: x['A'] if np.isnan(x['B']) else x['B'] if np.isnan(x['A']) else 0, axis=1)

# Rows which contain non-NaN values in C & D
df["C_D"]= df.apply(lambda x: x['C'] if np.isnan(x['D']) else x['D'] if np.isnan(x['C']) else 0, axis=1)

# Rows which contain non-NaN values in A & B & C
df["A_B_C"]= df.apply(lambda x: x['C'] if np.isnan(x['A_B']) else x['A_B'] if np.isnan(x['C']) else 0, axis=1)
print df

# Rows which contain non-NaN values in A & B & C
df["A_B_C_D"]= df.apply(lambda x: x['A_B'] if np.isnan(x['C_D']) else x['C_D'] if np.isnan(x['A_B']) else 0, axis=1)
print df

Output:

  NUM      A      B   C      D      E         F    A_B    C_D  A_B_C
0  p1    NaN -1.183 NaN    NaN    NaN  1.829711 -1.183    NaN -1.183
1  p5    NaN    NaN NaN    NaN  1.267 -1.552721    NaN    NaN    NaN
2  p9  1.138    NaN NaN -1.179    NaN  1.227306  1.138 -1.179  1.138

If you don't need to go through conditional cases, you can check the other way that is explained in the other post.

1

Let's say your dataframe is called df. You can use boolean masks like this.

# Specify column combinations that you want to pull 
combo1 = ['A', 'B'] 

# Select rows in the data frame that have non-NaN values in the combination
# of columns specified above

notmissing = ((df.loc[:, combo1].notnull()))
df = df.loc[notmissing, :] 
svenkatesh
  • 1,152
  • 2
  • 10
  • 25
  • This produced an error message: "ValueError: cannot copy sequence with size 2 to array axis with dimension 716". My dataframe has 716 rows... – Slavatron Nov 11 '14 at 20:00