1

I created a DataFrame using the compare function to compare two datasets from two different years. If the values in the sub-indexed columns (2021, 2023) for the main column (e.g. company A) are equal, both of the columns will display NaN.

From here I want to select a main column (e.g. company A), and only display the rows where one or both of the values in the sub-indexed columns (e.g. 2021, 2023 for company A) are not Null. (If both are Null, do not display.)

cols = pd.MultiIndex.from_tuples([('company A', '2021'), ('company A', '2023'), ('company B', '2021'), ('company B', '2023')])
df = pd.DataFrame([[np.nan,np.nan,np.nan,np.nan], [3,4,np.nan,np.nan], [np.nan,np.nan,np.nan,6]], columns=cols)
df

  company A      company B     
       2021 2023      2021 2023
0       NaN  NaN       NaN  NaN
1       3.0  4.0       NaN  NaN
2       NaN  NaN       NaN  6.0

Desired DataFrame:

  company A      company B     
       2021 2023      2021 2023
1       3.0  4.0       NaN  NaN
Ciara Spencer
  • 111
  • 2
  • 11
  • 1
    [Please do not upload images of code/data/errors.](//meta.stackoverflow.com/q/285551) You don't _have_ to offer an [MRE](https://stackoverflow.com/help/minimal-reproducible-example), but if you do you're more likely to get high-quality answers. You might follow [this](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) advice. – J_H Aug 02 '23 at 02:21
  • @J_H The thing is, I didn't create this DataFrame using multiIndex, it was created using [pandas.DataFrame.compare](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.compare.html) - and my question is on how to reference these sub-indexed values. If I knew how to write an MRE to create the DataFrame for this, my question wouldn't exist. Also, the only seemingly viable solution included in the link you provided in reference to displaying multi-indexed data, says to use set_index to recreate the DataFrame, which again, circumvents the purpose of my question. – Ciara Spencer Aug 02 '23 at 03:09
  • @CiaraSpencer the thing that J_H is saying is basically, if you are comparing df1 and df2, include the code for creating df1 and df2. df.to_dict() might be useful – Mark Aug 02 '23 at 03:25
  • also, when you say you want to select a main column, it would be great if you could specify which column that would be – Mark Aug 02 '23 at 03:26
  • 1
    `df['Veteran'].dropna(how='all')`? – mozway Aug 02 '23 at 03:36
  • You chose not to follow the [advice](https://stackoverflow.com/a/20159305) on how to put DataFrame contents into a question for use by SO contributors. Your decision makes me sad. – J_H Aug 02 '23 at 03:49
  • @J_H, the advice is to not use multi-index when including an example DataFrame, which is relevant to my question. I edited to add example DataFrame content mimicking the output of the compare function I used. – Ciara Spencer Aug 02 '23 at 03:59
  • @CiaraSpencer your reproducible example could easily set the multi-index as a second step. – Paul H Aug 02 '23 at 04:15

4 Answers4

1

@mozway suggests that you accomplish your goal with .dropna in this way:

>>> cols = pd.MultiIndex.from_tuples(
...     [
...         ("company A", "2021"),
...         ("company A", "2023"),
...         ("company B", "2021"),
...         ("company B", "2023"),
...     ]
... )
>>> df = pd.DataFrame(
...     [
...         [np.nan, np.nan, np.nan, np.nan],
...         [3,      4,      np.nan, np.nan],
...         [np.nan, np.nan, np.nan, 6],
...     ],
...     columns=cols,
... )
>>> df
  company A      company B     
       2021 2023      2021 2023
0       NaN  NaN       NaN  NaN
1       3.0  4.0       NaN  NaN
2       NaN  NaN       NaN  6.0
>>> 
>>> df['company A'].dropna(how='all')
   2021  2023
1   3.0   4.0
>>> 
>>> df['company B'].dropna(how='all')
   2021  2023
2   NaN   6.0
J_H
  • 17,926
  • 4
  • 24
  • 44
  • This works.. However, my goal is to retain the remaining columns as well while filtering rows for the selected main column. I have added to this answer. – Ciara Spencer Aug 02 '23 at 05:29
1

Maybe you can try .loc function in python, by taking out the index after dropna

df.loc[df['company B'].dropna(how="all").index]

Result:


company A   company B
2021 2023   2021 2023
NaN  NaN     NaN  6.0
abhi
  • 337
  • 1
  • 3
  • 12
1

In addition to the answer provided by @J_H and @mozway, I used the following functions to display the full DataFrame after the sub-indexed columns are filtered for non-Null values.

cols = pd.MultiIndex.from_tuples([('company A', '2021'), ('company A', '2023'), ('company B', '2021'), ('company B', '2023')])
df = pd.DataFrame([[np.nan,np.nan,np.nan,np.nan], [3,4,np.nan,np.nan], [np.nan,np.nan,np.nan,6]], columns=cols)

  company A      company B     
       2021 2023      2021 2023
0       NaN  NaN       NaN  NaN
1       3.0  4.0       NaN  NaN
2       NaN  NaN       NaN  6.0

df_A = df['company A'].dropna(how='all')

   2021  2023
1   3.0   4.0

df_A_ind= df_A.index.values.tolist()

[1]

df_filtered_a=df[df.index.isin(df_A_ind)]

  company A      company B     
       2021 2023      2021 2023
1       3.0  4.0       NaN  NaN
Ciara Spencer
  • 111
  • 2
  • 11
  • I hadn't understood that you wanted to keep all columns. IMO, using `dropna` in this context is not appropriate. You're generating a full intermediate DataFrame just to get its index and discard it immediately. Better use boolean indexing (see my answer). – mozway Aug 02 '23 at 06:07
1

If you want to filter rows based on a single company, don't use dropna, rather use boolean indexing:

out = df[df['company B'].notna().any(axis=1)]

Output:

  company A      company B     
       2021 2023      2021 2023
2       NaN  NaN       NaN  6.0

Alternatively, use the subset parameter of dropna:

cols = df.columns[df.columns.get_loc('company B')]

out = df.dropna(subset=cols, how='all')

mozway
  • 194,879
  • 13
  • 39
  • 75
  • What is the reason for using Boolean Indexing rather than dropna? – Ciara Spencer Aug 02 '23 at 06:38
  • As I commented below your answer: efficiency. With boolean indexing the intermediate is a Series. With `dropna`+`index` you generate a full DataFrame with all the original data as intermediate, just to use its Index. – mozway Aug 02 '23 at 06:46
  • That part made sense. I thought maybe you were suggesting a reason for it's preference over a similarly simple answer using .dropna() such as that used in @abhi 's answer. – Ciara Spencer Aug 02 '23 at 06:56