4

I have pandas.df 233 rows * 234 columns and I need to evaluate every cell and return corresponding column header if not nan, so far I wrote the following:

#First get a list of all column names (except column 0):

col_list=[]

for column in df.columns[1:]:
    col_list.append(column)

#Then I try to iterate through every cell and evaluate for Null
#Also a counter is initiated to take the next col_name from col_list
#when count reach 233

for index, row in df.iterrows():
    count = 0
    for x in row[1:]:
        count = count+1
        for col_name in col_list:
            if count >= 233: break
            elif str(x) != 'nan':
                print col_name 

The code does not do exactly that, what do I need to change to get the code to break after 233 rows and go to the next col_name?

Example:

    Col_1   Col_2    Col_3
1    nan     13       nan
2    10      nan      nan
3    nan      2        5
4    nan     nan       4

output:      
1   Col_2
2   Col_1
3   Col_2
4   Col_3
5   Col_3
haytham
  • 502
  • 4
  • 22

3 Answers3

6

I think you need if first column is index stack - it remove all NaNs and then get values from second level of Multiindex by reset_index and selecting or by Series constructor with Index.get_level_values:

s = df.stack().reset_index()['level_1'].rename('a')
print (s)
0    Col_2
1    Col_1
2    Col_2
3    Col_3
4    Col_3
Name: a, dtype: object

Or:

s = pd.Series(df.stack().index.get_level_values(1))
print (s)
0    Col_2
1    Col_1
2    Col_2
3    Col_3
4    Col_3
dtype: object

If need output as list:

L = df.stack().index.get_level_values(1).tolist()
print (L)
['Col_2', 'Col_1', 'Col_2', 'Col_3', 'Col_3']

Detail:

print (df.stack())
1  Col_2    13.0
2  Col_1    10.0
3  Col_2     2.0
   Col_3     5.0
4  Col_3     4.0
dtype: float64
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • How to include "NaN" (or similar) for rows with only "nan" (i.e. do not skip / drop all-nan- rows)? – nutty about natty Apr 01 '22 at 09:34
  • 1
    @nuttyaboutnatty - use `L = df.stack(dropna=False).index.get_level_values(1).tolist()` – jezrael Apr 01 '22 at 09:35
  • Doesn't work (as intended)... I have 43 rows, with only 2 rows "nan" throughout; in your original answer I get a column with 41 rows, i.e. those 2 rows are skipped; with ´dropna=False´ I get 215 rows - and none of which have "nan". Also, **for the 2 skipped rows I'd like to see "nan" (instead of the respective column heading)** in the new column... (Why 215 ? Because in my case I have 5 columns + 43 rows...) – nutty about natty Apr 01 '22 at 09:55
  • @nuttyaboutnatty - How looks expected ouput if `nan = np.nan` `df = pd.DataFrame({'Col_1': [nan, 10.0, nan, nan], 'Col_2': [13.0, nan, 2.0, nan], 'Col_3': [nan, nan, nan, nan]})` ? – jezrael Apr 01 '22 at 10:08
  • Related / useful Q&A's: [How do I transform a Panda Dataframe from 5 columns into 1 column using column header as value to populate?](https://stackoverflow.com/questions/49404068/how-do-i-transform-a-panda-dataframe-from-5-columns-into-1-column-using-column-h), [How to collapse multiple columns into one in pandas](https://stackoverflow.com/a/51103859/2153622) (bfill / fillna) and [How to transform multiple boolean columns to one column with column headers and NaN?](https://stackoverflow.com/questions/67095325/how-to-transform-multiple-boolean-columns-to-one-column-with-column-headers-and). – nutty about natty Apr 01 '22 at 10:22
  • @nuttyaboutnatty - how looks expected ouput from DataFrame posted in my comment above? – jezrael Apr 01 '22 at 10:30
  • For that 3 columns x 4 rows dataframe I get **12 entries** (*the equivalent to the 215 entries in the earlier comment above*) in that column (`print(s)`, where the modified `s` would be `s = pd.Series(df.stack(dropna=False).index.get_level_values(1))` ). - ***Instead***, I'd want to see **only 4 entries** in `s`, with the fourth being "nan". That is, for this dataframe I'd want to generate a column with four entries (i.e., rows) **Col_2, Col_1, Col_2, N/A**. – nutty about natty Apr 01 '22 at 10:42
  • 1
    @nuttyaboutnatty - I think you should post question. Answer should be `df = df.assign(**{'N/A': np.where(df.isna().all(axis=1), 1, np.nan)}).stack().index.get_level_values(1).tolist()` – jezrael Apr 01 '22 at 10:48
  • "Problem solved", but still need to get my head around (and my hands dirty with) ["multi-index"](https://towardsdatascience.com/working-with-multi-index-pandas-dataframes-f64d2e2c3e02); cool stuff. Happy to spawn a separate question from the comments above (soon). Many thanks to you! – nutty about natty Apr 01 '22 at 11:14
  • @nuttyaboutnatty - hmmm, it is up to you. You are welcome! – jezrael Apr 01 '22 at 11:16
  • Another expansion of the original problem/question: how can I exclude/ignore some non-empty columns for the above? E.g., if I want to ignore columns 'bla' and 'bla bla' from this `df = pd.DataFrame({'Col_1': [nan, 10.0, nan, nan], 'Col_2': [13.0, nan, 2.0, nan], 'Col_3': [nan, nan, nan, nan], 'bla':['blaa','blaa','blaa','blaa'], 'bla bla':['blaaa','blaaa','blaaa','blaaa'], })` ? – nutty about natty Apr 05 '22 at 11:57
  • @nuttyaboutnatty - I think simpliest is remove them `df = df.drop(['bla','bla bla'], axis=1).assign(**{'N/A': np.where(df.isna().all(axis=1), 1, np.nan)}).stack().index.get_level_values(1).tolist()` – jezrael Apr 05 '22 at 11:59
  • I assume I'd need to tweak the `isna().all`-part... yesterday I applied the above to a subset of the original dataframe which is okay / workable but less elegant... – nutty about natty Apr 05 '22 at 12:00
  • wow, that was quick! I'd want to keep those columns, but only add that extra column with the column names for a subset of the columns... so no drop the 'bla'-columns completely, just ignore them for the extra column... `df = pd.DataFrame({'Col_1': [nan, 10.0, nan, nan], 'Col_2': [13.0, nan, 2.0, nan], 'Col_3': [nan, nan, nan, nan], 'bla':['blaa','blaa','blaa','blaa'], 'bla bla':['blaaa','blaaa','blaaa','blaaa'] })` (erroneous comma removed) – nutty about natty Apr 05 '22 at 12:02
  • ... or is there a neat / common way to ["temporarily" drop / ignore](https://stackoverflow.com/questions/14940743/selecting-excluding-sets-of-columns-in-pandas) (those) columns? – nutty about natty Apr 05 '22 at 12:10
  • @nuttyaboutnatty - in another words replace all values to `NaN`s in these columns? – jezrael Apr 05 '22 at 12:12
  • nope; ignore them so they won't add to the count of items in the list (or column). just pretent they weren't there. ultimately, just want this extra column listing the column headings for col_1 .. col_3 only; this column should have the same number of rows as the original df. – nutty about natty Apr 05 '22 at 12:13
  • @nuttyaboutnatty `just pretent they weren't there.` - I think it is not possible - only drop or replace content. – jezrael Apr 05 '22 at 12:14
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/243616/discussion-between-nutty-about-natty-and-jezrael). – nutty about natty Apr 05 '22 at 12:15
3

I'd use jezrael's stack solution.

However, if you're interested in Numpy way, which is usually faster.

In [4889]: np.tile(df.columns, df.shape[0])[~np.isnan(df.values.ravel())]
Out[4889]: array(['Col_2', 'Col_1', 'Col_2', 'Col_3', 'Col_3'], dtype=object)

Timings

In [4913]: df.shape
Out[4913]: (100, 3)

In [4914]: %timeit np.tile(df.columns, df.shape[0])[~np.isnan(df.values.ravel())]
10000 loops, best of 3: 35.8 µs per loop

In [4915]: %timeit df.stack().index.get_level_values(1)
1000 loops, best of 3: 335 µs per loop

In [4905]: df.shape
Out[4905]: (100000, 3)

In [4907]: %timeit np.tile(df.columns, df.shape[0])[~np.isnan(df.values.ravel())]
100 loops, best of 3: 5.98 ms per loop

In [4908]: %timeit df.stack().index.get_level_values(1)
100 loops, best of 3: 11.7 ms per loop

Choose based on your need (readability, speed, maintainability etc)

Zero
  • 74,117
  • 18
  • 147
  • 154
1

You can use dropna :

df.dropna(axis=1).columns

axis : {0 or ‘index’, 1 or ‘columns’}

how : {‘any’, ‘all’}

Basically you use dropna to remove the null, axis = 1 is dropping columns, and how="any" to remove is at least one in the columns is null, .columns get the remaining header.

Tbaki
  • 1,013
  • 7
  • 12