Taking the column names from the first row that has less than x Nan's

Question

I have data as follows:

import pandas as pd

url_cities="https://population.un.org/wup/Download/Files/WUP2018-F12-Cities_Over_300K.xls"
df_cities = pd.read_excel(url_cities)
print(df_cities.iloc[0:20,])

The column names can be found in row 15, but I would like this row number to be automatically determined. I thought the best way would be to take the first row for which the values are non-Nan for less than 10 items.

I combined this answer, to find this answer to do the following:

amount_Nan = df_cities.shape[1] - df_cities.count(axis=1)
# OR df.isnull().sum(axis=1).tolist()

print(amount_Nan)
col_names_index = next(i for i in amount_Nan if i < 3)
print(col_names_index)

df_cities.columns = df_cities.iloc[col_names_index]

The problem is that col_names_index keeps returning 0, while it should be 15. I think it is because amount_Nan returns rows and columns because of which next(i for i in amount_Nan if i < 3) works differently than expected.

The thing is that I do not really understand why. Can anyone help?

jezrael · Accepted Answer · 2022-05-23T07:49:26.923

1

IIUC you can get first index of non missing value per second column by DataFrame.iloc with Series.notna and Series.idxmax, set columns names by this row and filter out values before this row by index:

i = df_cities.iloc[:, 1].notna().idxmax()
              
df_cities.columns = df_cities.iloc[i].tolist()
df_cities = df_cities.iloc[i+1:]

edited May 23 '22 at 07:49

answered May 23 '22 at 07:42

jezrael

822,522
95
1,334
1,252

1

Thank you very much for your answer, I believe `df_cities = df_cities.iloc[i:]` should maybe be `df_cities = df_cities.iloc[i+1:]` right? – Tom May 23 '22 at 07:47
@Tom - You are right, just tested. – jezrael May 23 '22 at 07:49

Taking the column names from the first row that has less than x Nan's

1 Answers1