
I have a dataframe of shape (42538, 145) in which over 50 columns have NaN values for all rows.

(Example of the table was provided as an image.)

I would like to drop these columns without specifying each and every column name in df.drop.

Prasad
  • This question has been answered here: https://stackoverflow.com/questions/13413590/how-to-drop-rows-of-pandas-dataframe-whose-value-in-certain-columns-is-nan – Soumik Rakshit Oct 10 '18 at 11:37
    You really shouldn't provide sample data as an image. It's impossible to paste an image into a text editor and play with it. What you definitely should post is an expected output and any code you've tried so far. – Mad Physicist Oct 10 '18 at 13:58

3 Answers


You probably want to start with df.notnull to get the locations of all the non-NaNs.

You can then use df.any on the result with axis=0 to check, for each column, whether it contains any non-NaN value.

The resulting boolean series can be used to index your columns: Pandas Select DataFrame columns using boolean. There are a couple of different options:

df = df.iloc[:, df.notnull().any(axis=0).values]

sel = df.notnull().any(axis=0)
df = df.loc[:, sel]
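A quick sketch of the boolean-mask approach with a tiny hypothetical frame (column names made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame: column 'B' is entirely NaN, 'A' is only partially NaN.
df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [np.nan] * 3})

# True for every column that has at least one non-NaN value.
sel = df.notnull().any(axis=0)

# Keep only those columns; 'A' survives, the all-NaN 'B' is dropped.
df_clean = df.loc[:, sel]
print(df_clean.columns.tolist())
```

Note this keeps partially-NaN columns like 'A', which matches the question: only columns that are NaN for *all* rows are dropped.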
Mad Physicist

You can use pd.DataFrame.dropna with axis=1:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [np.nan]*3,
                   'C': [4, 5, 6], 'D': [np.nan]*3})

df_new = df.dropna(axis=1)

print(df_new)

   A  C
0  1  4
1  2  5
2  3  6
jpp
  • Furthermore, you can set the how parameter to 'all': pandas.DataFrame.dropna(how='all') removes a row or column only if all of its values are NA. By default, how='any'. [pandas.DataFrame.dropna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) – Hank Gordon Jul 02 '21 at 09:12
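To illustrate the how='all' variant from the comment above, a minimal sketch (column names hypothetical):

```python
import numpy as np
import pandas as pd

# 'B' is all-NaN; 'D' is only partially NaN.
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [np.nan] * 3,
                   'D': [4, np.nan, 6]})

# how='all' drops only the columns where every value is NaN.
kept_all = df.dropna(axis=1, how='all').columns.tolist()
print(kept_all)   # ['A', 'D']

# The default how='any' would also drop the partially-NaN 'D'.
kept_any = df.dropna(axis=1).columns.tolist()
print(kept_any)   # ['A']
```

For the question as asked (columns that are NaN for every row), how='all' is the safer choice, since it leaves partially-NaN columns untouched.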

Try this:

tmp_col = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21]
df1.columns = tmp_col

df2 = df1[(df1[2] == 'RO En') | (df1[2] == 'RO En Adj')]
df2[['bp1','bp2']] = df2[6].str.split('-',expand=True)
df2[['mn1','mn2']] = df2[11].str.split(' ',expand=True)

df2['FN'] = df2[10] + ' ' + df2[11]


df2.loc[df2[2] == 'RO ', 'RT'] = ''
df2.loc[df2[2] == 'RO ', ''] = ''
df2.loc[df2[2] == 'RO ', ''] = df2['bp1']
df2.loc[df2[2] == 'RO ', ''] = df2[12]


df3 = df2[df2[8].str.contains('')]
df4 = df2[df2[8].str.contains('')]

print(df3,df4)

pm_col = []

df3 = df3[[10,11,'BP',16,15,15,17,15,'RT',14,21,19,'FN','mn2']]

df3.columns = pm_col

df4 = df4[[10,11,'BP',16,15,15,17,15,'RT',14,21,19,'FN','mn2']]
df4.columns = pm_col
mayank584
  • While this code may solve the question, [including an explanation](//meta.stackexchange.com/q/114762) of how and why this solves the problem would really help to improve the quality of your post, and probably result in more up-votes. Remember that you are answering the question for readers in the future, not just the person asking now. Please [edit] your answer to add explanations and give an indication of what limitations and assumptions apply. – rizerphe Jul 03 '21 at 03:18
  • Please format your code properly, [click here to learn how](https://stackoverflow.com/help/formatting). – rizerphe Jul 03 '21 at 03:20