4

I am trying to read a excel file using pandas. I am interested in reading only the relevant data from the excel file, i.e dropping the rows/column containing 'nan' value. I have encountered a problem where the first row of dataframe contains 'Unnamed' values. It is never fixed that my header would start from which row, therefore I am avoiding to use skiprows and header.

On using the below mentioned command it removed almost all the data from the dataframe as it is treating Unnamed as the header.

df = df.loc[:, ~df.columns.str.contains('^Unnamed')]

I have used the following command to clean my data:

data = pd.read_excel("text.xlsx", sheet_name=1,index=False)
print(data)
BINS 2018-RUI: Red Roof Inn Portfolio   Unnamed: 1  Unnamed: 2  Unnamed: 3  Unnamed: 5
0   NaN       NaN       NaN     NaN     NaN     NaN
1   NaN       NaN       NaN     NaN     NaN     NaN
2   NaN       NaN       NaN     NaN     NaN     NaN
3   No. Property \nID   Property Name       Street Address  
4   1.001   10228        Red Roof Plus      777 Airport Boulevard       
5   1.002   10150        Red Roof Plus1     15 Meadowlands Parkway      
6   1.003   10304        Red Roof Inn       Boulevard Seattle
data1 = data.dropna(axis = 0, how = 'all', thresh=3)
data2 = data1.dropna(axis = 1, how = 'all')
print(data2)

BINS 2018-RUI: Red Roof Inn Portfolio   Unnamed: 1  Unnamed: 2  Unnamed: 3  Unnamed: 5
3   No. Property \nID   Property Name       Street Address  
4   1.001   10228        Red Roof Plus      777 Airport Boulevard       
5   1.002   10150        Red Roof Plus1     15 Meadowlands Parkway      
6   1.003   10304        Red Roof Inn       Boulevard Seattle   

Expected output:

3   No. Property \nID   Property Name       Street Address  
4   1.001   10228        Red Roof Plus      777 Airport Boulevard       
5   1.002   10150        Red Roof Plus1     15 Meadowlands Parkway      
6   1.003   10304        Red Roof Inn       Boulevard Seattle

I do not want the first rows having Unnamed written on cells. (This is small subset of data, actual data has 100 rows and 100 columns)

sid
  • 61
  • 1
  • 4
  • did you try [`skiprows`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html). Actually this question is maybe a duplicate of [this](https://stackoverflow.com/questions/27325652/python-pandas-read-csv-skip-rows-but-keep-header) – Quickbeam2k1 Nov 10 '19 at 10:46
  • 1
    Your question is unclear. What is the output of `df.loc[:, ~df.columns.str.contains('^Unnamed')]` and what is wrong with it? – gosuto Nov 10 '19 at 11:13
  • import pandas as pd import numpy as np data = pd.read_excel("work.xlsx",header=5) data = data.dropna(axis = 1) This will skip the first 5 rows assign header with content in that row – Sebin Sunny Nov 10 '19 at 11:24

1 Answers1

3

Given that you do not know how many rows to skip, removing all NA values like you do is fine.

The missing steps are to set the first (notna) row as header:

data.columns = data.iloc[0]

and then remove that row from the dataset:

data = data.iloc[1:,].reindex()
hirolau
  • 13,451
  • 8
  • 35
  • 47
  • +1 This helped me in my case also. Could you be kind to look at my [question here](https://stackoverflow.com/questions/63482547/skip-initial-empty-rows-and-columns-while-reading-in-pandas/) and help me if possible? It is almost similar to this. – Avinash Aug 20 '20 at 05:42