I am trying to read a excel file using pandas. I am interested in reading only the relevant data from the excel file, i.e dropping the rows/column containing 'nan' value. I have encountered a problem where the first row of dataframe contains 'Unnamed' values. It is never fixed that my header would start from which row, therefore I am avoiding to use skiprows and header.
On using the below mentioned command it removed almost all the data from the dataframe as it is treating Unnamed as the header.
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
I have used the following command to clean my data:
data = pd.read_excel("text.xlsx", sheet_name=1,index=False)
print(data)
BINS 2018-RUI: Red Roof Inn Portfolio Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 5
0 NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN
3 No. Property \nID Property Name Street Address
4 1.001 10228 Red Roof Plus 777 Airport Boulevard
5 1.002 10150 Red Roof Plus1 15 Meadowlands Parkway
6 1.003 10304 Red Roof Inn Boulevard Seattle
data1 = data.dropna(axis = 0, how = 'all', thresh=3)
data2 = data1.dropna(axis = 1, how = 'all')
print(data2)
BINS 2018-RUI: Red Roof Inn Portfolio Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 5
3 No. Property \nID Property Name Street Address
4 1.001 10228 Red Roof Plus 777 Airport Boulevard
5 1.002 10150 Red Roof Plus1 15 Meadowlands Parkway
6 1.003 10304 Red Roof Inn Boulevard Seattle
Expected output:
3 No. Property \nID Property Name Street Address
4 1.001 10228 Red Roof Plus 777 Airport Boulevard
5 1.002 10150 Red Roof Plus1 15 Meadowlands Parkway
6 1.003 10304 Red Roof Inn Boulevard Seattle
I do not want the first rows having Unnamed written on cells. (This is small subset of data, actual data has 100 rows and 100 columns)