Based on my specific issue I tried to create a small reproducible example without coming to a point. It is attached at the end of the question.
I am currently reading 27 excel files.
All formatted in the same way (apparently seems that all the column are consistent each other).
I am appending this data in a single DataFrame composed by 9828 rows and three columns.
One column called "Target" should be composed only by float or interger.
Actually cells that compose Target column are read as:
- Float
- In only one case an empty cell is read as a string
- In other cases empty cells are considered
None
In order to understand which specific Column DataFrame row contains float/none/string I created a very inefficient way to check and split the dataframe based on the datatype
dt_t is a dataframe with 3 columns one of this is called Target
This column is composed by 9398 non-null object.
The other two by 9828 non-null object.
df_string=pd.DataFrame()
df_float=pd.DataFrame()
df_null=pd.DataFrame()
for i in range(len(df_t.Target)):
if type(df_t.Target.loc[i])==str:
df_string=df_string.append(df_t.loc[i])
if type(df_t.Target.loc[i])==int or type(df_t.Target.loc[i])==float:
df_float=df_float.append(df_t.loc[i])
else:
df_null=df_null.append(df_t.loc[i])
I can't understand why:
There is a duplicate.
df_string
anddf_null
contain the same valueIn the float dataframe I still have
nan
values
My expected output was to see
- Only float and int in one dataframe
- Only nan in df_null
- Only string in df_string dataframe
The most elegant way to extract nan values was in this answer but I think my issue is slightly different
df.loc[~df.index.isin(df.dropna().index)]
I tried also to find some resources and read the documentation to understand why pd.read_excel()
has this behaviour, but I didn't find anything useful.
Here the code I tried to create for a reproducible question (but is not working):
a=pd.Series(np.random.uniform(1.00,100.00,9000))
b=np.empty(400)
b=pd.Series(np.full_like(b,np.nan))
c=pd.Series('None')
start_dict={"Target":[0,2,3.5] }
df_t=pd.DataFrame(start_dict)
df_t=pd.concat([df_t,a],axis=0,ignore_index=True)
df_t=pd.concat([df_t,b],axis=0,ignore_index=True)
df_t=pd.concat([df_t,c],axis=0,ignore_index=True)
Here my actual Output when I print df_string
and df_null
:
Here in yellow the reference cell in the xlsx
file: