
I tried to create a small reproducible example for my issue, but did not manage to reproduce it. The attempt is attached at the end of the question.

I am currently reading 27 Excel files.

All of them are formatted in the same way (the columns appear to be consistent with each other).

I am appending this data into a single DataFrame with 9828 rows and three columns.

One column, called "Target", should contain only floats or integers.

In practice, the cells of the Target column are read as:

  1. Floats
  2. In one single case, an empty cell is read as a string
  3. In the other cases, empty cells are read as None
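
A quick way to see what actually ends up in such a column is to map every cell to its Python type. A minimal sketch, with made-up values mimicking the three cases above:

```python
import numpy as np
import pandas as pd

# hypothetical mixed column: floats, an empty string, None, NaN
s = pd.Series([1.0, 2.5, "", None, np.nan])

print(s.dtype)  # object: pandas falls back to generic Python objects
print(s.map(lambda v: type(v).__name__).value_counts())
```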

In order to understand which rows of the DataFrame contain a float/None/string, I wrote a very inefficient check that splits the DataFrame by data type:

`df_t` is a DataFrame with three columns, one of which is called Target. The Target column contains 9398 non-null objects; the other two columns contain 9828 non-null objects each.

    df_string = pd.DataFrame()
    df_float = pd.DataFrame()
    df_null = pd.DataFrame()

    for i in range(len(df_t.Target)):
        if type(df_t.Target.loc[i]) == str:
            df_string = df_string.append(df_t.loc[i])
        if type(df_t.Target.loc[i]) == int or type(df_t.Target.loc[i]) == float:
            df_float = df_float.append(df_t.loc[i])
        else:
            df_null = df_null.append(df_t.loc[i])
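
The duplicate can be seen with a minimal, self-contained version of this loop (the column values are made up, and `pd.concat` stands in for `DataFrame.append`, which was removed in pandas 2.0):

```python
import pandas as pd

# toy column: one float, one string, one empty cell
df_t = pd.DataFrame({"Target": [1.5, "oops", None]})

df_string = pd.DataFrame()
df_float = pd.DataFrame()
df_null = pd.DataFrame()

for i in range(len(df_t.Target)):
    val = df_t.Target.loc[i]
    if type(val) == str:
        df_string = pd.concat([df_string, df_t.loc[[i]]])
    if type(val) == int or type(val) == float:
        df_float = pd.concat([df_float, df_t.loc[[i]]])
    else:
        # a string fails the second `if` too, so it also lands here
        df_null = pd.concat([df_null, df_t.loc[[i]]])

print(df_string["Target"].tolist())  # ['oops']
print(df_null["Target"].tolist())    # ['oops', None]
```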

I can't understand why:

  1. There is a duplicate: `df_string` and `df_null` contain the same value.

  2. The float DataFrame still contains NaN values.

My expected output was:

  1. Only floats and ints in one DataFrame
  2. Only NaN values in df_null
  3. Only strings in df_string

The most elegant way I found to extract NaN values was in this answer, but I think my issue is slightly different:

    df.loc[~df.index.isin(df.dropna().index)]
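
For reference, a self-contained sketch of that trick on a made-up frame (the column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Target": [1.0, np.nan, "x"], "Other": [10, 20, 30]})

# keep the rows that dropna() would discard, i.e. rows with any NaN
nan_rows = df.loc[~df.index.isin(df.dropna().index)]
print(nan_rows.index.tolist())  # [1]
```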

I also tried to find resources and read the documentation to understand why pd.read_excel() behaves this way, but I didn't find anything useful.

Here is the code I wrote for a reproducible example (the Series must be named "Target", otherwise `pd.concat` puts their values into a separate column):

    import numpy as np
    import pandas as pd

    np.random.seed(0)  # seed so the values are reproducible
    a = pd.Series(np.random.uniform(1.00, 100.00, 9000), name="Target")
    b = pd.Series(np.full(400, np.nan), name="Target")
    c = pd.Series(["None"], name="Target")  # the one cell read as a string
    start_dict = {"Target": [0, 2, 3.5]}
    df_t = pd.DataFrame(start_dict)
    df_t = pd.concat([df_t, a, b, c], axis=0, ignore_index=True)

Here is my actual output when I print df_string and df_null:

[screenshot: output of df_string and df_null]

Here, highlighted in yellow, the reference cell in the xlsx file:

[screenshot: the cell in the xlsx file]

Andrea Ciufo

  • Always set a seed for random data, `np.random.seed(###)`, so values are reproducible. – Parfait Feb 07 '20 at 15:07
  • Can't quite see what you mean in the reproducible example. Please show the issue you claim with *there is a duplicate. `df_float` and `df_null` contain the same value*. – Parfait Feb 07 '20 at 15:13
  • @Parfait 100% agree about the seed; my mistake, since my example was not working. What I do not understand is why the cell read as a string in this categorization example is also counted in the `df_null` branch. I assumed it would be counted only in `df_string`. – Andrea Ciufo Feb 07 '20 at 17:52
  • Please show rather than tell this special string counted in both. NaNs can be of *any* type. – Parfait Feb 07 '20 at 22:21
  • @Parfait I made a mistake typing my question: I wrote `df_float` but I meant `df_string`. The mistake was only in the question, not in the code. I am very sorry about this; I hope the question is clearer now. – Andrea Ciufo Feb 11 '20 at 13:17
  • I still do not understand your problem. A `NaN` can exist in *any* column, including a string column, so yes, a row can appear in both *df_string* and *df_null*. – Parfait Feb 11 '20 at 13:25
  • @Parfait Because the same cell is counted twice, once as NaN and once as a string. My goal was to analyze a specific column, so each cell should end up in exactly one counter. It makes sense that NaN can exist in any column, but here I want to read one column and identify the data type of each cell. – Andrea Ciufo Feb 24 '20 at 16:45

1 Answer


Checking `type` is not reliable here. Since the column has mixed types, pandas stores it with the `object` dtype, so every cell is a plain Python object. That also explains your two observations: `np.nan` is itself of type `float`, which is why NaN rows end up in `df_float`; and a string passes your first `if` but then falls into the `else` of the second `if` (the two tests are independent, not an `if`/`elif`/`else` chain), which is why it lands in both `df_string` and `df_null`. However, you can do the categorization in another way.

Rather than appending to DataFrames inside a function (the assignments would only create local variables, and `DataFrame.append` has been removed in pandas 2.0 anyway), write a function that returns a label for each cell. A slightly modified version of this question does the trick:

    def categorize(x):
        # the NaN check must come first: float(np.nan) would succeed below
        if pd.isna(x):
            return 'null'
        try:
            float(x)
        except (TypeError, ValueError):
            return 'string'
        return 'float'

Now all you have to do is apply the function to the Target column (note that `Series.apply` has no `axis` argument) and use the resulting labels as boolean masks:

    cats = df_t['Target'].apply(categorize)
    df_null = df_t[cats == 'null']
    df_string = df_t[cats == 'string']
    df_float = df_t[cats == 'float']
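
A quick self-contained check of this labeling approach on a toy column (the values are made up):

```python
import numpy as np
import pandas as pd

# made-up column mixing ints, a float, a NaN and a string
df_t = pd.DataFrame({"Target": [0, 2, 3.5, np.nan, "None"]})

def categorize(x):
    # NaN check must come first: float(np.nan) would succeed below
    if pd.isna(x):
        return "null"
    try:
        float(x)
    except (TypeError, ValueError):
        return "string"
    return "float"

cats = df_t["Target"].apply(categorize)
df_float = df_t[cats == "float"]
df_string = df_t[cats == "string"]
df_null = df_t[cats == "null"]

print(len(df_float), len(df_string), len(df_null))  # 3 1 1
```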