
I tried to create a small reproducible example for my issue, but did not manage to reproduce it. The attempt is attached at the end of the question.

I am currently reading 27 Excel files.

All of them are formatted in the same way (the columns appear to be consistent with each other).

I am appending this data into a single DataFrame with 9828 rows and three columns.

One column, called "Target", should contain only floats or integers.

In practice, the cells of the Target column are read as:

  1. Floats
  2. In one single case, an empty cell is read as a string
  3. In the other cases, empty cells are read as None
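
A quick way to see what actually ends up in such a column is to map every cell to its Python type. A minimal sketch, with made-up values mimicking the three cases above:

```python
import numpy as np
import pandas as pd

# hypothetical mixed column: floats, an empty string, None, NaN
s = pd.Series([1.0, 2.5, "", None, np.nan])

print(s.dtype)  # object: pandas falls back to generic Python objects
print(s.map(lambda v: type(v).__name__).value_counts())
```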

In order to understand which rows of the DataFrame contain a float/None/string, I wrote a very inefficient check that splits the DataFrame by data type:

`df_t` is a DataFrame with three columns, one of which is called Target. The Target column contains 9398 non-null objects; the other two columns contain 9828 non-null objects each.

    df_string = pd.DataFrame()
    df_float = pd.DataFrame()
    df_null = pd.DataFrame()

    for i in range(len(df_t.Target)):
        if type(df_t.Target.loc[i]) == str:
            df_string = df_string.append(df_t.loc[i])
        if type(df_t.Target.loc[i]) == int or type(df_t.Target.loc[i]) == float:
            df_float = df_float.append(df_t.loc[i])
        else:
            df_null = df_null.append(df_t.loc[i])
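
The duplicate can be seen with a minimal, self-contained version of this loop (the column values are made up, and `pd.concat` stands in for `DataFrame.append`, which was removed in pandas 2.0):

```python
import pandas as pd

# toy column: one float, one string, one empty cell
df_t = pd.DataFrame({"Target": [1.5, "oops", None]})

df_string = pd.DataFrame()
df_float = pd.DataFrame()
df_null = pd.DataFrame()

for i in range(len(df_t.Target)):
    val = df_t.Target.loc[i]
    if type(val) == str:
        df_string = pd.concat([df_string, df_t.loc[[i]]])
    if type(val) == int or type(val) == float:
        df_float = pd.concat([df_float, df_t.loc[[i]]])
    else:
        # a string fails the second `if` too, so it also lands here
        df_null = pd.concat([df_null, df_t.loc[[i]]])

print(df_string["Target"].tolist())  # ['oops']
print(df_null["Target"].tolist())    # ['oops', None]
```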

I can't understand why:

  1. There is a duplicate: `df_string` and `df_null` contain the same value.

  2. The float DataFrame still contains NaN values.

My expected output was:

  1. Only floats and ints in one DataFrame
  2. Only NaN values in df_null
  3. Only strings in df_string

The most elegant way I found to extract NaN values was in this answer, but I think my issue is slightly different:

    df.loc[~df.index.isin(df.dropna().index)]
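
For reference, a self-contained sketch of that trick on a made-up frame (the column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Target": [1.0, np.nan, "x"], "Other": [10, 20, 30]})

# keep the rows that dropna() would discard, i.e. rows with any NaN
nan_rows = df.loc[~df.index.isin(df.dropna().index)]
print(nan_rows.index.tolist())  # [1]
```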

I also tried to find resources and read the documentation to understand why pd.read_excel() behaves this way, but I didn't find anything useful.

Here is the code I wrote for a reproducible example (the Series must be named "Target", otherwise `pd.concat` puts their values into a separate column):

    import numpy as np
    import pandas as pd

    np.random.seed(0)  # seed so the values are reproducible
    a = pd.Series(np.random.uniform(1.00, 100.00, 9000), name="Target")
    b = pd.Series(np.full(400, np.nan), name="Target")
    c = pd.Series(["None"], name="Target")  # the one cell read as a string
    start_dict = {"Target": [0, 2, 3.5]}
    df_t = pd.DataFrame(start_dict)
    df_t = pd.concat([df_t, a, b, c], axis=0, ignore_index=True)

Here is my actual output when I print df_string and df_null:

[screenshot: output of df_string and df_null]

Here, highlighted in yellow, the reference cell in the xlsx file:

[screenshot: the cell in the xlsx file]

Andrea Ciufo

  • Always set a seed for random data, `np.random.seed(###)`, so values are reproducible. – Parfait Feb 07 '20 at 15:07
  • Can't quite see what you mean in the reproducible example. Please show the issue you claim with *there is a duplicate. `df_float` and `df_null` contain the same value*. – Parfait Feb 07 '20 at 15:13
  • @Parfait 100% agree about the seed; my mistake, since my example was not working. What I do not understand is why the cell read as a string in this categorization example is also counted in the `df_null` branch. I assumed it would be counted only in `df_string`. – Andrea Ciufo Feb 07 '20 at 17:52
  • Please show rather than tell this special string counted in both. NaNs can be of *any* type. – Parfait Feb 07 '20 at 22:21
  • @Parfait I made a mistake typing my question: I wrote `df_float` but I meant `df_string`. The mistake was only in the question, not in the code. I am very sorry about this; I hope the question is clearer now. – Andrea Ciufo Feb 11 '20 at 13:17
  • I still do not understand your problem. A `NaN` can exist in *any* column, including a string column, so yes, a row can appear in both *df_string* and *df_null*. – Parfait Feb 11 '20 at 13:25
  • @Parfait Because the same cell is counted twice, once as NaN and once as a string. My goal was to analyze a specific column, so each cell should end up in exactly one counter. It makes sense that NaN can exist in any column, but here I want to read one column and identify the data type of each cell. – Andrea Ciufo Feb 24 '20 at 16:45

1 Answer


Checking `type` is not reliable here. Since the column has mixed types, pandas stores it with the `object` dtype, so every cell is a plain Python object. That also explains your two observations: `np.nan` is itself of type `float`, which is why NaN rows end up in `df_float`; and a string passes your first `if` but then falls into the `else` of the second `if` (the two tests are independent, not an `if`/`elif`/`else` chain), which is why it lands in both `df_string` and `df_null`. However, you can do the categorization in another way.

Rather than appending to DataFrames inside a function (the assignments would only create local variables, and `DataFrame.append` has been removed in pandas 2.0 anyway), write a function that returns a label for each cell. A slightly modified version of this question does the trick:

    def categorize(x):
        # the NaN check must come first: float(np.nan) would succeed below
        if pd.isna(x):
            return 'null'
        try:
            float(x)
        except (TypeError, ValueError):
            return 'string'
        return 'float'

Now all you have to do is apply the function to the Target column (note that `Series.apply` has no `axis` argument) and use the resulting labels as boolean masks:

    cats = df_t['Target'].apply(categorize)
    df_null = df_t[cats == 'null']
    df_string = df_t[cats == 'string']
    df_float = df_t[cats == 'float']
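
A quick self-contained check of this labeling approach on a toy column (the values are made up):

```python
import numpy as np
import pandas as pd

# made-up column mixing ints, a float, a NaN and a string
df_t = pd.DataFrame({"Target": [0, 2, 3.5, np.nan, "None"]})

def categorize(x):
    # NaN check must come first: float(np.nan) would succeed below
    if pd.isna(x):
        return "null"
    try:
        float(x)
    except (TypeError, ValueError):
        return "string"
    return "float"

cats = df_t["Target"].apply(categorize)
df_float = df_t[cats == "float"]
df_string = df_t[cats == "string"]
df_null = df_t[cats == "null"]

print(len(df_float), len(df_string), len(df_null))  # 3 1 1
```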