I am trying to apply a function convert_label() to the column CR_df['label'] of my dataframe and store the outputs in a separate column, CR_df['y']. However, CR_df['label'] has cells with NaN values. I want to apply my function only to the cells in CR_df['label'] that are not NaN. If a cell is NaN, I want NaN in the corresponding CR_df['y'] cell.

I don't want to check if I have NaN values, I need to return NaN if NaN.

My (error-prone) attempt at a solution

def convert_label(label):
    if "pos" in label:
        output = 1.0
    elif "neg" in label:
        output = 0.0
    else:
        output = label
    return output

I have tried converting NaN to a string and then applying my function, but now I need to change every "nan" string in CR_df['y'] back to an actual NaN or null value.

CR_df['y'] = CR_df['label'].astype(str).apply(convert_label)

I've attached a picture of my output

Also, here is the code for my dataframe

import pandas as pd

CR_train_file = 'data/custrev_train.tsv'
CR_test_file = 'data/custrev_test.tsv'

CR_train_df = pd.read_csv(CR_train_file, sep='\t', header=None)
CR_train_df.columns = ['index', 'label', 'review']
CR_test_df = pd.read_csv(CR_test_file, sep='\t', header=None)
CR_test_df.columns = ['index', 'review']
# The test set has no 'label' column, so its rows get NaN labels after concat.
CR_df = pd.concat([CR_train_df, CR_test_df], axis=0, ignore_index=True)

  • Does this answer your question? [How can I check for NaN values?](https://stackoverflow.com/questions/944700/how-can-i-check-for-nan-values) – richyen Dec 13 '19 at 21:11
  • No, it doesn't. My question is how to return NaN values – Ningkai Zheng Dec 13 '19 at 21:17
  • Could you add an example for your dataframe? – Soerendip Dec 13 '19 at 21:27
  • Example of dataframe is provided above – Ningkai Zheng Dec 13 '19 at 21:32
  • Instead of a picture, you could post the code to generate such a dataframe, next time. – Soerendip Dec 13 '19 at 21:33
  • Sorry about that! I've attached the code now – Ningkai Zheng Dec 13 '19 at 21:38
  • You're checking for a `substring in str`, but is that necessary? The values seem to be exact given your example. Could you use `df.label.map({'pos': 1, 'neg': 0}).fillna(df.label)`? The `fillna` deals with anything that wasn't mapped and satisfies your condition of keeping `NaN` as `NaN`. – ALollz Dec 13 '19 at 21:44
  • Please share code/data as text in the post itself, not as images. I know you already accepted an answer, but I’m not satisfied with it, one reason being what @ALollz pointed out. – AMC Dec 13 '19 at 22:47
  • Also, what possible values can `label` and `y` take? I feel like at least `y` should be a boolean. – AMC Dec 13 '19 at 23:05
  • @ALollz The `fillna()` isn't even necessary in that case, if the value is not in the dictionary the result is `NaN`. – AMC Dec 13 '19 at 23:21

3 Answers

0

You can use float('nan') to assign an actual NaN value to a variable:

>>> import math
>>> a = float('nan')
>>> math.isnan(a)
True
>>> b = 'nan'
>>> math.isnan(b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must be real number, not str
>>> 

In the above, a was assigned an actual NaN value, while b only holds the string 'nan'.
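
Applied to the question's column, the same idea could look like the sketch below. It assumes the str-cast approach from the question is kept and that no genuine label is literally the text "nan"; the three-row Series is made-up sample data, not the asker's file.

```python
import pandas as pd

def convert_label(label):
    # Function from the question, unchanged.
    if "pos" in label:
        output = 1.0
    elif "neg" in label:
        output = 0.0
    else:
        output = label
    return output

# Made-up sample standing in for CR_df['label'].
labels = pd.Series(["pos", "neg", float("nan")], name="label")

# Casting to str turns the missing value into the string "nan".
y_str = labels.astype(str).apply(convert_label)

# Swap every "nan" string for a real NaN (mask() fills with NaN by default).
y = y_str.mask(y_str == "nan")
```

Series.mask() is vectorized, so no explicit loop over the rows is needed for this last step.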

richyen
  • Is there a way to do this without looping through my dataframe? In addition, how would I apply the function directly and return NaN without having to convert everything to strings first? – Ningkai Zheng Dec 13 '19 at 21:33
  • @NingkaiZheng No to the first question. As for the second one, aren't the items already strings? – AMC Dec 16 '19 at 00:30
0

Unlike the currently accepted answer, which uses None for some reason, this does fulfill the asker's condition: "I don't want to check if I have NaN values, I need to return NaN if NaN."


Hmm, we're mapping what is essentially a binary variable (plus NaN, of course) to 0.0 or 1.0. Sounds to me like we need some booleans™.

import numpy as np
import pandas as pd

df_1 = pd.DataFrame(data=[('646', 'pos', 'bla bla 1'), ('2910', 'neg', 'bla bla 2'),
                          ('49', np.nan, 'bla bla 3')], columns=['index_num', 'label', 'review'])

# accessing columns like civilized beings
# labels missing from the dict (including NaN) map to NaN
df_1['y'] = df_1['label'].map({'pos': True, 'neg': False})

Before:

  index_num label     review
0       646   pos  bla bla 1
1      2910   neg  bla bla 2
2        49   NaN  bla bla 3

After:

  index_num label     review      y
0       646   pos  bla bla 1   True
1      2910   neg  bla bla 2  False
2        49   NaN  bla bla 3    NaN

Even if, hypothetically, we were to map to 0.0 or 1.0, there's still no excuse for an apply() with an entire function.
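
For completeness, a minimal sketch of that float variant, assuming the labels are exactly 'pos' and 'neg' as in the sample above (the small frame here is made up for illustration):

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing label.
df = pd.DataFrame({"label": ["pos", "neg", np.nan]})

# Keys absent from the dict, including NaN, come out as NaN.
df["y"] = df["label"].map({"pos": 1.0, "neg": 0.0})
```

Because every result is either a float or NaN, the new column ends up as float64 rather than object.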

AMC
-1

You could modify your function so that it checks for None. If you don't want to do that, you could check for None (or NaN, depending on your needs) inside the apply call with a lambda function.

import pandas as pd
import numpy as np

df = pd.DataFrame({'label': [1, np.nan, 'neg', [2],
                             3, 'pos', 5, None,
                             np.nan, 'test']})

def convert_label(label):
    if "pos" == label:
        return 1.0
    elif "neg" == label:
        return 0.0
    else:
        return label

df.label.apply(lambda x: convert_label(x) if x is not np.nan else np.nan)
>>>
0       1
1    None
2       0
3     [2]
4       3
5       1
6       5
7    None
8    None
9    test
Name: label, dtype: object

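Or you could use Series.where(), which keeps the values where the condition is True and takes the rest from the second argument:

# keep missing values as they are, use the converted value everywhere else
df.label.where(df.label.isnull(), df.label.apply(convert_label))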
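
Another option that avoids the explicit check is Series.map() with na_action='ignore', which passes missing values through without calling the function on them. A minimal sketch, using the substring-based function from the question and a made-up Series:

```python
import numpy as np
import pandas as pd

def convert_label(label):
    # Substring version from the question; it would raise a TypeError
    # if it were ever called with a float NaN.
    if "pos" in label:
        return 1.0
    elif "neg" in label:
        return 0.0
    else:
        return label

# Made-up sample with one missing label.
labels = pd.Series(["pos", "neg", np.nan, "other"])

# na_action="ignore" skips NaN entries, so convert_label never sees them.
y = labels.map(convert_label, na_action="ignore")
```

The result stays NaN wherever the input was NaN, and the function itself needs no changes.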
Soerendip