Replacing with Nan

Question

I am trying to replace the placeholder '.' string with NaN in the total revenue column. This is the code used to create the df.

raw_data = {'Rank': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 
        'Company': ['Microsoft', 'Oracle', "IBM", 'SAP', 'Symantec', 'EMC', 'VMware', 'HP', 'Salesforce.com', 'Intuit'],
        'Company_HQ': ['USA', 'USA', 'USA', 'Germany', 'USA', 'USA', 'USA', 'USA', 'USA', 'USA'], 
        'Software_revenue': ['$62,014', '$29,881', '$29,286', '$18,777', '$6,138', '$5,844', '$5,520', '$5,082', '$4,820', '$4,324'], 
        'Total_revenue': ['93,456', '38,828', '92,793', '23,289', '6,615', ".", '6,035', '110,577', '5,274', '4,573'],
        'Percent_revenue_total': ['66.36%', '76.96%', '31.56%', '80.63%', '92.79%', '23.91%', '91.47%', '4.60%', '91.40%', '94.55%']}
df = pd.DataFrame(raw_data, columns = ['Rank', 'Company', 'Company_HQ', 'Software_revenue', 'Total_revenue', 'Percent_revenue_total'])
df

I have tried using:

import numpy as np

df['Total_revenue'] = df['Total_revenue'].replace('.', np.nan, regex=True)
df

However, this replaces the entire column with Nan instead of just the placeholder '.' value.

df['Total_revenue'] = df['Total_revenue'].replace('.', 'nan', regex=True) , check https://stackoverflow.com/questions/46944650/replace-value-by-using-regex-to-np-nan — BENY, Sep 16 '19 at 01:38
First test if you can filter out the rows with “.” In them. And then set those values in the target column to “nan”. — CypherX, Sep 16 '19 at 01:38

score 0 · Answer 1 · answered Sep 16 '19 at 01:40

You only need to fix the regex=False. Because when you set it to True you are assuming the passed-in is a regular expression, setting it to False will treat the pattern as a literal string (which is what I believe you want):

import pandas as pd
raw_data = {'Rank': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 
        'Company': ['Microsoft', 'Oracle', "IBM", 'SAP', 'Symantec', 'EMC', 'VMware', 'HP', 'Salesforce.com', 'Intuit'],
        'Company_HQ': ['USA', 'USA', 'USA', 'Germany', 'USA', 'USA', 'USA', 'USA', 'USA', 'USA'], 
        'Software_revenue': ['$62,014', '$29,881', '$29,286', '$18,777', '$6,138', '$5,844', '$5,520', '$5,082', '$4,820', '$4,324'], 
        'Total_revenue': ['93,456', '38,828', '92,793', '23,289', '6,615', ".", '6,035', '110,577', '5,274', '4,573'],
        'Percent_revenue_total': ['66.36%', '76.96%', '31.56%', '80.63%', '92.79%', '23.91%', '91.47%', '4.60%', '91.40%', '94.55%']}
df = pd.DataFrame(raw_data, columns = ['Rank', 'Company', 'Company_HQ', 'Software_revenue', 'Total_revenue', 'Percent_revenue_total'])

import numpy as np

df['Total_revenue'] = df['Total_revenue'].replace('.', np.nan, regex=False)
print(df)

Output:

  Rank         Company Company_HQ Software_revenue Total_revenue Percent_revenue_total
0     1       Microsoft        USA          $62,014        93,456                66.36%
1     2          Oracle        USA          $29,881        38,828                76.96%
2     3             IBM        USA          $29,286        92,793                31.56%
3     4             SAP    Germany          $18,777        23,289                80.63%
4     5        Symantec        USA           $6,138         6,615                92.79%
5     6             EMC        USA           $5,844           NaN                23.91%
6     7          VMware        USA           $5,520         6,035                91.47%
7     8              HP        USA           $5,082       110,577                 4.60%
8     9  Salesforce.com        USA           $4,820         5,274                91.40%
9    10          Intuit        USA           $4,324         4,573                94.55%

score 0 · Answer 2 · edited Jun 20 '20 at 09:12

I went one step further here and changed the column type to numeric, so you can also use if for calculations.

df.Total_revenue  = pd.to_numeric(df.Total_revenue.str.replace(',',''),errors='coerce').astype('float')

df.Total_revenue

0     93456.0
1     38828.0
2     92793.0
3     23289.0
4      6615.0
5         NaN
6      6035.0
7    110577.0
8      5274.0
9      4573.0
Name: Total_revenue, dtype: float64

score 0 · Answer 3 · answered Sep 16 '19 at 01:52

. is special character in regex reprensent any character. You need escape it to make regex consider it as regular char

df['Total_revenue'].replace('\.', np.nan, regex=True)

Out[52]:
0     93,456
1     38,828
2     92,793
3     23,289
4      6,615
5        NaN
6      6,035
7    110,577
8      5,274
9      4,573
Name: Total_revenue, dtype: object

In your case, you should use mask

df['Total_revenue'].mask(df['Total_revenue'].eq('.'))

Out[58]:
0     93,456
1     38,828
2     92,793
3     23,289
4      6,615
5        NaN
6      6,035
7    110,577
8      5,274
9      4,573
Name: Total_revenue, dtype: object

Dev Khadka · Answer 4 · 2019-09-16T08:04:59.723

0

In my opinion "replace" is not required as user wanted to change "." Whole to nan. Inistead this will also work. It finds rows with "." And assign nan to it

df.loc[df['Total_revenue']==".", 'Total_revenue'] = np.nan

edited Sep 16 '19 at 08:04

answered Sep 16 '19 at 02:47

Dev Khadka

5,142
4
19
33

1

The provided answer was flagged for review as a Low Quality Post. Here are some guidelines for [How do I write a good answer?](https://stackoverflow.com/help/how-to-answer). This provided answer may be correct, but it could benefit from an explanation. Code only answers are not considered "good" answers. From [review](https://stackoverflow.com/review/low-quality-posts/24057438). – Trenton McKinney Sep 16 '19 at 03:29

score 0 · Answer 5 · answered Sep 16 '19 at 11:01

you can try below to apply your requirement to DataFrame

df.replace('.', np.nan)

or of you want to make if for specific column then use df['Total_revenue'] instead of df

where below is the output:

     Rank       Company Company_HQ Software_revenue Total_revenue Percent_revenue_total
0     1       Microsoft        USA          $62,014        93,456                66.36%
1     2          Oracle        USA          $29,881        38,828                76.96%
2     3             IBM        USA          $29,286        92,793                31.56%
3     4             SAP    Germany          $18,777        23,289                80.63%
4     5        Symantec        USA           $6,138         6,615                92.79%
5     6             EMC        USA           $5,844           NaN                23.91%
6     7          VMware        USA           $5,520         6,035                91.47%
7     8              HP        USA           $5,082       110,577                 4.60%
8     9  Salesforce.com        USA           $4,820         5,274                91.40%
9    10          Intuit        USA           $4,324         4,573                94.55%

Replacing with Nan

5 Answers5

I went one step further here and changed the column type to numeric, so you can also use if for calculations.