2

I have a dataframe as in the image. (the numbers are really objects). Doing df.treasury_rate = pd.to_numeric(df.treasury_rate), predictably bombs. However, doing df.replace('.', np.nan) does not appear to get rid of the dot, so I am flummoxed. Any suggestions?

UPDATE pd.to_numeric takes an errors keyword, which, when set to coerce does the right thing, however, I am still confused as to why the . is not getting replaced.

UPDATE 2 As text:

treasury_rate
1962-02-09 4.05
1962-02-10 4.05
1962-02-11 4.05
1962-02-12 .
1962-02-13 4.03
1962-02-14 4.03
1962-02-15 4.02 
1962-02-16 4.02
1962-02-17 4.02
1962-02-18 4.02
tdy
  • 36,675
  • 19
  • 86
  • 83
Igor Rivin
  • 4,632
  • 2
  • 23
  • 35

4 Answers4

4
df.replace('\.','0', regex=True,inplace=True)

I think you have to give regex for '.' character to replace it.

Nusrath
  • 499
  • 1
  • 4
  • 16
  • I am confused, still - if regex is not true, why would it (pandas) interpret a dot as a regex? Seems to be a bug... – Igor Rivin Feb 07 '19 at 04:04
  • It's not a bug, as @cullzie mentions in his answer below, the important part was using the parameter inplace=True otherwise results aren't stored to the df. – Brandon Bertelsen Feb 07 '19 at 04:12
  • @igor Rivin if regex is not true pandas interpreting it as only '.' char not as dot regex df0.replace('.','0'). where as inplace=True is just making change to your original dataframe , it works like df =df.replace('.','0') – Nusrath Feb 07 '19 at 06:22
3

In this case creating a mask / subset and replacing the string literal would be better than attempting to use replace. You'll probably also want to convert your series to numeric you can also do it all in one step if you know your data is clean other than the "." character:

subs = df.treasury_rate == "."
df.loc[subs, 'treasure_rate'] = np.nan
# OR
df.treasury_rate = pd.to_numeric('treasury_rate', errors='coerce')

You could likely run the last line and have the dots replaced with NaN

Brandon Bertelsen
  • 43,807
  • 34
  • 160
  • 255
2

I think you just need to add inplace=True if you want to update the df object inplace:

df.replace('.', np.nan, inplace=True)

Otherwise df.replace returns an new dataframe which you must assign to a variable to access:

new_df = df.replace('.', np.nan)
cullzie
  • 2,705
  • 2
  • 16
  • 21
0

Given OP's dataframe

import pandas as pd
import datetime as dt

df = pd.DataFrame({'treasury_rate': [4.05, 4.05, 4.05, ".", 4.03, 4.03, 4.02, 4.02, 4.02, 4.0]},
                    index=[dt.datetime(1962, 2, 9), dt.datetime(1962, 2, 10), dt.datetime(1962, 2, 11), dt.datetime(1962, 2, 12), dt.datetime(1962, 2, 13), dt.datetime(1962, 2, 14), dt.datetime(1962, 2, 15), dt.datetime(1962, 2, 16), dt.datetime(1962, 2, 17), dt.datetime(1962, 2, 18)])


[Out]:
           treasury_rate
1962-02-09          4.05
1962-02-10          4.05
1962-02-11          4.05
1962-02-12             .
1962-02-13          4.03
1962-02-14          4.03
1962-02-15          4.02
1962-02-16          4.02
1962-02-17          4.02
1962-02-18           4.0

There are various ways one can go about replacing the dot with NaN.

  1. Using regular expression (with Python's re)

    import re
    
    df_new = df.replace(re.compile('\.'), np.nan)
    
  2. Using a custom lambda function and np.nan

    import numpy as np
    
    df_new = df.treasury_rate.apply(lambda x: np.nan if x == '.' else x)
    
  3. Using pandas.DataFrame.replace and np.nan

    import numpy as np
    
    df_new = df.replace('.', np.nan)
    
    # or
    
    df_new = df.replace(to_replace=r'\.', value=np.nan, regex=True)
    

In all three cases the output is

            treasury_rate
1962-02-09           4.05
1962-02-10           4.05
1962-02-11           4.05
1962-02-12            NaN
1962-02-13           4.03
1962-02-14           4.03
1962-02-15           4.02
1962-02-16           4.02
1962-02-17           4.02
1962-02-18           4.00

Notes:

  • If one wants to do the operations on the same dataframe, one can use inplace=True such as

    df.replace(re.compile('\.'), np.nan)
    
  • Pandas has their own null value. So, instead of np.nan, one might want to use pd.NA. For more information on the difference between those two: pd.NA vs np.nan for pandas If one uses pd.NA, the output would be

               treasury_rate
    1962-02-09          4.05
    1962-02-10          4.05
    1962-02-11          4.05
    1962-02-12          <NA>
    1962-02-13          4.03
    1962-02-14          4.03
    1962-02-15          4.02
    1962-02-16          4.02
    1962-02-17          4.02
    1962-02-18           4.0
    
  • There are strong opinions on using .apply. If one wants to read some: When should I (not) want to use pandas apply() in my code?

Gonçalo Peres
  • 11,752
  • 3
  • 54
  • 83