3

I am attempting to parse Storm Data from NOAA. After some cleaning and parsing I am left with a DataFrame much like this:

import pandas as pd

data = { 'ID' : [1, 2, 3, 4, 5, 6],  
         'EVENT_TYPE': ['Flood', 'Hail', 'Fire', 'Tornado', 'Flood', 'Fire'],  
         'Property_Damage': ['2.5K', 0, '.4M', "1.00K", NaN, "1K"]}  

df = pd.DataFrame(data)  

For the record, this example DataFrame is just a simplification. The real DataFrame has approx. 25 columns and 22 million rows. I would like to convert the values in df['Property_Damage'] from strings to numeric values. My desired outcome would look similar to [2500, 0, 400000, 1000, 0, 1000].

I understand that I am making an assumption that NaN values can be replaced with 0. I tried to split records into multiple columns with

damage_property_split = df['Propery_Damage'].str.split([-1], expand=True) 

but this does not work with records that are 0 or NaN.

What is the best way to convert

['2.5K', 0, '.4M', "1.00K", NaN, "1K"] to [2500, 0, 400000, 1000, 0, 1000] ?

Thank you for any help!!!

Tommy
  • 695
  • 2
  • 10
  • 15
  • Be careful with the `NaN` values, they could mean that no data is available. Substituting them with a `0` could give wrong averages, etc. – user2314737 Feb 19 '18 at 19:46
  • 1
    This has been answered previously: https://stackoverflow.com/questions/39684548/convert-the-string-2-90k-to-2900-or-5-2m-to-5200000-in-pandas-dataframe The response at the bottom by ChiTownDataDude looks like what you need. – syntax_fire Feb 19 '18 at 19:43

1 Answers1

4

I'm a fan of this approach

mapping = dict(K='E3', M='E6', B='E9')

df.assign(Property_Damage=pd.to_numeric(
    df.Property_Damage.replace(mapping, regex=True)))

  EVENT_TYPE  ID  Property_Damage
0      Flood   1           2500.0
1       Hail   2              0.0
2       Fire   3         400000.0
3    Tornado   4           1000.0
4      Flood   5              NaN
5       Fire   6           1000.0

You can get your NaN filled with 0

mapping = dict(K='E3', M='E6', B='E9')

df.assign(Property_Damage=pd.to_numeric(
    df.Property_Damage.fillna(0).replace(mapping, regex=True)))

  EVENT_TYPE  ID  Property_Damage
0      Flood   1           2500.0
1       Hail   2              0.0
2       Fire   3         400000.0
3    Tornado   4           1000.0
4      Flood   5              0.0
5       Fire   6           1000.0
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • I have some "trouble maker" records that are throwing a monkey wrench into the approach you suggested. Would records that contain `'0K'` or `'0.00K'` cause errors? I keep getting `cannot parse "E3"` errors after running the `mapping` and `df.assign...` statements – Tommy Feb 19 '18 at 22:00
  • 1
    No `float('0E3')` parses fine. It's `float('E3')` that's an issue because you have blank strings in your data. Or maybe not. Not sure what it is. Find the trouble makers and let me know what they are. Lunch time. I'll check on you after I'm back. – piRSquared Feb 19 '18 at 22:03
  • Turns out there were several records that just contained `K`. Replaced those instances with zeros. After that, your suggestions worked like a champ. Thank you for all your help! – Tommy Feb 19 '18 at 23:04