
Attempt #1

s["order_id"].apply(lambda x: int(x) if pd.notnull(x) else np.nan)

Attempt #2

def to_int(x):
    if pd.notnull(x):
        return int(x)

Attempt #3

 s["order_id"] = s.loc[pd.notnull(s["order_id"]),"order_id].astype(int)

All of these return a Series where the values are still floats: NumPy-backed integer arrays cannot represent NaN, so as soon as the column contains a missing value pandas upcasts it back to float64.

I'm wondering if I could use the update function or take advantage of reindexing.
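For reference, here is a minimal sketch of how `Series.update` behaves, with toy data (the values are illustrative, not from the question): it aligns on the index and overwrites only the positions present in the other Series.

import numpy as np
import pandas as pd

# Toy data: update aligns on the index and overwrites only matching positions.
base = pd.Series([np.nan, np.nan, np.nan], index=[0, 1, 2], name="order_total")
lookup = pd.Series([206.50, 369.00], index=[0, 2])  # only indices 0 and 2

base.update(lookup)
print(base)
# 0    206.5
# 1      NaN
# 2    369.0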

Attempted solution leveraging indexing and update:

import numpy as np
import pandas as pd

null = np.nan  # the pasted JSON literal uses null
data = {"time":{"0":1528971021539,"1":1529289904697,"2":1529572773525,"3":1529892602301,"4":1530082881098,"5":1530069453264,"6":1528985491630,"7":1529236762719,"8":1529475504491,"9":1529814085541,"10":1529906568681,"11":1530160346468,"12":1529833559160,"13":1530051985183,"14":1530240956273,"15":1529794554495,"16":1529892989425,"17":1529386510176,"18":1529118607780,"19":1529404958912,"20":1529812956409,"21":1530012703548,"22":1527815420250,"23":1527826735070,"24":1527832343938,"25":1527853694229,"26":1527889066223,"27":1527986243670,"28":1528070794031,"29":1528149294729,"30":1528158483701,"31":1528172242288,"32":1528173686892,"33":1528174729282,"34":1528175624472,"35":1528184014365,"36":1528184994544,"37":1528199211274,"38":1528204822424,"39":1528236692102,"40":1528246124079,"41":1528251449061,"42":1528254158311,"43":1528324045380,"44":1528409837346,"45":1528429172972,"46":1528453372400,"47":1528525996756,"48":1528530493509,"49":1528539093472},"user_id":{"0":1754627236948496,"1":4702200191313171,"2":4778254911976758,"3":8293985621789157,"4":5156436454415407,"5":4445821205748907,"6":6872300957263521,"7":579402494860,"8":2010389994610194,"9":3378398685582335,"10":2923987501904097,"11":7254681572754712,"12":2280706641994510,"13":5853777483445659,"14":1790488830140089,"15":4649841298300342,"16":8296801793054868,"17":6074985077237804,"18":7512067556495704,"19":7449962479289671,"20":931159100938705,"21":4303206141550631,"22":4931136210605885,"23":910152652690726,"24":213367265258802,"25":59665205254502,"26":7375134691043656,"27":5112755499047871,"28":1511225869347102,"29":6553192205018264,"30":5758319280291333,"31":5654341500640968,"32":8149628703137465,"33":6808112291514009,"34":3363098540596606,"35":4205809380744263,"36":3662128280212665,"37":986809097179824,"38":3834989038766064,"39":3561701388137551,"40":3363098540596606,"41":7998995390673240,"42":188780187662080,"43":290955994841187,"44":7996996554339358,"45":2624074855751159,"46":8317830532715985,"47":4819555707307085,"48":6662202062763635,"49":1363740504674809},"order_id":{"0":1161.0,"1":1175.0,"2":1186.0,"3":1200.0,"4":1217.0,"5":1213.0,"6":1162.0,"7":1171.0,"8":1183.0,"9":1192.0,"10":1205.0,"11":1219.0,"12":1195.0,"13":1212.0,"14":1221.0,"15":1190.0,"16":1201.0,"17":1166.0,"18":1167.0,"19":1181.0,"20":1191.0,"21":1211.0,"22":null,"23":null,"24":null,"25":null,"26":null,"27":null,"28":null,"29":null,"30":null,"31":null,"32":null,"33":null,"34":null,"35":null,"36":null,"37":null,"38":null,"39":null,"40":null,"41":null,"42":null,"43":null,"44":null,"45":null,"46":null,"47":null,"48":null,"49":null}}
s = pd.DataFrame(data=data)

orders = {"order_id":{"0":1161,"1":1175,"2":1205,"3":1219,"4":1195,"5":1212,"6":1221,"7":1190,"8":1201,"9":1166,"10":1167,"11":1181,"12":1186,"13":1191,"14":1211,"15":1200,"16":1217,"17":1213,"18":1162,"19":1171,"20":1183,"21":1192},"order_total":{"0":"206.50","1":"369.00","2":"313.65","3":"158.74","4":"164.50","5":"156.83","6":"184.50","7":"137.50","8":"120.00","9":"85.00","10":"369.00","11":"156.83","12":"184.50","13":"191.25","14":"297.50","15":"180.00","16":"394.40","17":"75.00","18":"191.25","19":"386.33","20":"95.00","21":"200.00"}}
o = pd.DataFrame(data=orders)

orders = s.loc[pd.notnull(s["order_id"])].copy()  # .copy() avoids SettingWithCopyWarning
orders["order_id"] = orders["order_id"].astype(int)
s["order_total"] = np.nan
# a left merge preserves row order, so restoring orders' own index realigns the result with s
s.update(orders.merge(o, on="order_id", how="left").set_index(orders.index)["order_total"])
Yale Newman
  • You may want to have a look at https://stackoverflow.com/questions/759201/representing-integers-in-doubles. As long as your numbers are truly integers then you should have no issue using them as a join key for a merge even if they are stored as floats. There should not be a loss of precision at all, unless you need large integers. – ALollz Jun 30 '18 at 05:33
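A quick check of the claim in that comment (standard IEEE 754 behaviour): a float64 represents integers exactly up to 2**53 (about 9.0e15), so ids of the magnitude in this data still survive the round trip through float.

# IEEE 754 doubles store integers exactly up to 2**53; beyond that,
# consecutive integers are no longer representable.
x = 2 ** 53
print(float(x) == x)          # True: exactly representable
print(float(x + 1) == x + 1)  # False: 2**53 + 1 rounds to 2**53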

1 Answer


It is possible with a hack, but it is not recommended, because some functions may fail and performance is worse:

import numpy as np
import pandas as pd

s = pd.DataFrame({'order_id': [np.nan, 8, 9, 4, 2, 3]})

s["order_id"] = s["order_id"].astype(object)
print (s)
  order_id
0      NaN
1        8
2        9
3        4
4        2
5        3
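To illustrate the performance cost of the hack, here is a rough, machine-dependent measurement (the exact numbers are illustrative): arithmetic on an object column falls back to a Python-level loop instead of vectorized C.

import timeit

import numpy as np
import pandas as pd

# The same addition on an object column runs element by element in Python
# rather than vectorized in C, so it is substantially slower.
n = 1_000_000
as_float = pd.Series(np.arange(n, dtype="float64"))
as_object = as_float.astype(object)

print("float64:", timeit.timeit(lambda: as_float + 1, number=10))
print("object :", timeit.timeit(lambda: as_object + 1, number=10))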

Docs:

In the absence of high performance NA support being built into NumPy from the ground up, the primary casualty is the ability to represent NAs in integer arrays.

In [20]: s = pd.Series([1, 2, 3, 4, 5], index=list('abcde'))

In [21]: s
Out[21]: 
a    1
b    2
c    3
d    4
e    5
dtype: int64

In [22]: s.dtype
Out[22]: dtype('int64')

In [23]: s2 = s.reindex(['a', 'b', 'c', 'f', 'u'])

In [24]: s2
Out[24]: 
a    1.0
b    2.0
c    3.0
f    NaN
u    NaN
dtype: float64

In [25]: s2.dtype
Out[25]: dtype('float64')

This trade-off is made largely for memory and performance reasons, and also so that the resulting Series continues to be 'numeric'. One possibility is to use dtype=object arrays instead.
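As an aside (this did not exist when the question was asked): pandas 0.24+ ships a nullable integer extension dtype, spelled "Int64" with a capital I, that holds integers and missing values together without the object hack. Whether you can use it depends on your pandas version.

import numpy as np
import pandas as pd

# Requires pandas >= 0.24: the nullable "Int64" extension dtype keeps the
# column integer-typed while still allowing missing values.
s = pd.DataFrame({"order_id": [np.nan, 8, 9, 4, 2, 3]})
s["order_id"] = s["order_id"].astype("Int64")
print(s["order_id"].dtype)  # Int64 (missing values display as <NA> in recent versions)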

EDIT:

It is also possible to remove the NaNs with dropna and then convert to int, if the merge fails to join floats with integers:

orders = s.dropna(subset=['order_id']).copy()  # .copy() avoids SettingWithCopyWarning
orders['order_id'] = orders['order_id'].astype(int)
# or, to select just the one column as a Series (note: a Series has no merge method):
# orders = s.loc[pd.notnull(s["order_id"]), "order_id"].astype(int)

orders.merge(o, on="order_id", how="left")
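Applied to the question's `s` and `o` frames (built above), the merged result lines the totals up against the integer keys; a quick check, output abbreviated:

# Using the s and o DataFrames from the question.
merged = s.dropna(subset=["order_id"]).copy()
merged["order_id"] = merged["order_id"].astype(int)
merged = merged.merge(o, on="order_id", how="left")
print(merged[["order_id", "order_total"]].head(2))
#    order_id order_total
# 0      1161      206.50
# 1      1175      369.00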

EDIT1:

# build a lookup Series indexed by order_id; map then works like a VLOOKUP
orders = o.set_index('order_id')["order_total"]

s["order_total"] = s["order_id"].map(orders)
print (s.head(20))

             time           user_id  order_id order_total
0   1528971021539  1754627236948496    1161.0      206.50
1   1529289904697  4702200191313171    1175.0      369.00
10  1529906568681  2923987501904097    1205.0      313.65
11  1530160346468  7254681572754712    1219.0      158.74
12  1529833559160  2280706641994510    1195.0      164.50
13  1530051985183  5853777483445659    1212.0      156.83
14  1530240956273  1790488830140089    1221.0      184.50
15  1529794554495  4649841298300342    1190.0      137.50
16  1529892989425  8296801793054868    1201.0      120.00
17  1529386510176  6074985077237804    1166.0       85.00
18  1529118607780  7512067556495704    1167.0      369.00
19  1529404958912  7449962479289671    1181.0      156.83
2   1529572773525  4778254911976758    1186.0      184.50
20  1529812956409   931159100938705    1191.0      191.25
21  1530012703548  4303206141550631    1211.0      297.50
22  1527815420250  4931136210605885       NaN         NaN
23  1527826735070   910152652690726       NaN         NaN
24  1527832343938   213367265258802       NaN         NaN
25  1527853694229    59665205254502       NaN         NaN
26  1527889066223  7375134691043656       NaN         NaN
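On the map-versus-update question raised in the comments below: `map` is a single index lookup (essentially a VLOOKUP) and skips the intermediate merge, so it is usually simpler and should be faster. A rough way to check on your own data (timings are machine-dependent; the update variant mirrors the question's approach):

import timeit

# Rough, machine-dependent comparison on the question's s and o frames.
lookup = o.set_index("order_id")["order_total"]

def via_map():
    return s["order_id"].map(lookup)

def via_update():
    out = pd.Series(np.nan, index=s.index, dtype=object)
    nonnull = s.dropna(subset=["order_id"]).copy()
    nonnull["order_id"] = nonnull["order_id"].astype(int)
    merged = nonnull.merge(o, on="order_id", how="left").set_index(nonnull.index)
    out.update(merged["order_total"])
    return out

print("map:   ", timeit.timeit(via_map, number=100))
print("update:", timeit.timeit(via_update, number=100))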
jezrael
  • Would it make more sense then to only convert when running the merge? – Yale Newman Jun 30 '18 at 05:29
  • @ALollz I don't care about there being floats, I just need to have this column as my primary/foreign key for joins. – Yale Newman Jun 30 '18 at 05:31
  • @YaleNewman - I suggest 2 solutions - replace NaN with some integer like `s["order_id"] = s["order_id"].fillna(0).astype(int)`, or remove rows with NaNs like `s = s.dropna(subset=['order_id'])`. Using `astype(object)` is a hack. – jezrael Jun 30 '18 at 05:34
  • I would have another dataframe where there are no nulls and the order_id's are stored as ints. I'll post an attempt of what I have in my head in the question. – Yale Newman Jun 30 '18 at 05:38
  • I think it is best here to use `s1 = s.dropna(subset=['order_id'])` and then merge – jezrael Jun 30 '18 at 05:38
  • @YaleNewman - In my opinion the idea is nice. Do you need to convert all columns to int, or only `order_id`? Or is `s` a one-column `DataFrame`? – jezrael Jun 30 '18 at 05:45
  • order_id is the only column I need to convert, and its only purpose is to grab revenue data from another dataframe. I know the best solution would involve using indexing, I just forget the best way to approach it. I added an attempt using update but probably butchered the function. – Yale Newman Jun 30 '18 at 05:50
  • @YaleNewman - Hard to know exactly what you need without your data. If there is some problem, can you add some sample data with the expected output to the question? Thank you. – jezrael Jun 30 '18 at 05:56
  • I'm going to completely rewrite the question in a second with data and an example – Yale Newman Jun 30 '18 at 05:59
  • Have a look when you get a chance :) – Yale Newman Jun 30 '18 at 06:12
  • @YaleNewman - Can you check the solution? For me it works without casting to integers. Does it work for you too? – jezrael Jun 30 '18 at 06:17
  • I got it working with update. I had to preserve the index when merging, but I completely forgot that map is essentially a VLOOKUP. Definitely the best solution. Cheers – Yale Newman Jun 30 '18 at 06:21
  • Care to share why map would be preferred to update? Is it faster or just more readable? – Yale Newman Jun 30 '18 at 06:23
  • @YaleNewman - Best is to test it with real, large data, but in my opinion it should be faster. – jezrael Jun 30 '18 at 06:24