
My question is a little different from the question posted here, so I decided to open a new thread.

I have a pandas DataFrame with 5 columns. One of these columns holds a pandas Series in each cell. Here is sample code for creating the DataFrame:

import numpy as np
import pandas as pd
mydf1=pd.DataFrame(columns=['group','id','name','mail','gender'])
data = np.array([2540948, 2540955, 2540956,2540956,7138932])
x=pd.Series(data)    
mydf1.loc[0]=[1,x,'abc','abc@xyz.com','male']

I have another DataFrame; the code for creating it is given below:

mydf2=pd.DataFrame(columns=['group','id'])
data1 = np.array([2540948, 2540955, 2540956])
y=pd.Series(data1)
mydf2.loc[0]=[1,y]

This is sample data. The actual data will have a large number of rows, and the Series will be long too. I want to match mydf1 with mydf2 (sometimes there will be no matching row in mydf2). Where there is a match, I want to delete from id in mydf1 the values that are present in id in mydf2. For example, after the run, the id for group 1 will be 2540956, 7138932. I also tried the code from the link above, but for the first line

counts = mydf1.groupby('id').cumcount()

I got the error message TypeError: 'Series' objects are mutable, thus they cannot be hashed in Python 3.x. Can you please suggest how to solve this?

ayhan
Tanvi Mirza
  • Any suggestion please? – Tanvi Mirza Jan 14 '18 at 15:02
  • I need it very urgently.I will be glad if someone of you can suggest me a solution – Tanvi Mirza Jan 14 '18 at 16:19
  • Can you provide more data? I cannot tell what you want from your description. – Tai Jan 14 '18 at 16:53
  • How do you match? What's the criterion? Do you match by group or by id? – Tai Jan 14 '18 at 16:54
  • Hi @Tai, I will match by group, which is 1 here for both DataFrames. Sorry, I don't have more data. The group column contains unique values, and id is a pandas Series with a large number of values. Its length can be 10K or more – Tanvi Mirza Jan 14 '18 at 17:17
  • Does id need to be in order? And do you want to remove the first N items? – Tai Jan 14 '18 at 17:20
  • @Tai, no need for order in id, and not the first N items. Say id in mydf1 is 1,2,3,4,5,5,7,6,6,8 and in mydf2 is 1,2,5,6,6; then the resulting id will be 3,4,5,7,8. Please note that the id values will be 8-digit numbers in the original data, and each cell of the id column is a pandas Series object. I don't have any data with me currently. I'm expecting this work very soon and am preparing the code for it – Tanvi Mirza Jan 14 '18 at 17:27
  • Any suggestion please @Tai? – Tanvi Mirza Jan 14 '18 at 17:44
  • I posted my code. See it and let me know how it goes. – Tai Jan 14 '18 at 17:45
  • Thanks @Tai, sure, I will let you know – Tanvi Mirza Jan 14 '18 at 17:54

1 Answer

This should work. We use Counter to find the difference between the 2 lists of ids. (P.S. This problem does not require the difference to be in order.)
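As a quick illustration of the idea, here is a minimal sketch using the example ids given in the comments under the question. The variable names (ids_1, ids_2, remaining) are only placeholders for this illustration.

```python
from collections import Counter

# Example ids taken from the comments under the question.
ids_1 = [1, 2, 3, 4, 5, 5, 7, 6, 6, 8]
ids_2 = [1, 2, 5, 6, 6]

# Subtracting Counters removes one occurrence per match and drops
# every element whose count falls to zero or below.
remaining = list((Counter(ids_1) - Counter(ids_2)).elements())
print(remaining)  # [3, 4, 5, 7, 8]
```

Only one of the two 5s is removed, because 5 appears once in the second list; both 6s are removed because 6 appears twice there.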

Setup

import numpy as np
import pandas as pd
from collections import Counter
mydf1=pd.DataFrame(columns=['group','id','name','mail','gender'])
x = [2540948, 2540955, 2540956,2540956,7138932]
y = [2540948, 2540955, 2540956,2540956,7138932]
mydf1.loc[0]=[1,x,'abc','abc@xyz.com','male']
mydf1.loc[1]=[2,y,'def','def@xyz.com','female']

mydf2=pd.DataFrame(columns=['group','id'])
x2 = np.array([2540948, 2540955, 2540956])
y2 = np.array([2540955, 2540956])
mydf2.loc[0]=[1,x2]
mydf2.loc[1]=[2,y2]

Code

mydf3 = mydf1[["group", "id"]]
mydf3 = mydf3.merge(mydf2, how="inner", on="group")

new_id_finder = lambda x: list((Counter(x.id_x) - Counter(x.id_y)).elements())

mydf3["new_id"] = mydf3.apply(new_id_finder, 1)
mydf3[["group", "new_id"]]

Output

    group   new_id
0   1       [2540956, 7138932]
1   2       [2540948, 2540956, 7138932]

One Counter object can be subtracted from another to get the difference in the occurrences of elements. Then you can use the elements method to retrieve all the values that are left.
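The question mentions that a group in mydf1 may sometimes have no match in mydf2, and an inner merge drops such groups from the result. A possible variant, sketched here under assumptions, uses a left merge and leaves the ids of unmatched groups unchanged. The frames df1 and df2, the extra group 3, its ids, and the helper remaining_ids are all made up for this illustration.

```python
from collections import Counter

import pandas as pd

# Hypothetical data: group 3 exists only in df1 (an assumption for this sketch).
df1 = pd.DataFrame({
    "group": [1, 2, 3],
    "id": [
        [2540948, 2540955, 2540956, 2540956, 7138932],
        [2540948, 2540955, 2540956, 2540956, 7138932],
        [111, 222],
    ],
})
df2 = pd.DataFrame({
    "group": [1, 2],
    "id": [[2540948, 2540955, 2540956], [2540955, 2540956]],
})

# A left merge keeps every row of df1; unmatched groups get NaN in id_2.
merged = df1.merge(df2, how="left", on="group", suffixes=("_1", "_2"))


def remaining_ids(row):
    """Return the ids of id_1 that are left after removing those in id_2."""
    to_remove = row["id_2"]
    # NaN (no matching group) is not list-like, so keep the ids unchanged.
    if not pd.api.types.is_list_like(to_remove):
        return list(row["id_1"])
    return list((Counter(row["id_1"]) - Counter(to_remove)).elements())


# Build the column as a Series of lists so each cell holds one list.
merged["new_id"] = pd.Series(
    [remaining_ids(row) for _, row in merged.iterrows()],
    index=merged.index,
)
print(merged[["group", "new_id"]])
```

Groups 1 and 2 give the same new_id values as the inner merge, while group 3 keeps its original ids.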

Tai