1

Right now I am struggling with the following problem. I have in my dataframe two different IDs, the leading_ID and the follower_ID (String). The leading_ID is characterised by one unique ID. However the follower_ID has three different stages: first the ID can be assigned by None, second the ID can be assigned by one ID or third, the ID has multiple IDs which are separated by comma. The following table shows the structure specifically:

leading_ID   follower_ID
abcd         None
dfgh         cvnr,eee,rrrr
jrtz         brtz
vvvv         tttt,dddd
wwww         None
...          ...

My goal is separate all the follower_IDs and group them along the leading_ID. Therefore all follower_IDs that are assigned by None should be dropped.

leading_ID   follower_ID
dfgh         cvnr
dfgh         eee
dfgh         rrrr
jrtz         brtz
vvvv         tttt
vvvv         dddd
...          ...

By executing my code is seems that it is never ending or taking to long.

from  itertools import product
df1 = pd.DataFrame([j for i in df['follower_ID'].dropna().apply(lambda x: x.split(',')).values 
                      for j in product(*i)], columns=df.leading_ID)

Methods which are described in How do I unnest (explode) a column in a pandas DataFrame? have not worked.

Daniel
  • 552
  • 2
  • 9
  • 29

1 Answers1

2

Use:

from itertools import chain

#remove missing values
df = df.dropna()
#for better performance list comprehension with split
s = [x.split(', ') for x in df['follower_ID']]
#repeat values with flattening
df = pd.DataFrame({
     'user' : df['leading_ID'].repeat([len(x) for x in s]),
    'follower_ID' : list(chain.from_iterable(s))
})

print (df)
   user follower_ID
1  dfgh        cvnr
1  dfgh         eee
1  dfgh        rrrr
2  jrtz        brtz
3  vvvv        tttt
3  vvvv        dddd
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252