1

The df I am working on looks like this

co1   col2
A     ['1','2','er']
A     []
B     ['1','3','4','abc']
B     ['5']
C     [] 

I want to calculate the % of each element in the list in col2 for each value in col1. i.e calculate % of 1 for A, calculate % of 2 for A, calculate % of abc for B I am looking for a solution to do this iteratively. Thanks

The link to input data (before explode) - [https://drive.google.com/file/d/1fuOBo8PK1heAtfufBlplXXfh4FiLpBCD/view?usp=sharing][1]

The link to output after explode - [https://drive.google.com/file/d/1mcArrsu3TWJC6hYZ2kIHAkAzCaHd1DLH/view?usp=sharing][2]

Sid
  • 552
  • 6
  • 21

1 Answers1

2

I believe you need DataFrame.explode with DataFrame.dropna:

#changed data for better sample     
print (df)
  col1           col2
0    A      [1, 2, 1]
1    A             []
2    B  [3, abc, abc]
3    B          [abc]
4    C             []

df2 = df.explode('col2').dropna(subset=['col2'])
print (df2)
  col1 col2
0    A    1
0    A    2
0    A    1
2    B    3
2    B  abc
2    B  abc
3    B  abc

And then SeriesGroupBy.value_counts:

df2 = df2.groupby('col1')['col2'].value_counts(normalize=True).reset_index(name='%')
print (df2)
  col1 col2         %
0    A    1  0.666667
1    A    2  0.333333
2    B  abc  0.750000
3    B    3  0.250000

EDIT:

import ast

df = pd.read_csv('beforeexplode.csv')

df['col2'] = df['col2'].apply(ast.literal_eval)
df2 = df.explode('col2').dropna(subset=['col2'])
print (df2)
     col1     col2
0    dev1  android
1    dev1  android
2    dev3     oscp
2    dev3     gpen
2    dev3      ceh
..    ...      ...
206  dev2     wcag
207  dev2    linux
207  dev2     unix
208  dev2    linux
208  dev2     unix

[460 rows x 2 columns]
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • After explode I see that the values that are not present in col2 for a given value in col1 is also mapped against it. I mean the data is jumbled up. Any idea what might be causing it. – Sid Jun 13 '20 at 15:39
  • what I mean to say is that in your answer the col1 with value "C" doesnt have col2 value of "1" and "2" in the original df and it is not seen in the df after explode too. Where as I see this kind of non-existing relations built in my data after explode. can you plz help? – Sid Jun 16 '20 at 11:22
  • @Sid - Is possible add this new data to question? Or maybe create new question? – jezrael Jun 16 '20 at 11:24
  • Sure. There are more than 200 rows in my df, so have uploaded the before and after files my gdrive. Have added the links in the question – Sid Jun 16 '20 at 12:11
  • @Sid - Do you test this data and raised errors with solution above? – jezrael Jun 16 '20 at 12:12
  • @Sid - tested and working with convert to lists, like `df = pd.read_csv('beforeexplode.csv') df['col2'] = df['col2'].apply(ast.literal_eval) df2 = df.explode('col2').dropna(subset=['col2'])` – jezrael Jun 16 '20 at 12:15
  • I am getting ValueError: malformed node or string error at col2 – Sid Jun 16 '20 at 12:23
  • @Sid - Problem is in data `'beforeexplode.csv'` ? – jezrael Jun 16 '20 at 12:23
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/216055/discussion-between-sid-and-jezrael). – Sid Jun 16 '20 at 12:25
  • strangely, I applied the solution mentioned in the question of this post and it worked as expected. https://stackoverflow.com/questions/50217968/pandas-split-list-in-column-into-multiple-rows – Sid Jun 16 '20 at 13:12