0

I have a dataframe like this:

df1
   a  b  c
0  1  2  [bg10, ng45, fg56]
1  4  5  [cv10, fg56]
2  7  8  [bg10, ng45, fg56]
3  7  8  [fg56, fg56]
4  7  8  [bg10]

I would like to count the total occurences take place of each type in column 'c'. I would then like to return the value of column 'b' for the values in column 'c' that have a count total of '1'.

The expected output is soemthing like this:

           c    b   total_count
    0   bg10    2   2
    0   ng45    2   2
    0   fg56    2   5
    1   cv10    5   1
    1   fg56    5   5

I have tried the 'Collections' library, and a 'for' loop (I understand its not best practise in Pandas) but i think i'm missing some fundamental udnerstanding of lists within cells, and how to perform analysis like these.

Thank you for taking my question into consideration.

Ventoii
  • 46
  • 7
  • Hi, you can use `apply` function on the `c` column and combine it with `Counter` object. See this link for more help: [how can i count the occurrences of a list item](https://stackoverflow.com/questions/2600191/how-can-i-count-the-occurrences-of-a-list-item) – DavidDr90 Aug 06 '20 at 13:08
  • Hello, I have tried this but it doesnt return any data. Could you p[lease sahare a written example of applying this type of function? Thank you. – Ventoii Aug 06 '20 at 13:32
  • @DavidDr90 I added the point that i'd liek to return the total occurences adding up each row also. – Ventoii Aug 06 '20 at 13:36
  • 1
    Please specify your expected sample output so its clear for us to provide a proper solution. Is the `total` occurences to be found based on individual row lists or the entire column should be taken to consideration? – Anant Kumar Aug 06 '20 at 14:21

1 Answers1

0

I would use apply the following way:

first I create the df:

df1=pd.DataFrame({"b":[2,5,8,8], "c":[['bg10', 'ng45', 'fg56'],['cv10', 'fg56'],['bg10', 'ng45', 'fg56'],['fg56', 'fg56']]})

next use apply to count the number of (non unique) items in a list and save it in a different column:

df1["count_c"]=df1.c.apply(lambda x: len(x))

you will get the following:

     b             c        count_c
  0  2  [bg10, ng45, fg56]        3
  1  5        [cv10, fg56]        2
  2  8  [bg10, ng45, fg56]        3
  3  8        [fg56, fg56]        2

to get the lines when c larger than threshold:`

df1[df1["count_c"]>2]["b"]

note: if you want to count only unique values in each list in column c you should use:

df1["count_c"]=df1.c.apply(lambda x: len(set(x)))

EDIT


in order to count the total number of each item I would try this: first let's "unpack all the lists into columns

new_df1=(df1.c.apply(lambda x: pd.Series(x))).stack().reset_index(level=1,drop=True).to_frame("c").join(df1[["b"]],how="left")

then get the total counts of each item in the list and add it to a new col:

  counts_dict=new_df1.c.value_counts().to_dict()
  new_df1["total_count_c"]=new_df1.c.map(counts_dict)
  new_df1.head()
  
           c    b   total_count_c
    0   bg10    2   2
    0   ng45    2   2
    0   fg56    2   5
    1   cv10    5   1
    1   fg56    5   5
  
Rachel Shalom
  • 399
  • 2
  • 10