In Pandas, how do I create a dataframe from a count of items in a column that are separated by commas?

Question

In Python 3 with pandas, I have a dataframe in which each row contains information about a legal proceeding.

The "nome" column holds people's names and the "tipo" column holds the type of lawsuit; there are only two types, INQ and AP.

The "resumo" column lists the crimes under investigation in each proceeding. Each proceeding may involve one or more crimes, separated by ",":

Peculato,           Lavagem de Dinheiro
Corrupção passiva,  Ocultação de bens, Lavagem de dinheiro
Corrupção passiva,  Lavagem de dinheiro, Crimes Eleitorais
Crimes Eleitorais,  Lavagem de dinheiro
Peculato
Quadrilha ou Bando, Crimes da Lei de licitações, Peculato

I need to count:

For each name
Divided by INQ and AP processes
The number of occurrences of each individual crime (the items separated by ",")

Take the "resumo" example above and assume every row concerns the person "John Doe".

If the first two lines are of type AP and the remaining ones are INQ, then John Doe has:

1 AP for Peculato
2 AP for Lavagem de dinheiro
1 AP for Corrupção passiva
1 AP for Ocultação de bens

1 INQ for Corrupção passiva
2 INQ for Lavagem de dinheiro
2 INQ for Crimes Eleitorais
2 INQ for Peculato
1 INQ for Quadrilha ou Bando
1 INQ for Crimes da Lei de licitações

A sample of the rows looks like this:

df_selecao_atual[['tipo', 'resumo', 'nome']].head(5).to_dict()
{'tipo': {2: 'INQ', 3: 'AP', 4: 'INQ', 5: 'INQ', 6: 'AP'},
 'resumo': {2: 'Desvio de verbas públicas',
  3: 'Desvio de verbas públicas',
  4: nan,
  5: 'Prestação de contas rejeitada',
  6: 'Peculato, Gestão fraudulenta'},
 'nome': {2: 'CÉSAR MESSIAS',
  3: 'CÉSAR MESSIAS',
  4: 'FLAVIANO MELO',
  5: 'FLAVIANO MELO',
  6: 'FLAVIANO MELO'}}

For this data I already received an answer that worked very well, in the question "In pandas, how to count items between commas, dividing between column types?"

But now I need not only to print the result on screen but also to create a dataframe, like this:

nome                tipo    resumo              count
Fulano de tal       INQ     Peculato            4
Fulano de tal       INQ     Ocultação de Bens   1
Fulano de tal       INQ     Corrupção ativa     2
Fulano de tal       INQ     Investigação Penal  3
Fulano de tal       AP      Peculato            1
Fulano de tal       AP      Corrupção passiva   2
Beltrano da Silva   INQ     Peculato            2
Beltrano da Silva   INQ     Lavagem de dinheiro 5
Beltrano da Silva   AP      Lavagem de dinheiro 1

Please, does anyone know how I could create this dataframe?

jezrael · Accepted Answer · 2018-08-17T14:00:12.643

You can split the resumo column into a new Series, add it back to the original DataFrame with join, and then count with groupby and size:

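# Split each "resumo" string into one row per crime, keeping the original row index
s = (df.pop('resumo').str.split(',', expand=True)
       .stack()
       .reset_index(level=1, drop=True)
       .rename('resumo'))

# Join the crimes back onto nome/tipo and count each (nome, tipo, resumo) combination
df = df.join(s).groupby(['nome','tipo','resumo']).size().reset_index(name='count')
print (df)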
            nome tipo                         resumo  count
0  CÉSAR MESSIAS   AP      Desvio de verbas públicas      1
1  CÉSAR MESSIAS  INQ      Desvio de verbas públicas      1
2  FLAVIANO MELO   AP             Gestão fraudulenta      1
3  FLAVIANO MELO   AP                       Peculato      1
4  FLAVIANO MELO  INQ  Prestação de contas rejeitada      1
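
For readers on pandas 0.25 or later, the same split, unnest and count steps can also be sketched with `explode`. This is an alternative formulation, not the code from the answer; the DataFrame below is rebuilt from the sample rows in the question, and the name `counts` is only illustrative.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question (index 2..6, one missing "resumo").
df = pd.DataFrame({
    "tipo": ["INQ", "AP", "INQ", "INQ", "AP"],
    "resumo": ["Desvio de verbas públicas",
               "Desvio de verbas públicas",
               np.nan,
               "Prestação de contas rejeitada",
               "Peculato, Gestão fraudulenta"],
    "nome": ["CÉSAR MESSIAS", "CÉSAR MESSIAS",
             "FLAVIANO MELO", "FLAVIANO MELO", "FLAVIANO MELO"],
}, index=[2, 3, 4, 5, 6])

counts = (
    # Turn each string into a list of crimes, ignoring spaces around commas.
    df.assign(resumo=df["resumo"].str.strip().str.split(r"\s*,\s*"))
      .explode("resumo")            # one row per crime
      .dropna(subset=["resumo"])    # rows without any crime are not counted
      .groupby(["nome", "tipo", "resumo"])
      .size()
      .reset_index(name="count")
)
print(counts)
```

Because the original `df` is not modified here (no `pop`), it can still be used afterwards.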

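If you want to build on the Counter solution from the previous question:

from collections import Counter

# df here is the original DataFrame, before the pop/join above
s = df.dropna().groupby(['nome', 'tipo']).resumo.agg(', '.join).str.split(', ').agg(Counter)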
print (s)
nome           tipo
CÉSAR MESSIAS  AP              {'Desvio de verbas públicas': 1}
               INQ             {'Desvio de verbas públicas': 1}
FLAVIANO MELO  AP      {'Peculato': 1, 'Gestão fraudulenta': 1}
               INQ         {'Prestação de contas rejeitada': 1}
Name: resumo, dtype: object

df2 = (pd.DataFrame(s.values.tolist(), index=s.index)
         .stack()
         .astype(int)
         .reset_index(name='count')
         .rename(columns={'level_2':'resumo'}))
print (df2)
            nome tipo                         resumo  count
0  CÉSAR MESSIAS   AP      Desvio de verbas públicas      1
1  CÉSAR MESSIAS  INQ      Desvio de verbas públicas      1
2  FLAVIANO MELO   AP             Gestão fraudulenta      1
3  FLAVIANO MELO   AP                       Peculato      1
4  FLAVIANO MELO  INQ  Prestação de contas rejeitada      1
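
As a rough check against the "John Doe" example in the question, here is a minimal sketch that uses a plain Python loop with `Counter` rather than the chained pandas calls above. The frame `toy` and the names `rows` and `result` are hypothetical, and the sketch assumes "Lavagem de Dinheiro" is spelled with a lowercase "d" so that it matches the other rows.

```python
from collections import Counter

import pandas as pd

# The six "resumo" strings from the question, all assigned to "John Doe";
# the first two are AP and the rest are INQ.
# Assumption: "Lavagem de dinheiro" is spelled the same way in every row.
toy = pd.DataFrame({
    "nome": ["John Doe"] * 6,
    "tipo": ["AP", "AP", "INQ", "INQ", "INQ", "INQ"],
    "resumo": [
        "Peculato, Lavagem de dinheiro",
        "Corrupção passiva, Ocultação de bens, Lavagem de dinheiro",
        "Corrupção passiva, Lavagem de dinheiro, Crimes Eleitorais",
        "Crimes Eleitorais, Lavagem de dinheiro",
        "Peculato",
        "Quadrilha ou Bando, Crimes da Lei de licitações, Peculato",
    ],
})

rows = []
for (nome, tipo), group in toy.dropna(subset=["resumo"]).groupby(["nome", "tipo"]):
    # Count every comma-separated crime in this (nome, tipo) group.
    tally = Counter(
        crime.strip()
        for text in group["resumo"]
        for crime in text.split(",")
    )
    rows.extend(
        {"nome": nome, "tipo": tipo, "resumo": crime, "count": n}
        for crime, n in tally.items()
    )

result = pd.DataFrame(rows, columns=["nome", "tipo", "resumo", "count"])
print(result)
```

The resulting frame has ten rows, one per (tipo, crime) pair listed in the question.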

Thank you very much. But I had an error executing it. I put it up — Reinaldo Chaves, Aug 17 '18 at 15:07
Thank you again @jezrael. I have now noticed a small problem: the script treats strings as different when one has a leading space, like " Corrupção ativa" and "Corrupção ativa" — Reinaldo Chaves, Sep 04 '18 at 21:42
Please, is there a way to eliminate this space and count correctly? — Reinaldo Chaves, Sep 04 '18 at 21:43
@ReinaldoChaves - Does it work if you change `(df.pop('resumo').str.split(',', expand=True)` to `(df.pop('resumo').str.split(r',\s*', expand=True)`, splitting on `,` followed by zero or more whitespace characters? — jezrael, Sep 05 '18 at 05:45
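
The effect of the whitespace discussed in these comments can be illustrated with a small sketch. The three strings are made-up values, not rows from the real data, and the pattern also strips spaces before the comma, which goes slightly beyond the suggestion above.

```python
import pandas as pd

# Hypothetical values with inconsistent spacing around the commas.
resumo = pd.Series([
    "Peculato, Corrupção ativa",
    "Corrupção ativa,Peculato",
    " Corrupção ativa ,  Peculato",
])

# Splitting on a bare "," keeps the surrounding spaces,
# so each spacing variant is counted as a separate crime.
naive = resumo.str.split(",", expand=True).stack().value_counts()

# Strip the ends, then split on a comma with optional whitespace on both sides.
clean = (resumo.str.strip()
               .str.split(r"\s*,\s*", expand=True)
               .stack()
               .value_counts())

print(naive)
print(clean)
```

With the plain split there are five distinct labels; after normalising the whitespace only "Peculato" and "Corrupção ativa" remain, with three occurrences each.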

score 1 · Answer 2 · answered Aug 17 '18 at 14:00

Almost the same logic as jezrael's: turn each string into a list, unnest the list, then group by and compute the count.

newdf = (df.set_index(['nome','tipo'])['resumo']
           .str.split(',')
           .apply(pd.Series)
           .stack()
           .to_frame('resumo')
           .reset_index(level=[0,1]))
newdf['count'] = newdf.groupby(['nome','tipo','resumo'])['resumo'].transform('size')
newdf
Out[172]: 
            nome tipo                         resumo  count
0  CÉSAR MESSIAS  INQ      Desvio de verbas públicas      1
0  CÉSAR MESSIAS   AP      Desvio de verbas públicas      1
0  FLAVIANO MELO  INQ  Prestação de contas rejeitada      1
0  FLAVIANO MELO   AP                       Peculato      1
1  FLAVIANO MELO   AP             Gestão fraudulenta      1
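
One thing to keep in mind with `transform('size')`: it returns one value per input row, so a crime that occurs several times for the same person and type appears on several rows, each carrying the same count. The sketch below uses made-up, already unnested rows (the frame `long` and the other names are hypothetical) to show two ways of reducing this to one row per combination.

```python
import pandas as pd

# Hypothetical, already unnested data: "Peculato" occurs twice for the same nome/tipo.
long = pd.DataFrame({
    "nome": ["FLAVIANO MELO"] * 3,
    "tipo": ["AP"] * 3,
    "resumo": ["Peculato", "Gestão fraudulenta", "Peculato"],
})

# transform keeps the original number of rows and repeats the group size on each.
long["count"] = long.groupby(["nome", "tipo", "resumo"])["resumo"].transform("size")

# Option 1: drop the repeated rows afterwards.
summary = long.drop_duplicates().reset_index(drop=True)

# Option 2: aggregate with size() instead, which yields one row per group directly.
via_size = (long.groupby(["nome", "tipo", "resumo"])
                .size()
                .reset_index(name="count"))

print(long)
print(summary)
print(via_size)
```

Both options give two rows here, with a count of 2 for "Peculato" and 1 for "Gestão fraudulenta".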
