0

I have a data frame like this.

c_name      p_name    
  A           X
  B           Y
  B           A1
  C           ZX
  D           G4
  D           H9

I want frequency for each c_name with each p_name. I am getting proper output in R by using

data.frame(table(df1$c_name,df1$p_name))

But in python if I am applying pd.crosstab(df1['c_name'],df1['p_name']) it is giving the result but not in proper format.

My expectation is:

c_name      p_name    Freq
  A           X        1
  B           X        0
  B           X        0
  C           X        0
  D           X        0
  D           X        0
  A           Y        0
  B           Y        1
  B           Y        0
  C           Y        0
  D           Y        0
  D           Y        0  ..........so on.

Thanks in advance.

M--
  • 25,431
  • 8
  • 61
  • 93
tom
  • 35
  • 1
  • 6
  • Cross tabulation should give you frequencies per group, which is probably not what you want. Have you tried **pandas.Series.value_counts** ? – noumenal Apr 13 '17 at 10:17
  • Possible duplicate of [Frequency table for a single variable](http://stackoverflow.com/questions/12207326/frequency-table-for-a-single-variable) – noumenal Apr 13 '17 at 10:17
  • Are you looking also for a solution in R (or only in Python)? – jogo Apr 13 '17 at 10:58

2 Answers2

2
pd.crosstab(df['c_name'], df['p_name']).stack().reset_index(name='Freq')

This will give:

   c_name p_name  Freq
0       A     A1     0
1       A     G4     0
2       A     H9     0
3       A      X     1
4       A      Y     0
5       A     ZX     0
6       B     A1     1
7       B     G4     0
8       B     H9     0
9       B      X     0
10      B      Y     1
11      B     ZX     0
12      C     A1     0
13      C     G4     0
14      C     H9     0
15      C      X     0
16      C      Y     0
17      C     ZX     1
18      D     A1     0
19      D     G4     1
20      D     H9     1
21      D      X     0
22      D      Y     0
23      D     ZX     0
languitar
  • 6,554
  • 2
  • 37
  • 62
0

Solution with groupby and size, if need also missing values add reindex and replace them with 0:

mux = pd.MultiIndex.from_product([df1['c_name'], df1['p_name']], names=['c_name','p_name'])
df1 = df1.groupby(['c_name','p_name']).size()
         .reindex(mux, fill_value=0).reset_index(name='Freq')

print (df1)
   c_name p_name  Freq
0       A      X     1
1       A      Y     0
2       A     A1     0
3       A     ZX     0
4       A     G4     0
5       A     H9     0
6       B      X     0
7       B      Y     1
8       B     A1     1
9       B     ZX     0
10      B     G4     0
11      B     H9     0
12      B      X     0
13      B      Y     1
14      B     A1     1
15      B     ZX     0
16      B     G4     0
17      B     H9     0
18      C      X     0
19      C      Y     0
20      C     A1     0
21      C     ZX     1
22      C     G4     0
23      C     H9     0
24      D      X     0
25      D      Y     0
26      D     A1     0
27      D     ZX     0
28      D     G4     1
29      D     H9     1
30      D      X     0
31      D      Y     0
32      D     A1     0
33      D     ZX     0
34      D     G4     1
35      D     H9     1

Timings:

Solution is faster, because no stack:

In [197]: %timeit pd.crosstab(df1['c_name'], df1['p_name']).stack().reset_index(name='Freq')
100 loops, best of 3: 6.74 ms per loop

In [198]: %timeit  df1.groupby(['c_name','p_name']).size().reindex(pd.MultiIndex.from_product([df1['c_name'], df1['p_name']], names=['c_name','p_name']), fill_value=0).reset_index(name='Freq')
100 loops, best of 3: 3.12 ms per loop
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252