How to count dataframe from both columns

Question

I have dataframe and want to count how many times the name appears in both columns

data=pd.DataFrame({'TEAM 1':['Mark','Peter','Andy','Tony'],'Team 2':
                  ['Andy','Tony','Jhon','Peter']})

so the name Andy would count as 2, while Jhon is 1
Expected Output

Mark 1
Andy 2
Tony 2
Jhon 1
Peter 2

I used this code but it does not work

data.groupby('TEAM 1')['Team 2'].count()

Sayandip Dutta · Answer 1 · 2020-08-02T18:31:48.447

8

Use stack and value_counts:

>>> data.stack().value_counts()
Andy     2
Tony     2
Peter    2
Jhon     1
Mark     1

As pointed out in Ch3steR's comment, converting the df to numpy.array then flattening it using ravel before calling pd.value_counts yields around 2X faster results:

>>> pd.value_counts(data.to_numpy().ravel())
Andy     2
Tony     2
Peter    2
Jhon     1
Mark     1
dtype: int64

Benchmarking:

>>> data = pd.concat([data] * 1000000)   # 4_000_000 rows

>>> %timeit data.stack().value_counts()
1.21 s ± 27.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit pd.value_counts(data.to_numpy().ravel())
667 ms ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

UPDATE: As per anky's comment collections.Counter proves to be even faster:

>>> %timeit pd.Series(Counter(np.ravel(data)))
501 ms ± 4.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

edited Aug 02 '20 at 18:31

answered Aug 02 '20 at 07:14

Sayandip Dutta

15,602
4
23
52

1

Nice answer, alternative using *NumPy* `n,c = np.unique(df.to_numpy(), return_counts=True); pd.Series(c, index=n)` don't if this is efficient or not. – Ch3steR Aug 02 '20 at 08:09
From [this post](https://stackoverflow.com/a/43096495/12416453) `np.unique` over large array is slower than `pd.value_counts`. – Ch3steR Aug 02 '20 at 08:19
`pd.value_counts(df.to_numpy().ravel())` was 2x faster than using `df.stack().value_counts()` over df of shape (400K,2) – Ch3steR Aug 02 '20 at 08:25
`np.unique` is much slower for larger dataframes, I will add the `ravel` one to the answer if you don't mind. Or if you want to add that as an answer, that's fine as well. – Sayandip Dutta Aug 02 '20 at 08:30
1

Sure, commented for that ;) `np.unique` is slower for large dataframes [here's bencmarking](https://stackoverflow.com/a/43096495/12416453) – Ch3steR Aug 02 '20 at 08:32
`%timeit Counter(np.ravel(data))` proves even faster for me after you have imported the same `from collections import Counter` – anky Aug 02 '20 at 18:00

How to count dataframe from both columns

1 Answers1