0

I have dataframe and want to count how many times the name appears in both columns

data=pd.DataFrame({'TEAM 1':['Mark','Peter','Andy','Tony'],'Team 2':
                  ['Andy','Tony','Jhon','Peter']})

so the name Andy would count as 2, while Jhon is 1
Expected Output

Mark 1
Andy 2
Tony 2
Jhon 1
Peter 2

I used this code but it does not work

data.groupby('TEAM 1')['Team 2'].count()

hansonmbbop
  • 83
  • 12

1 Answers1

8

Use stack and value_counts:

>>> data.stack().value_counts()
Andy     2
Tony     2
Peter    2
Jhon     1
Mark     1

As pointed out in Ch3steR's comment, converting the df to numpy.array then flattening it using ravel before calling pd.value_counts yields around 2X faster results:

>>> pd.value_counts(data.to_numpy().ravel())
Andy     2
Tony     2
Peter    2
Jhon     1
Mark     1
dtype: int64

Benchmarking:

>>> data = pd.concat([data] * 1000000)   # 4_000_000 rows

>>> %timeit data.stack().value_counts()
1.21 s ± 27.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit pd.value_counts(data.to_numpy().ravel())
667 ms ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

UPDATE: As per anky's comment collections.Counter proves to be even faster:

>>> %timeit pd.Series(Counter(np.ravel(data)))
501 ms ± 4.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Sayandip Dutta
  • 15,602
  • 4
  • 23
  • 52
  • 1
    Nice answer, alternative using *NumPy* `n,c = np.unique(df.to_numpy(), return_counts=True); pd.Series(c, index=n)` don't if this is efficient or not. – Ch3steR Aug 02 '20 at 08:09
  • From [this post](https://stackoverflow.com/a/43096495/12416453) `np.unique` over large array is slower than `pd.value_counts`. – Ch3steR Aug 02 '20 at 08:19
  • `pd.value_counts(df.to_numpy().ravel())` was 2x faster than using `df.stack().value_counts()` over df of shape (400K,2) – Ch3steR Aug 02 '20 at 08:25
  • `np.unique` is much slower for larger dataframes, I will add the `ravel` one to the answer if you don't mind. Or if you want to add that as an answer, that's fine as well. – Sayandip Dutta Aug 02 '20 at 08:30
  • 1
    Sure, commented for that ;) `np.unique` is slower for large dataframes [here's bencmarking](https://stackoverflow.com/a/43096495/12416453) – Ch3steR Aug 02 '20 at 08:32
  • `%timeit Counter(np.ravel(data))` proves even faster for me after you have imported the same `from collections import Counter` – anky Aug 02 '20 at 18:00