0

I need to run a statistic on some data. See how many times a values "j" is next to a value "i". The code that I put hereafter is a gross simplification of what I need to to, but it contains the problem I have.

Let's say that you have this data frame.

import numpy as np
import pandas as pd

a_df=pd.DataFrame({"a_col":np.random.randint(10, size=1000), "b_col":np.random.randint(10, size=1000)})

I generate a matrix that will contain our statistics:

res_matrix=np.zeros((10, 10))

by looking at res_matrix[i][j] we will know how many times the number "j" was next to the number "i" in our data frame.

I know that "for loops" are bad in pandas, but again, this is a simplification. I generate a sub-table for the value "i" and on this table I ran "value_counts()" on the column "b_col".

for i in a_df["a_col"].unique():
    temp_df=a_df[a_df["a_col"]==i]
    table_count=temp_df["b_col"].value_counts()
    for val,cnt in table_count.iteritems():
        res_matrix[i][val]+=int(cnt)

is there an efficient way to populate res_matrix without changing the topmost for loop? I am thinking something like list comprehension, but I cannot wrap my mind around it.

Please, focus ONLY on these two lines:

    for val,cnt in table_count.iteritems():
        res_matrix[i][val]+=int(cnt)

I can't use groupby because my project requires many more operations on the dataframe.

Fabrizio
  • 927
  • 9
  • 20
  • The answers this question is receiving appear to be duplicates of the answers here: [Groupby value counts on the dataframe pandas](https://stackoverflow.com/q/39132742/15497888) – Henry Ecker Oct 24 '21 at 15:28
  • The problem is that everybody is suggesting a groupby solution, whereas I explicitly said that I need to keep the first for loop and optimize the res_matrix population. I can't use the solutions with groupby because my example is a simplification of the real project I am working on. – Fabrizio Oct 24 '21 at 15:52
  • 1
    Which is why I did not close this as a duplicate despite the answers you are receiving. You may want to reach out to your answerers and explain that. – Henry Ecker Oct 24 '21 at 15:53
  • 1
    @Fabrizio Sorry it wasn't immediately clear from the question... Added an option to do it while keeping the outer loop. See the *Update* part in my answer – perl Oct 25 '21 at 18:08

2 Answers2

2

There's a function crosstab in pandas that does just that:

pd.crosstab(a_df['a_col'], a_df['b_col'])

Output:

b_col   0   1   2   3   4   5   6   7   8   9
a_col                                        
0      10  10  10  12  14   9  10   5  13  16
1      16   9  13  14  14   8   4  11   9  12
2      10   8  12  13   9  12  13   7  10   5
3      11   7  10  17   6   9   6   8   7  14
4       9   8   4   5   7  13  12   8  11   6
5      14   9   8  15   6  10  12   9   7   9
6      11  13  10   9   7   5   8  11  13  21
7       8   9  11   8   8  10  11  15  10  12
8       6  17  11   4  12   9   6  10  10  13
9      12   6  14   3  11  11   7   5  14  14

Update: if the outer loop must remain for other reasons, you can set values in res_matrix inside the loop:

res_matrix = np.zeros((10, 10))

for i in a_df["a_col"].unique():
    temp_df=a_df[a_df["a_col"]==i]
    table_count=temp_df["b_col"].value_counts()

    # set values in res_matrix
    res_matrix[i, table_count.index] = table_count
perl
  • 9,826
  • 1
  • 10
  • 22
0

Don't loop, this is slow. If you think there is a good reason to loop, please explain it and provide an appropriate example.

Here is another method.

You can groupby both columns and get the group size, then unstack to get a 2D shape:

a_df.groupby(['a_col', 'b_col']).size().unstack()

output:

b_col   0   1   2   3   4   5   6   7   8   9
a_col                                        
0      16   2   4  11   9  13  11  11   8   6
1      10  12   7   6   6  11  10   8   2  12
2       9  12  10  22  12  13   8  11   9   8
3      13  11  11  14   7  11   9   7   8  14
4      14   7  17   5   8   6  15   8  11   8
5      10  12   7  14   6  16  11  12   6   8
6      13  10   9  12  11  14   8  10   6   8
7       9  12  12   9  11   9   8  14   5  12
8       7   8   9   8  10  14   9   8   8  18
9      13   6  13  11  13  11   8   7  11  11
mozway
  • 194,879
  • 13
  • 39
  • 75