I need to compute a statistic on some data: how many times a value "j" appears next to a value "i". The code below is a gross simplification of what I need to do, but it contains the problem I have.
Let's say that you have this data frame.
import numpy as np
import pandas as pd
a_df=pd.DataFrame({"a_col":np.random.randint(10, size=1000), "b_col":np.random.randint(10, size=1000)})
I generate a matrix that will contain our statistics:
res_matrix=np.zeros((10, 10))
By looking at res_matrix[i][j] we will know how many times the number "j" (in "b_col") was next to the number "i" (in "a_col"), that is, in the same row of our data frame.
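To make the meaning of res_matrix concrete, here is a minimal sketch on made-up data that can be checked by hand. The names toy_df and toy_matrix and the five rows are illustrative assumptions, not part of the real project.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: only the values 0..2, so the counts are easy to verify by eye.
toy_df = pd.DataFrame({"a_col": [0, 0, 1, 2, 0], "b_col": [1, 1, 2, 0, 2]})
toy_matrix = np.zeros((3, 3))

# Count every (a_col, b_col) pair that shares a row.
for i, j in zip(toy_df["a_col"], toy_df["b_col"]):
    toy_matrix[i][j] += 1

print(toy_matrix)
```

The pair (0, 1) occurs in two rows, so toy_matrix[0][1] ends up as 2, while every pair that never occurs stays at 0.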
I know that "for loops" are bad in pandas, but again, this is a simplification. I generate a sub-table for the value "i", and on this table I run "value_counts()" on the column "b_col".
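for i in a_df["a_col"].unique():
    # Rows where a_col equals the current value i
    temp_df = a_df[a_df["a_col"] == i]
    # How often each b_col value occurs in those rows
    table_count = temp_df["b_col"].value_counts()
    # Series.iteritems() was removed in pandas 2.0; items() is the replacement
    for val, cnt in table_count.items():
        res_matrix[i][val] += int(cnt)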
Is there an efficient way to populate res_matrix without changing the topmost for loop? I am thinking of something like a list comprehension, but I cannot wrap my mind around it.
Please, focus ONLY on these two lines:
    for val, cnt in table_count.items():
        res_matrix[i][val] += int(cnt)
I can't use groupby because my project requires many more operations on the dataframe.
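For reference, one possible direction is sketched below. It keeps the outer loop untouched and replaces only the two inner lines with a single NumPy fancy-indexing assignment. It assumes that the values in "b_col" are non-negative integers smaller than the matrix width, as in the example data, and it is a sketch rather than a definitive solution.

```python
import numpy as np
import pandas as pd

a_df = pd.DataFrame({"a_col": np.random.randint(10, size=1000),
                     "b_col": np.random.randint(10, size=1000)})
res_matrix = np.zeros((10, 10))

for i in a_df["a_col"].unique():
    temp_df = a_df[a_df["a_col"] == i]
    table_count = temp_df["b_col"].value_counts()
    # The index of table_count holds the distinct b_col values and its values
    # hold the counts. Because the index entries are unique, one fancy-indexed
    # "+=" updates every column of row i at once.
    res_matrix[i, table_count.index.to_numpy()] += table_count.to_numpy()
```

This works only because value_counts() returns each value once; with repeated indices, "+=" through fancy indexing would not accumulate and np.add.at would be needed instead.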