Create dataframe from itertools product

Question

I have two lists:

a = [1,2,3]
b = [4,5,6]

I want to build a dataframe in which each combination (i, j) of elements from a and b produces an intermediate dataframe X. I pick out the max value of X, and the final output should have the elements of a and b as its columns and rows.

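import itertools
import pandas as pd

rows = []

for i, j in itertools.product(a, b):
    X = do_something(i, j)  # this is a dataframe
    x_value = X.max()
    rows.append((i, j, x_value))  # list.append takes a single argument, so pass one tuple

df = pd.DataFrame(rows, columns=['a', 'b', 'x_value'])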

The output dataframe should have columns as a, rows as b, and values as x_value.

Does `func` take scalar `i` and `j`? So it has to be applied iteratively either before or after creating the dataframe? — hpaulj, Aug 06 '18 at 17:25
No the function is very complicated, but the end result for each iteration (i, j) is a dataframe with many columns being produced. I then choose a value from a column — user44840, Aug 06 '18 at 17:26
There are two issues, 1) generating `x_value` for the cartesian product of `a` and `b`, and 2) arranging the values in a Dataframe with `a` and `b` columns and rows. Your code does 1) fine, but makes a different dataframe, one with 3 columns and 9 rows. But the data is all there. — hpaulj, Aug 06 '18 at 22:19

pault · Accepted Answer · 2018-08-06T17:38:11.190

2

IIUC, you want to know how to go from a list of (i, j, x) values to a DataFrame where i gives the columns, j the index, and x the values.

For example, if you had:

import itertools

a = [1,2,3]
b = [4,5,6]
func = lambda i, j: i+j
result = [(i, j, func(i,j)) for i, j in itertools.product(a, b)]
print(result)
#[(1, 4, 5),
# (1, 5, 6),
# (1, 6, 7),
# (2, 4, 6),
# (2, 5, 7),
# (2, 6, 8),
# (3, 4, 7),
# (3, 5, 8),
# (3, 6, 9)]

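One way to turn this into a DataFrame is to use collections.defaultdict. Because itertools.product varies j fastest, each list d[i] ends up in the same order as b, so b can serve as the index directly: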

from collections import defaultdict
import pandas as pd

d = defaultdict(list)

for i, j, x in result:
    d[i].append(x)

df = pd.DataFrame(d, index=b)
print(df)
#   1  2  3
#4  5  6  7
#5  6  7  8
#6  7  8  9
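
As an alternative sketch (not part of the approach above), the same list of (i, j, x) tuples could be reshaped with pandas' `DataFrame.pivot`, which does not depend on the order in which the tuples were generated. The column names `a`, `b` and `x_value` are taken from the question, and `func` is just the toy function from this answer, standing in for the real computation.

```python
import itertools
import pandas as pd

a = [1, 2, 3]
b = [4, 5, 6]
func = lambda i, j: i + j  # toy stand-in for the real computation

# Long format: one row per (a, b) pair
result = [(i, j, func(i, j)) for i, j in itertools.product(a, b)]
long_df = pd.DataFrame(result, columns=['a', 'b', 'x_value'])

# Wide format: b becomes the index, a becomes the columns
grid = long_df.pivot(index='b', columns='a', values='x_value')
print(grid)
```

The long 3-column frame is kept around as well, which may be handy if you later want to filter or group by a or b.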

edited Aug 06 '18 at 17:38

answered Aug 06 '18 at 17:09

pault

I'm simplifying the func(x), what if x is more complicated than just a simple addition? – user44840 Aug 06 '18 at 17:10
It has to be itertools, as each (i,j) generates a dataframe (temp) and I pick out a particular value in temp – user44840 Aug 06 '18 at 17:18
Every (i, j) creates a dataframe X, from which I pick out the max value – user44840 Aug 06 '18 at 17:25
@user44840 I've modified my answer based on your latest updates – pault Aug 06 '18 at 19:38
Fits best with the aim of creating a 2d df – user44840 Aug 07 '18 at 11:35

score 2 · Answer 2 · answered Aug 06 '18 at 17:09

IIUC

df=pd.DataFrame(columns=a,index=b)
df.apply(lambda x : x.index+x.name)
Out[189]: 
   1  2  3
4  5  6  7
5  6  7  8
6  7  8  9

BENY

score 0 · Answer 3 · answered Aug 06 '18 at 17:11

If the function can operate on arrays, you can avoid itertools.product and get the same result with numpy broadcasting:

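import numpy as np
import pandas as pd

a = [1,2,3]
b = [4,5,6]
# b varies down the rows, a across the columns, matching index=b and columns=a
arr = np.array(b).reshape(-1, 1) + np.array(a).reshape(1, -1)
df = pd.DataFrame(arr, columns=a, index=b)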
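
Since i + j is symmetric, the sample values cannot reveal whether rows and columns have been swapped. A quick way to check the orientation, sketched here with an asymmetric stand-in function 2*i + 0.5*j (an assumption for illustration, not the asker's real function), is to reshape b into a column vector and a into a row vector so that the array shape matches index=b and columns=a:

```python
import numpy as np
import pandas as pd

a = [1, 2, 3]
b = [4, 5, 6]

# Column vector of b (one row per b value), row vector of a (one column per a value)
B = np.array(b).reshape(-1, 1)
A = np.array(a).reshape(1, -1)

# Asymmetric stand-in: f(i, j) = 2*i + 0.5*j, with i taken from a and j from b
arr = 2 * A + 0.5 * B

df = pd.DataFrame(arr, columns=a, index=b)
print(df)
```

Looking up a single cell such as df.loc[6, 1] and comparing it with f(1, 6) computed by hand is enough to confirm the layout.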

PMende

score 0 · Answer 4 · answered Aug 07 '18 at 04:33

In [134]: a=[1,2,3]
In [135]: b=[4,5,6]

Your list of 'indices' and values:

In [140]: alist = []
In [142]: for i,j in itertools.product(a,b):
     ...:     v = i*2 + j*.5
     ...:     alist.append([i,j,v])
     ...:     
In [143]: alist
Out[143]: 
[[1, 4, 4.0],
 [1, 5, 4.5],
 [1, 6, 5.0],
 [2, 4, 6.0],
 [2, 5, 6.5],
 [2, 6, 7.0],
 [3, 4, 8.0],
 [3, 5, 8.5],
 [3, 6, 9.0]]

A 3-column dataframe from that:

In [144]: df = pd.DataFrame(alist, columns=['a','b','value'])
In [145]: df
Out[145]: 
   a  b  value
0  1  4    4.0
1  1  5    4.5
2  1  6    5.0
3  2  4    6.0
4  2  5    6.5
5  2  6    7.0
6  3  4    8.0
7  3  5    8.5
8  3  6    9.0

One way of using the same data to make a 'grid' dataframe:

In [147]: pd.DataFrame(np.array(alist)[:,2].reshape(3,3), columns=a, index=b)
Out[147]: 
     1    2    3
4  4.0  4.5  5.0
5  6.0  6.5  7.0
6  8.0  8.5  9.0

Oops, that maps the rows and columns wrong; let's transpose the 3x3 array:

In [149]: pd.DataFrame(np.array(alist)[:,2].reshape(3,3).T, columns=a, index=b)
Out[149]: 
     1    2    3
4  4.0  6.0  8.0
5  4.5  6.5  8.5
6  5.0  7.0  9.0

I know numpy well; my experience with pandas is limited. I'm sure there are other ways of constructing such a frame. My guess is that if your value function is complex enough, the iteration mechanism will have a minor effect on the overall run time. Simply evaluating your function for each cell will take up most of the time.
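
If the function really has to be called once per (i, j) pair, one possible sketch is to fill the grid cell by cell. Here `do_something` and its `value` column are hypothetical placeholders that only exist to make the example run; the real function and column name would come from your own code.

```python
import itertools
import pandas as pd

a = [1, 2, 3]
b = [4, 5, 6]

def do_something(i, j):
    # Hypothetical placeholder for the expensive computation;
    # it returns a small DataFrame so that the example runs.
    return pd.DataFrame({'value': [i * 2, j * 0.5, i * 2 + j * 0.5]})

# Empty grid with b as the index and a as the columns
grid = pd.DataFrame(index=b, columns=a, dtype=float)

for i, j in itertools.product(a, b):
    X = do_something(i, j)
    # Take a scalar from one column; X.max() on a whole DataFrame returns a Series
    grid.loc[j, i] = X['value'].max()

print(grid)
```

Writing each result straight into grid.loc[j, i] avoids any dependence on iteration order, at the cost of one label lookup per cell, which should be negligible next to an expensive function.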

If your function can be written to take arrays, rather than scalars, then the values can be easily calculated without iteration. For example:

In [171]: I,J = np.meshgrid(b,a,indexing='ij')
In [172]: X = J*2 + I*.5
In [173]: X
Out[173]: 
array([[4. , 6. , 8. ],
       [4.5, 6.5, 8.5],
       [5. , 7. , 9. ]])
In [174]: I
Out[174]: 
array([[4, 4, 4],
       [5, 5, 5],
       [6, 6, 6]])
In [175]: J
Out[175]: 
array([[1, 2, 3],
       [1, 2, 3],
       [1, 2, 3]])
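
To finish the job, the meshgrid result can be wrapped in a dataframe. The following is a minimal sketch that reuses the same toy expression; with indexing='ij' and b passed first, the rows of X already follow b and the columns follow a, so no transpose is needed.

```python
import numpy as np
import pandas as pd

a = [1, 2, 3]
b = [4, 5, 6]

# I varies along the rows (values of b), J along the columns (values of a)
I, J = np.meshgrid(b, a, indexing='ij')
X = J * 2 + I * 0.5

df = pd.DataFrame(X, index=b, columns=a)
print(df)
```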
