I have a big pandas dataframe DF which takes up nearly all my RAM. Its index is always one of the following: 'c1','c2','c3'.
0 1 2 ... n
i
c1 1 8 15
c2 2 9 16
c1 3 10 17
c3 4 11 18
c2 5 12 19
c1 6 13 20
c3 7 14 21
...
I also have a dictionary in which the keys 'match' the dataframe index labels. The values are lists/arrays with length n (equal to number of dataframe columns).
d = {'c1':[10,10,10,...],'c2':[100,100,100,...],'c3':[1000,1000,1000,...]}
len(d[<allkeys>]) == n >>> True
What I want to do is a simple dataframe multiplication based on an 'attribute' of the row -- in this case it's its index, but it could be a value from a column. The result would look like this:
0 1 2
i
c1 10 80 150
c2 200 900 1600
c1 30 100 170
c3 4000 11000 18000
c2 500 1200 1900
c1 60 130 200
c3 7000 14000 21000
I don't want to build a second dataframe DF2 in a way that I could just do DF*DF2, because I don't have enough RAM and because such dataframe seems pointless to me, meaning that it would just be a repetition of arrays/lists (please correct me if I'm wrong).
I thought about having a second dataframe like this:
0 1 2
c1 10 10 10
c2 100 100 100
c3 1000 1000 1000
But when I do DF * DF2, somehow the index gets sorted and I don't know how this can be avoided. Result:
0 1 2
c1 100 800 1500
c1 300 1000 1700
c1 600 1300 2000
c2 20000 90000 160000
c2 50000 120000 190000
c3 4000000 11000000 18000000
c3 7000000 14000000 21000000
If you think there's another way to approach this problem (different data structures,libraries,etc) that would also be appreciated.
Code for samples:
df = pd.DataFrame({0: [1, 2, 3, 4, 5, 6, 7],1: [8, 9, 10, 11, 12, 13, 14], 2: [15, 16, 17, 18, 19, 20, 21]}, index=pd.Index(['c1', 'c2', 'c1', 'c3', 'c2', 'c1', 'c3'], dtype='object', name='i'))
d = {'c1':[10,10,10],'c2':[100,100,100],'c3':[1000,1000,1000]}
df2 = pd.DataFrame.from_dict(d,orient='index')