2

I need help reshaping one column (the 'Break' column) of a CSV file that looks like this:

Axe Break
1   ww
2   ee
3   qq
4   xx
5   dd
5   gg
4   hh
6   tt
9   yy
1   uu
1   ii
2   oo
5   pp
4   mm
5   kk
5   ll
7   mm
2   bb
7   pp
0   zz

into a matrix form like this

[[ww,ee,qq,xx,dd,gg,hh,tt,yy,uu],
 [ii,oo,pp,mm,kk,ll,mm,bb,pp,zz]]

using pandas.

I found a question that looks similar to what I want to ask, but I think it is a little different from what I want to do:

Reshaping the third column of a CSV file into a matrix

I have been going through the pandas tutorial but could not find a way to do this.

Thank you for your help.

Fang

3 Answers

3

First create a column for the new index with cumsum, then pivot and reindex the columns, and finally convert the result to a NumPy array with values:

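# start a new group every time Axe equals 1
df['g'] = (df.Axe == 1).cumsum()
# the parentheses let the method chain continue on the next line
df = (df.pivot(index='g', columns='Axe', values='Break')
        .reindex(columns=list(range(1, 10)) + [0]))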

print (df)
Axe   1   2   3   4   5   6   7   8   9   0
g                                          
1    ww  ee  qq  xx  dd  gg  hh  tt  yy  uu
2    ii  oo  pp  mm  kk  ll  mm  bb  pp  zz

print (df.values)
[['ww' 'ee' 'qq' 'xx' 'dd' 'gg' 'hh' 'tt' 'yy' 'uu']
 ['ii' 'oo' 'pp' 'mm' 'kk' 'll' 'mm' 'bb' 'pp' 'zz']]
jezrael
  • Thank you for your suggestion. It works when there are no duplicate entries, but what if there are? The "Axe" column has duplicate entries. – Fang Feb 08 '17 at 06:53
  • The solution is similar; you only need to change `df['g'] = (df.Axe == 1).cumsum()` to other code that returns the same value for each group. – jezrael Feb 08 '17 at 07:43
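For the duplicate case raised in the comments above, one possible approach is to label rows by position instead of by the value of Axe. This is only a sketch, not the answerer's code: it assumes every output row holds a fixed number of consecutive values, and the names `row_len` and `pos` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Data from the question; note that 'Axe' contains duplicates.
df = pd.DataFrame({
    "Axe": [1, 2, 3, 4, 5, 5, 4, 6, 9, 1, 1, 2, 5, 4, 5, 5, 7, 2, 7, 0],
    "Break": ["ww", "ee", "qq", "xx", "dd", "gg", "hh", "tt", "yy", "uu",
              "ii", "oo", "pp", "mm", "kk", "ll", "mm", "bb", "pp", "zz"],
})

row_len = 10  # assumption: each output row holds 10 consecutive values

# Positional labels: 'g' is the output row, 'pos' the column inside that row.
df["g"] = np.arange(len(df)) // row_len
df["pos"] = np.arange(len(df)) % row_len

# Every (g, pos) pair is unique, so pivot no longer trips over duplicates.
wide = df.pivot(index="g", columns="pos", values="Break")
matrix = wide.to_numpy()
print(matrix)
```

Because the labels depend only on row order, the values in Axe no longer matter here; if the groups really are defined by Axe, a different rule for `g` would be needed.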
2

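You can use reshape. On current pandas versions, call it on the underlying NumPy array (Series.reshape is no longer available) and use integer division for the column count.

In [702]: df['Break'].values.reshape(2, len(df.index) // 2)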
Out[702]:
array([['ww', 'ee', 'qq', 'xx', 'dd', 'gg', 'hh', 'tt', 'yy', 'uu'],
       ['ii', 'oo', 'pp', 'mm', 'kk', 'll', 'mm', 'bb', 'pp', 'zz']], dtype=object)
Zero
  • Thank you for your suggestion. It worked, but I then tried to add another row (the question has only 2 rows) so that there are 3. I changed `reshape(2, len(df.index)/2)` to `reshape(3, len(df.index)/3)`, but it returns an error saying `ValueError: total size of new array must be unchanged`. What exactly does the number 2 mean? – Fang Feb 08 '17 at 06:47
  • In `reshape(m, n)`, `m*n` must equal the number of elements, so you need to pick the dimensions accordingly. – Zero Feb 08 '17 at 06:50
  • Thank you for your explanation. I can manipulate the shape now. – Fang Feb 08 '17 at 07:01
  • @Fang `reshape(m, -1)` will ensure that the `m` is respected and the `-1` indicates that the other dimension will be what it needs to be. – piRSquared Feb 08 '17 at 07:52
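The size rule discussed in these comments can be checked with a small sketch. It uses the 20 values from the question in a plain NumPy array; the variable names are illustrative only.

```python
import numpy as np

values = np.array(
    ["ww", "ee", "qq", "xx", "dd", "gg", "hh", "tt", "yy", "uu",
     "ii", "oo", "pp", "mm", "kk", "ll", "mm", "bb", "pp", "zz"],
    dtype=object,
)

two_rows = values.reshape(2, -1)   # -1 lets NumPy infer 10 columns
four_rows = values.reshape(4, -1)  # 20 is divisible by 4, so 5 columns

try:
    values.reshape(3, -1)          # 20 is not divisible by 3
    three_rows_ok = True
except ValueError:
    three_rows_ok = False

print(two_rows.shape, four_rows.shape, three_rows_ok)
```

With 20 elements, a 3-row shape is only possible after dropping or padding values so that the total becomes a multiple of 3.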
2

Using the values attribute drops down to NumPy, where reshape accepts -1 for the dimension that should be inferred.

df.Break.values.reshape(2, -1)

array([['ww', 'ee', 'qq', 'xx', 'dd', 'gg', 'hh', 'tt', 'yy', 'uu'],
       ['ii', 'oo', 'pp', 'mm', 'kk', 'll', 'mm', 'bb', 'pp', 'zz']], dtype=object)
piRSquared
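Putting the pieces together, a minimal end-to-end sketch might look like the following. It assumes the file is whitespace-separated, as shown in the question, and uses an inline string (`csv_text`, a stand-in for the real file) so that it runs without any file on disk.

```python
import io

import pandas as pd

# Stand-in for the real CSV file; with an actual file, pass its path instead.
csv_text = """Axe Break
1 ww
2 ee
3 qq
4 xx
5 dd
5 gg
4 hh
6 tt
9 yy
1 uu
1 ii
2 oo
5 pp
4 mm
5 kk
5 ll
7 mm
2 bb
7 pp
0 zz
"""

# Assumption: columns are separated by one or more whitespace characters.
df = pd.read_csv(io.StringIO(csv_text), sep=r"\s+")

# Take the 'Break' column as a NumPy array and split it into 2 rows.
matrix = df["Break"].to_numpy().reshape(2, -1)
print(matrix)
```

`to_numpy()` is used here in place of the `values` attribute; for this column both return the same object array.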