
I have a dataframe called 'bal'. It looks like this:

              ano   id   unit period
business_id                         
9564         2012  302  sdasd  anual
9564         2011  303  sdasd  anual
2361         2013  304  sdasd  anual
2361         2012  305  sdasd  anual
...

I'm running the following code on it:

bal=bal.merge(bal.pivot(columns='ano', values='id'),right_index=True,left_index=True)

My intention is to turn that into something like this:

               ano    id  unit    period  2006  2007  2008  2009  2010  \
business_id
72            2013   774  sdasd   anual   NaN   NaN   NaN   NaN   NaN
72            2012   775  sdasd   anual   NaN   NaN   NaN   NaN   NaN
74            2012  1120  sdasd   anual   NaN   NaN   NaN   NaN   NaN
119           2013   875  sdasd   anual   NaN   NaN   NaN   NaN   NaN
119           2012   876  sdasd   anual   NaN   NaN   NaN   NaN   NaN
...

When I run that code, I get this error:

ValueError: Index contains duplicate entries, cannot reshape

So to avoid duplicates, I added a drop_duplicates line:

bal=bal.drop_duplicates()
bal=bal.merge(bal.pivot(columns='ano', values='id'),right_index=True,left_index=True)

When I run the code, voilà, I get the same error:

ValueError: Index contains duplicate entries, cannot reshape
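A minimal illustration with made-up rows that share `business_id` and `ano` but differ in `id`; such rows are not full-row duplicates, so `drop_duplicates()` keeps both and `pivot` still fails:

```python
import pandas as pd

# Two made-up rows that share business_id and ano but differ in id:
# they are not full-row duplicates, so drop_duplicates() keeps both
bal = pd.DataFrame(
    {"business_id": [9564, 9564],
     "ano": [2012, 2012],
     "id": [302, 299],
     "unit": ["sdasd", "sdasd"],
     "period": ["anual", "anual"]}
).set_index("business_id")

bal = bal.drop_duplicates()
print(len(bal))  # 2: both rows survive

try:
    bal.pivot(columns="ano", values="id")
    raised = False
except ValueError:  # Index contains duplicate entries, cannot reshape
    raised = True
```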

Am I doing something wrong or misunderstanding something?

EDIT

bal is a dataframe I'm creating from a SQL query using the following code:

bal=pd.read_sql('select * from table;',connection).set_index('business_id')[['ano','id','unit','period']]

The weird thing is that if I limit the SQL query, it works fine:

bal=pd.read_sql('select * from table limit 1000;',connection).set_index('business_id')[['ano','id','unit','period']]

I thought the problem could be related to the fact that the index has a lot of duplication (as you can see in the example above). However, if I run print(bal.head(4)) on this limited bal, it looks exactly the same as the one you can see above, with indexes that repeat.
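One way to check this on the real frame is to count rows per (business_id, ano) pair, since that is what `pivot` needs to be unique. A sketch, rebuilding a small hypothetical frame in place of the `read_sql` call:

```python
import pandas as pd

# Hypothetical stand-in for the frame returned by read_sql
bal = pd.DataFrame(
    {"business_id": [9564, 9564, 9564, 2361, 2361],
     "ano": [2012, 2012, 2011, 2013, 2012],
     "id": [302, 299, 303, 304, 305],
     "unit": ["sdasd"] * 5,
     "period": ["anual"] * 5}
).set_index("business_id")

# pivot() needs each (business_id, ano) pair to be unique;
# list the pairs that are not (business_id is an index level name)
counts = bal.groupby(["business_id", "ano"]).size()
dupes = counts[counts > 1]
print(dupes)
```

Running this on the full query versus the limited one would show whether the extra rows introduce colliding pairs.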

  • can you give a bit more information on what `bal` contains? screenshot `bal.head()` perhaps – nanojohn Aug 22 '17 at 19:21
  • just did it @nanojohn – aabujamra Aug 22 '17 at 19:31
  • 2
    Thanks, can you describe what you are trying to accomplish with this `merge` on a `pivot` of itself. It's not clear. The reason you are have an error is because you need to drop duplicates based on the index, whereas the `.drop_duplicates()` function works by dropping rows where all columns contain the same value – nanojohn Aug 22 '17 at 19:33
  • Thanks @nanojohn, but I believe that's not exactly the problem. I'm creating bal out of a SQL with a query select * from table. However, if I do select * from table limit 1000 it somehow works fine (and in this case it does repeat the index also). I will add that info to the question. – aabujamra Aug 22 '17 at 19:40
  • @abutremutante, can you post your desired data set? – MaxU - stand with Ukraine Aug 22 '17 at 19:47
  • @MaxU just did it – aabujamra Aug 22 '17 at 19:53

2 Answers


UPDATE2:

qry = "select distinct business_id,ano,id,unit,period from table where period='anual'"
bal=pd.read_sql(qry, connection, index_col=['business_id'])

assume we get the following DF (still with duplicated values in the ano column):

In [167]: bal
Out[167]:
              ano   id   unit period
business_id
9564         2012  302  sdasd  anual
9564         2012  299  sdasd  anual
9564         2011  303  sdasd  anual
2361         2013  304  sdasd  anual
2361         2012  305  sdasd  anual

we can do this:

In [169]: bal.join(bal.pivot_table(index=bal.index, columns='ano',
                                   values='id', aggfunc='first'))
Out[169]:
              ano   id   unit period   2011   2012   2013
business_id
2361         2013  304  sdasd  anual    NaN  305.0  304.0
2361         2012  305  sdasd  anual    NaN  305.0  304.0
9564         2012  302  sdasd  anual  303.0  302.0    NaN
9564         2012  299  sdasd  anual  303.0  302.0    NaN
9564         2011  303  sdasd  anual  303.0  302.0    NaN
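For reference, a self-contained sketch of the same approach, building the sample frame in code instead of reading it from SQL (the values are the made-up ones above):

```python
import pandas as pd

# The sample frame above, built locally instead of via read_sql
bal = pd.DataFrame(
    {"business_id": [9564, 9564, 9564, 2361, 2361],
     "ano": [2012, 2012, 2011, 2013, 2012],
     "id": [302, 299, 303, 304, 305],
     "unit": ["sdasd"] * 5,
     "period": ["anual"] * 5}
).set_index("business_id")

# aggfunc='first' keeps one id per (business_id, ano) pair,
# so the reshape no longer hits duplicate index entries
wide = bal.join(bal.pivot_table(index=bal.index, columns="ano",
                                values="id", aggfunc="first"))
print(wide)
```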

UPDATE:

consider the following sample DF:

In [161]: bal
Out[161]:
              ano   id   unit period
business_id
9564         2012  302  sdasd  anual
9564         2012  299  sdasd  anual   # I've intentionally added this row with a duplicated `ano`
9564         2011  303  sdasd  anual
2361         2013  304  sdasd  anual
2361         2012  305  sdasd  anual

reproducing your error:

In [162]: bal.pivot(columns='ano', values='id')
...
skipped
...
ValueError: Index contains duplicate entries, cannot reshape

Old answer:

Is that what you want?

In [144]: bal.join(bal.pivot(columns='ano', values='id'))
Out[144]:
              ano   id   unit period   2011   2012   2013
business_id
2361         2013  304  sdasd  anual    NaN  305.0  304.0
2361         2012  305  sdasd  anual    NaN  305.0  304.0
9564         2012  302  sdasd  anual  303.0  302.0    NaN
9564         2011  303  sdasd  anual  303.0  302.0    NaN
MaxU - stand with Ukraine
  • that's it @MaxU, but if I use that same join sentence you used the same thing happens: ValueError: Index contains duplicate entries, cannot reshape – aabujamra Aug 22 '17 at 19:49
  • @abutremutante, the problem is that you have duplicates in the `ano` column. Try to add the following row to your sample df: `[9564 2012 299 sdasd anual]` This drives `pivot` crazy as it doesn't know what to do with the duplicated column names... – MaxU - stand with Ukraine Aug 22 '17 at 19:55
  • I think you nailed it. drop_duplicates shouldn't kill those dups? – aabujamra Aug 22 '17 at 19:59
  • @abutremutante, you will need to decide which rows should remain... What kind of SQL DB are you using? It would be easier/faster to do that on the SQL side... – MaxU - stand with Ukraine Aug 22 '17 at 20:00
  • I'm an SQL rookie. Could you post how would that SQL query look like? – aabujamra Aug 22 '17 at 20:09
  • @abutremutante, will do after you will say what kind of SQL DB you are using... ;-) Could you also post a desired data set for the sample data set from my updated answer (which reproduces your error)? – MaxU - stand with Ukraine Aug 22 '17 at 20:10
  • sure: mysql Ver 14.14 Distrib 5.7.16. The desired data set would be the one in your old answer however I would first filter only the rows where period == anual (to avoid dups) and to the remaining dups (that do exist, just tested here) I would just flip a coin to choose which one to keep (first, last, random, any, as long as we kill the dups) – aabujamra Aug 22 '17 at 20:35
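Following up on the comments, a pandas-side sketch of the coin flip: `drop_duplicates` with a `subset` keys only on `business_id` and `ano`, and `keep='first'` makes the arbitrary choice of which row survives (the sample values are made up):

```python
import pandas as pd

# Hypothetical frame with a duplicated (business_id, ano) pair
bal = pd.DataFrame(
    {"business_id": [9564, 9564, 9564, 2361, 2361],
     "ano": [2012, 2012, 2011, 2013, 2012],
     "id": [302, 299, 303, 304, 305],
     "unit": ["sdasd"] * 5,
     "period": ["anual"] * 5}
).set_index("business_id")

# Key the dedup on just these two columns, so rows differing only
# in id are also removed; keep='first' is the coin flip
deduped = (bal.reset_index()
              .drop_duplicates(subset=["business_id", "ano"], keep="first")
              .set_index("business_id"))

# With unique (business_id, ano) pairs, the original pivot+merge works
wide = deduped.merge(deduped.pivot(columns="ano", values="id"),
                     right_index=True, left_index=True)
```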

Consider using unstack() and merge() - this will take care of the duplicate issue.

import pandas as pd

# sample data
data = {"business_id": [9564, 9564, 2361, 2361],
        "ano": [2012, 2011, 2013, 2012],
        "id": [302, 303, 304, 305],
        "unit": ["sdasd"]*4,
        "period": ["anual"]*4}
df = pd.DataFrame(data)
# include ano for MultiIndex
df.set_index(["business_id","ano"], inplace=True)

df
                   id period   unit
business_id ano                    
9564        2012  302  anual  sdasd
            2011  303  anual  sdasd
2361        2013  304  anual  sdasd
            2012  305  anual  sdasd

Now unstack(), grab the id data, and merge(). The inner-most level is unstacked, which is why we added ano to the index above.

df.merge(df.unstack()['id'], right_index=True, left_index=True)
                   id period   unit   2011   2012   2013
business_id ano                                         
9564        2012  302  anual  sdasd  303.0  302.0    NaN
            2011  303  anual  sdasd  303.0  302.0    NaN
2361        2013  304  anual  sdasd    NaN  305.0  304.0
            2012  305  anual  sdasd    NaN  305.0  304.0
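Note that if the real data holds duplicate (`business_id`, `ano`) pairs, as in the question, `unstack()` raises the same ValueError as `pivot()`. A sketch of one way to handle that, dropping duplicated index entries first (`keep='first'` is an arbitrary choice of which row survives):

```python
import pandas as pd

# Same sample data plus one extra row duplicating (9564, 2012)
data = {"business_id": [9564, 9564, 9564, 2361, 2361],
        "ano": [2012, 2012, 2011, 2013, 2012],
        "id": [302, 299, 303, 304, 305],
        "unit": ["sdasd"] * 5,
        "period": ["anual"] * 5}
df = pd.DataFrame(data).set_index(["business_id", "ano"])

# unstack() would hit "Index contains duplicate entries" here, so
# drop duplicated (business_id, ano) entries first, keeping the first
df = df[~df.index.duplicated(keep="first")]
wide = df.merge(df.unstack()["id"], right_index=True, left_index=True)
```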
andrew_reece