1

I have a table that contains the columns Date, Identifier, and Price e.g.

| Identifier |   Date   | Price |
|------------|----------|-------|
|    693477AA|1990/10/31|    100|
|    353477ZB|1991/08/31|    101|
|    123457ZB|1992/08/31|    105|

I'm using pandas read_sql function to grab the data from a SQL Server database. Either using SQL or pandas DataFrame functionality I need to get the data into the following pandas DataFrame format.

           693477AA     353477ZB     123457ZB
Date           
1988-1-1        NaN          NaN        99.41
1988-1-2     100.54          NaN        98.11
1988-1-3      99.45          NaN          NaN 

So that there is a (possibly Null) price entry for every DISTINCT Date in the table. For the set of Identifiers satisfying a condition.

Right now I have it working with a for loop,

data = []
identifiers = "SELECT DISTINCT Identifier FROM TABLE WHERE [Condition]"
for id in identifiers:
    data.append("SELECT Date, Price FROM TABLE WHERE Identifier=[id] ORDER BY DATE")
pandas.concat(data, axis=1)

However this only works for very strict [Condition] since the table is quite large (>3M rows).

How can I implement SQL, DataFrame manipulation, or a combination of both to achieve the desired format?

Thanks.

bphi
  • 3,115
  • 3
  • 23
  • 36

2 Answers2

6

we can use pivot() function:

In [144]: df.pivot(index='Date', columns='Identifier', values='Price').rename_axis(None, 1)
Out[144]:
            123457ZB  353477ZB  693477AA
Date
1990/10/31       NaN       NaN     100.0
1991/08/31       NaN     101.0       NaN
1992/08/31     105.0       NaN       NaN

or DataFrame.unstack():

In [149]: df.set_index(['Date','Identifier'])['Price'].unstack('Identifier')
Out[149]:
Identifier  123457ZB  353477ZB  693477AA
Date
1990/10/31       NaN       NaN     100.0
1991/08/31       NaN     101.0       NaN
1992/08/31     105.0       NaN       NaN

or crosstab():

In [154]: pd.crosstab(index=df['Date'], columns=df['Identifier'], 
                      values=df['Price'], aggfunc='first') \
            .rename_axis(None, 1)
Out[154]:
            123457ZB  353477ZB  693477AA
Date
1990/10/31       NaN       NaN     100.0
1991/08/31       NaN     101.0       NaN
1992/08/31     105.0       NaN       NaN

or pivot_table():

In [156]: df.pivot_table(index='Date', columns='Identifier', values='Price', fill_value=0).rename_axis(None, 1)
Out[156]:
            123457ZB  353477ZB  693477AA
Date
1990/10/31         0         0       100
1991/08/31         0       101         0
1992/08/31       105         0         0

PS if you prefer to "pivot" data on the SQL Server side - please check this question

MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
1

Or you can using groupby.

df.groupby(['Identifier','Date'])['Price'].sum().unstack('Identifier')
BENY
  • 317,841
  • 20
  • 164
  • 234