1

I'm working on pulling financial data, in which some is formatted in quarterly and other is daily. My model will need all of it daily, therefore I need that same quarterly value repeated for each day. I've been using this stack post and trying to adapt the code to my data.

Here is my dataframe head:

  date        ticker   value
0 31/03/1980  ECB/RA6  1.0
1 30/06/1980  ECB/RA6  4.0
2 30/09/1980  ECB/RA6  2.0
3 30/12/1980  ECB/RA6  3.0
4 31/03/1981  ECB/RA6  2.0

Here is my desired output looks like:

   date        ticker   value
0  01/01/1980  ECB/RA6  1.0
1  02/01/1980  ECB/RA6  1.0
2  03/01/1980  ECB/RA6  1.0
3  04/01/1980  ECB/RA6  1.0
4  05/01/1980  ECB/RA6  1.0
.  .           .        .
.  .           .        .
.  .           .        .
91 01/04/1980  ECB/RA6  4.0

And my code:

df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
df = df.pivot(index='date', columns='ticker')
start_date = df.index.min() - pd.DateOffset(day=1)
end_date = df.index.max() + pd.DateOffset(day=31)
dates = pd.date_range(start_date, end_date, freq='d')
dates.name = 'date'
df = df.reindex(dates, method='ffill')

df = df.stack('ticker')
df = df.sortlevel(level=1)
df = df.reset_index()

I see now what the problem is, but it shoudn't be a problem i believe. I run the following code from @Pierre D(after removing duplicates):

df = df.set_index('date')  # assuming 'date' is a proper Timestamp
df.index = df.index.to_period('Q')  # turn index into PeriodIndex('Q')
df = df.set_index('ticker', append=True).squeeze()
df2 = df[df.duplicated( keep = False)]

I get the following output for df2:

                value   value2
date    ticker      
1997Q2  AAPL    46850   NaN
1997Q3  AAPL    46850   NaN
2003Q1  MSFT    10137   19/12/2003
2003Q2  MSFT    10137   19/12/2003

As you can see the index is different, but value and value2 are equal in these instances. This should not be a problem I believe, but when I now run:

df.unstack()

I get the following error: "ValueError: Index contains duplicate entries, cannot reshape"

Thanks in advance everyone!

t.pellegrom
  • 313
  • 3
  • 10
  • 1
    You must have multiple tickers in the same dataframe? – Jakob L Dec 17 '20 at 23:45
  • 1
    Can you create a separate column for each ticker? That way you can index by date. – Jakob L Dec 17 '20 at 23:46
  • @JakobL Yes there are more than 600 tickers in my ticker column. Furthermore, to make the question better to read, I have left out other columns. So, besides the "value" column there are other columns. I'm unsure if it would be wise/feasible to approach it from that perspective. – t.pellegrom Dec 18 '20 at 09:41

1 Answers1

2

Update

The question was modified to indicate that there were other columns beside 'value', and from some of the comments I gather that "going wide" might be a concern (note: we routinely handle similar timeseries with thousands of columns without any problem).

So, here is another take. It does the same initial step of transforming the purported 'date' into what it really is: a quarterly Period. But then it applies a way of resampling time in a multi-index (time, key) by groups of key. There are several StackOverflow answers to that question, such as this one.

All together (with an example):

# setup for example
txt = """  date        ticker   value   value2
0 31/03/1980  ECB/RA6  1.0  NA
1 30/06/1980  another  4.0  NA
2 30/09/1980  ECB/RA6  2.0  19/12/2003
3 30/12/1980  ECB/RA6  3.0  19/12/2003
4 31/03/1981  ECB/RA6  2.0  19/12/2003
"""
df = pd.read_csv(io.StringIO(re.sub(r' +', '\t', txt)),
                 sep='\t', index_col=[0],
                 parse_dates=['date', 'value2'])

# set date as index and convert to quarterly periods
df = df.set_index('date')
df.index = df.index.to_period('Q')

# and now the new resample method (here monthly,
# but change to 'D' for daily)
df = df.groupby('ticker').resample('M').ffill()

You can then .reset_index() if you prefer, or just leave it as is. Here is the result without resetting the index:

>>> df
                  ticker  value     value2
ticker  date                              
ECB/RA6 1980-03  ECB/RA6    1.0        NaT
        1980-04  ECB/RA6    1.0        NaT
        1980-05  ECB/RA6    1.0        NaT
        1980-06  ECB/RA6    1.0        NaT
        1980-07  ECB/RA6    1.0        NaT
        1980-08  ECB/RA6    1.0        NaT
        1980-09  ECB/RA6    2.0 2003-12-19
        1980-10  ECB/RA6    2.0 2003-12-19
        1980-11  ECB/RA6    2.0 2003-12-19
        1980-12  ECB/RA6    3.0 2003-12-19
        1981-01  ECB/RA6    3.0 2003-12-19
        1981-02  ECB/RA6    3.0 2003-12-19
        1981-03  ECB/RA6    2.0 2003-12-19
another 1980-06  another    4.0        NaT

Original answer

Here is what I would do: first, set your date to be the index and convert it into a PeriodIndex, then make your df "wide" by putting each ticker into a column. Then just resample:

df = df.set_index('date')  # assuming 'date' is a proper Timestamp
df.index = df.index.to_period('Q')  # turn index into PeriodIndex('Q')
df = df.set_index('ticker', append=True).squeeze().unstack()  # make wide: 1 col per ticker
df.resample('D').ffill()  # resample to daily, repeating the values

Outcome:

             value
ticker     ECB/RA6
date              
1980-01-01     1.0
1980-01-02     1.0
1980-01-03     1.0
1980-01-04     1.0
1980-01-05     1.0
...            ...
1981-03-27     2.0
1981-03-28     2.0
1981-03-29     2.0
1981-03-30     2.0
1981-03-31     2.0

Perhaps it's easier to inspect the outcome if you resample by month instead:

df.resample('M').ffill()  # resample to daily, repeating the values

# out:
ticker   ECB/RA6
date            
1980-01      1.0
1980-02      1.0
1980-03      1.0
1980-04      4.0
1980-05      4.0
1980-06      4.0
1980-07      2.0
1980-08      2.0
1980-09      2.0
1980-10      3.0
1980-11      3.0
1980-12      3.0
1981-01      2.0
1981-02      2.0
1981-03      2.0

Incidentally, it is useful to observe what happens with missing data:

# with input df as:
        date   ticker  value
0 1980-03-31  ECB/RA6    1.0
1 1980-06-30  another    4.0
2 1980-09-30  ECB/RA6    2.0

# output:
ticker   ECB/RA6  another
date                     
1980-01      1.0      NaN
1980-02      1.0      NaN
1980-03      1.0      NaN
1980-04      NaN      4.0
1980-05      NaN      4.0
1980-06      NaN      4.0
1980-07      2.0      NaN
1980-08      2.0      NaN
1980-09      2.0      NaN

Final note: of course, you can stack again the result if you want to get it as a tall and skinny table (and even reset the index, if you prefer):

print(df.resample('M').ffill().stack().reset_index())

# out:
      date   ticker    0
0  1980-01  ECB/RA6  1.0
1  1980-02  ECB/RA6  1.0
2  1980-03  ECB/RA6  1.0
3  1980-04  another  4.0
4  1980-05  another  4.0
5  1980-06  another  4.0
6  1980-07  ECB/RA6  2.0
7  1980-08  ECB/RA6  2.0
8  1980-09  ECB/RA6  2.0
Pierre D
  • 24,012
  • 7
  • 60
  • 96
  • When I run your code I get the exact same error message as when I run my own code: "ValueError: Index contains duplicate entries, cannot reshape". This error occurs in the third line of your code: df = df.set_index('ticker', append=True).squeeze().unstack(). I believe that this is caused by the fact that there are duplicates in the date column. – t.pellegrom Dec 18 '20 at 10:01
  • 1
    Do you mean that `df.groupby(['date', 'ticker']).size().max() > 1'`? Repeating dates (for different tickers) would be expected and handled just fine, but not repeated values of the `(date, ticker)` tuple. If there are, you need to decide how to handle the values (if different). One quick first check is `df.drop_duplicates()` in case some entire rows are repeated. – Pierre D Dec 18 '20 at 12:42
  • You were right that there were duplicates in the data for the `(date,ticker)` tuple. However, when I use `df = df.drop_duplicates(['date','ticker'])` and get `df.groupby(['date', 'ticker']).size().max() = 1` I still get the same error at the `unstack()` part of the code. – t.pellegrom Dec 19 '20 at 09:05
  • I edited the question and specified what specifically appears to cause this error. It appears to be a problem where not the date and ticker (the index tuple) are duplicates, but the remaining values of the dataframe. I fail to understand why those values are required to be different by `unstack()` though. – t.pellegrom Dec 19 '20 at 10:07
  • 1
    ok, since having multiple columns changes the question, I adapted the answer. Please see if that works for you. – Pierre D Dec 19 '20 at 12:41