2

After all the processing I was able to create below dataframe. The only problem with it is that the year is incorrect. The date is in decreasing order for each Location. So after 2015-01-15 it should be 2014-12-15, not 2015-12-15.

+--------------------+---------------+-------+
|   Location         | Date          | Value |
+--------------------+---------------+-------+
| India              | 2015-03-15    |   -200|
| India              | 2015-02-15    |  140  |
| India              | 2015-01-15    |  155  |
| India              | 2015-12-15    |   85  |
| India              | 2015-11-15    |   45  |
| China              | 2015-03-15    |   199 |
| China              | 2015-02-15    |  164  |
| China              | 2015-01-15    |  209  |
| China              | 2015-12-15    |   24  |
| China              | 2015-11-15    |   11  |
| Russia             | 2015-03-15    |   48  |
| Russia             | 2015-02-15    |  104  |
| Russia             | 2015-01-15    |  106  |
| Russia             | 2015-12-15    |   -20 |
| Russia             | 2015-11-15    |   10  |
Chankey Pathak
  • 21,187
  • 12
  • 85
  • 133

3 Answers3

2

Making the strong assumption that these are monthly dates ending on the 15th of every month and that the first value for a given Location is correct, we can step backwards monthly by Location.

# Create original dataframe.
df = pd.DataFrame({'Location': ['India'] * 5 + ['China'] * 5 + ['Russia'] * 5,
                   'Date': ['2015-03-15', '2015-02-15', '2015-01-15', '2015-12-15', '2015-11-15'] * 3,
                   'Value': [-200, 140, 155, 85, 45, 199, 164, 209, 24, 11, 48, 104, 106, -20, 10]})[
    ['Location', 'Date', 'Value']
]
# Convert dates to pandas Timestamps.
df['Date'] = pd.DatetimeIndex(df['Date'])

gb = df.groupby(['Location'])['Date']
df['Date'] = [
    str(first_period - months) + '-15'
     for location_months, first_period in zip(
         gb.count(), gb.first().apply(lambda date: pd.Period(date, 'M'))) 
     for months in range(location_months)
]
>>> df
   Location        Date  Value
0     India  2015-03-15   -200
1     India  2015-02-15    140
2     India  2015-01-15    155
3     India  2014-12-15     85
4     India  2014-11-15     45
5     China  2015-03-15    199
6     China  2015-02-15    164
7     China  2015-01-15    209
8     China  2014-12-15     24
9     China  2014-11-15     11
10   Russia  2015-03-15     48
11   Russia  2015-02-15    104
12   Russia  2015-01-15    106
13   Russia  2014-12-15    -20
14   Russia  2014-11-15     10

The final dates are in string form which you may again wish to convert back to Timestamps via:

df['Date'] = pd.DatetimeIndex(df['Date'])
Alexander
  • 105,104
  • 32
  • 201
  • 196
  • 1
    Not sure sorting is what OP wants. They want to convert erroneous dates to the right ones by subtracting a year. – cs95 Aug 24 '17 at 08:38
  • It is virtually impossible to solve without knowing how the dates were coded. For example, how may years of data could there be? Which values are 'correct'? – Alexander Aug 24 '17 at 08:42
  • @Alexander You can do it based on location. If you find 'January' then the following dates should have year-1 for the same location. Then the same logic would apply on other location. – Chankey Pathak Aug 24 '17 at 08:45
2

You have to iterate over Date Series in pandas dataframe as following and check if previous date is January to subtract a year(365 days) from date.

from dateutil.relativedelta import  relativedelta

for idx, date in df['Date'].iteritems()[1:]:
    if df['Date'].iloc[idx-1].month == 1:
        date = date - relativedelta(years=1)
        # date = date - pd.DateOffset(years=1)

EDIT: relativedelta will catch leap years or you may use pd.DateOffset(years=1) in that place.

Hope it helps!

abhinav
  • 1,108
  • 11
  • 23
1

If you dont mind using loops, you can do it so -

import pandas as pd

dt = ["2015-03-15", "2015-02-15", "2015-01-15", "2015-12-15", "2015-11-15", 
"2015-03-15", "2015-02-15", "2015-01-15", "2015-12-15", "2015-11-15", "2015-03-15", 
"2015-02-15", "2015-01-15", "2015-12-15", "2015-11-15"]

df = pd.DataFrame(dt,columns=['dt'])
cntry = ['India', 'China', 'Russia']*5
cntry.sort()
df.loc[:,'country'] = cntry

collect = []
for cntry in df.country.unique().tolist():
    # print(cntry)
    year_ = 0
    i = 0
    for dt in df.loc[df.country == cntry,'dt']:
        # print(df.loc[df.country == cntry,'dt'].iloc[i,], str(int(dt[:4])+year_)+dt[4:])
        collect.append(str(int(dt[:4])+year_)+dt[4:])
        if int(dt[5:7]) == 1:
            year_-=1    
        i+=1

df.loc[:,'dt'] = collect
Clock Slave
  • 7,627
  • 15
  • 68
  • 109