1

Good day, please all I want to expand the date range implied by StartDate and EndDate.

import pandas as pd
import datetime
from pandas_datareader import data, wb
import csv

out= open("testfile.csv", "rb")
data = csv.reader(out)
data = [[row[0],row[1] + row[2],row[3] + row[4], row[5],row[6]] for row in data]
out.close()
print data

out=open("data.csv", "wb")
output = csv.writer(out)

for row in data:
    output.writerow(row)

out.close()

df = pd.read_csv('data.csv')
for DateDpt, DateAr in df.iteritems():
    df.DateDpt = pd.to_datetime(df.DateDpt, format='%Y-%m-%d')
    df.DateAr = pd.to_datetime(df.DateAr, format='%Y-%m-%d')

df['DateAr'] = [pd.to_datetime(x, format='%Y-%m-%d') for x in df['DateAr']]
df['DateDpt'] = [pd.to_datetime(x, format='%Y-%m-%d') for x in df['DateDpt']]

df['range'] = df['DateDpt']-df['DateAr']

print df

ID      ArCityArCountry      DptCityDptCountry    EndDate       StartDate  
1922    ParisFrance          NewYorkUnitedState   2008-03-10    2008-12-01  
1901    ParisFrance          LagosNigeria         2001-03-05    2001-02-02  
1922    ParisFrance          NewYorkUnitedState   2011-02-03    2008-12-01  
1002    ParisFrance          CaliforniaUnitedState 2003-03-04   2002-03-04  
1099    ParisFrance           BeijingChina         2011-02-03   2009-02-04  
1901    LosAngelesUnitedState ParisFrance          2001-03-05   2001-02-02 

Output:

ID    ArCityArCountry       DptCityDptCountry     EndDate  
1922  ParisFrance           NewYorkUnitedState    2008-03-10    
1002  LosAngelesUnitedState ForidaUnitedState     2008-03-10     
1901  ParisFrance           LagosNigeria          2001-03-05     
1922  ParisFrance           NewYorkUnitedState    2011-02-03     
1002  ParisFrance           CaliforniaUnitedState 2003-03-04     
1099  ParisFrance           BeijingChina          2011-02-03     
1901  LosAngelesUnitedState ParisFrance           2001-03-05   

StartDate     range    
2001-02-02 2593 days    
2008-12-01  266 days     
2001-02-02   31 days    
2008-12-01  794 days    
2002-03-04  365 days  
2009-02-04  729 days  
2001-02-02   31 days  

Expected out:

Let's consider row1, we have 2593 days, I want a situation that from StartDate i.e. 2001-02-02 to the EndDate i.e 2008-03-10, be listed this should run through all the rows by expanding based on range until the value on StartDate matches EndDate.

ID    ArCityArCountry        DptCityDptCountry    StartDate    EndDate  
1922    ParisFrance          NewYorkUnitedState   2004-03-10    2008-12-01  
1922    ParisFrance          NewYorkUnitedState   2004-03-11    2008-12-01 
1922    ParisFrance          NewYorkUnitedState   2004-03-12    2008-12-01 

until it gets to that date of EndDate which means on both date I should have something like StartDate = EndDate, ie 2008-12-01 on both sides. Considering the csv

1922    ParisFrance          NewYorkUnitedState   2008-12-01    2008-12-01

Thank you so much

Another Question on it: Thanks.. I have another question. I want to create a JSON, considering StartDate (However, wherever there are two dates matching each other, one of the dates would be used while all attributes will be appended. Let me make an example, { "2001-02-02" = { ParisFrance (ArCityArCountry): 1922 NewYorkUnitedStates: 1922} } if we traverse the down csv we might likely to have another 2001-02-02. Instead of creating it, we can append it to initial StartDate. However, the DptCityDptCountry might be different but if another ID matches with the StartDate and DptCityDptCountry, it will be added up i.e.

{"2001-02-02" = { 
ParisFrance (ArCityArCountry): 1922, 2212  //these are IDs with same StartDate and ArCityArCountry   
NewYorkUnitedStates: 1922, 0029  //these are IDs with same StartDate and DptCityDptCountry}
}
Payne
  • 543
  • 2
  • 12
  • 32

3 Answers3

2

Starting with:

     ID        ArCityArCountry      DptCityDptCountry  EndDate StartDate
0  1922            ParisFrance     NewYorkUnitedState  3/10/08    2/2/01
1  1002  LosAngelesUnitedState      ForidaUnitedState  3/10/08   12/1/08
2  1901            ParisFrance           LagosNigeria   3/5/01    2/2/01
3  1922            ParisFrance     NewYorkUnitedState   2/3/11   12/1/08
4  1002            ParisFrance  CaliforniaUnitedState   3/4/03    3/4/02
5  1099            ParisFrance           BeijingChina   2/3/11    2/4/09
6  1901  LosAngelesUnitedState            ParisFrance   3/5/01    2/2/01

You can get your desired output as follows:

df.EndDate = pd.to_datetime(df.EndDate)
df.StartDate = pd.to_datetime(df.StartDate)
df = df.set_index('StartDate')
new_df = pd.DataFrame()
for i, data in df.iterrows():
    data = data.to_frame().transpose()
    data = data.reindex(pd.date_range(start=data.index[0], end=data.EndDate[0])).fillna(method='ffill').reset_index().rename(columns={'index': 'StartDate'})
    new_df = pd.concat([new_df, data])

new_df = new_df[['ID', 'ArCityArCountry', 'DptCityDptCountry', 'StartDate', 'EndDate']]

      ID        ArCityArCountry   DptCityDptCountry  StartDate    EndDate
0   1922            ParisFrance  NewYorkUnitedState 2001-02-02 2008-03-10
1   1922            ParisFrance  NewYorkUnitedState 2001-02-03 2008-03-10
2   1922            ParisFrance  NewYorkUnitedState 2001-02-04 2008-03-10
3   1922            ParisFrance  NewYorkUnitedState 2001-02-05 2008-03-10
4   1922            ParisFrance  NewYorkUnitedState 2001-02-06 2008-03-10
5   1922            ParisFrance  NewYorkUnitedState 2001-02-07 2008-03-10
6   1922            ParisFrance  NewYorkUnitedState 2001-02-08 2008-03-10
7   1922            ParisFrance  NewYorkUnitedState 2001-02-09 2008-03-10
8   1922            ParisFrance  NewYorkUnitedState 2001-02-10 2008-03-10
9   1922            ParisFrance  NewYorkUnitedState 2001-02-11 2008-03-10
10  1922            ParisFrance  NewYorkUnitedState 2001-02-12 2008-03-10
11  1922            ParisFrance  NewYorkUnitedState 2001-02-13 2008-03-10
12  1922            ParisFrance  NewYorkUnitedState 2001-02-14 2008-03-10
13  1922            ParisFrance  NewYorkUnitedState 2001-02-15 2008-03-10
14  1922            ParisFrance  NewYorkUnitedState 2001-02-16 2008-03-10
15  1922            ParisFrance  NewYorkUnitedState 2001-02-17 2008-03-10
16  1922            ParisFrance  NewYorkUnitedState 2001-02-18 2008-03-10
17  1922            ParisFrance  NewYorkUnitedState 2001-02-19 2008-03-10
18  1922            ParisFrance  NewYorkUnitedState 2001-02-20 2008-03-10
19  1922            ParisFrance  NewYorkUnitedState 2001-02-21 2008-03-10
20  1922            ParisFrance  NewYorkUnitedState 2001-02-22 2008-03-10
21  1922            ParisFrance  NewYorkUnitedState 2001-02-23 2008-03-10
22  1922            ParisFrance  NewYorkUnitedState 2001-02-24 2008-03-10
23  1922            ParisFrance  NewYorkUnitedState 2001-02-25 2008-03-10
24  1922            ParisFrance  NewYorkUnitedState 2001-02-26 2008-03-10
25  1922            ParisFrance  NewYorkUnitedState 2001-02-27 2008-03-10
26  1922            ParisFrance  NewYorkUnitedState 2001-02-28 2008-03-10
27  1922            ParisFrance  NewYorkUnitedState 2001-03-01 2008-03-10
28  1922            ParisFrance  NewYorkUnitedState 2001-03-02 2008-03-10
29  1922            ParisFrance  NewYorkUnitedState 2001-03-03 2008-03-10
..   ...                    ...                 ...        ...        ...
2   1901  LosAngelesUnitedState         ParisFrance 2001-02-04 2001-03-05
3   1901  LosAngelesUnitedState         ParisFrance 2001-02-05 2001-03-05
4   1901  LosAngelesUnitedState         ParisFrance 2001-02-06 2001-03-05
5   1901  LosAngelesUnitedState         ParisFrance 2001-02-07 2001-03-05
6   1901  LosAngelesUnitedState         ParisFrance 2001-02-08 2001-03-05
7   1901  LosAngelesUnitedState         ParisFrance 2001-02-09 2001-03-05
8   1901  LosAngelesUnitedState         ParisFrance 2001-02-10 2001-03-05
9   1901  LosAngelesUnitedState         ParisFrance 2001-02-11 2001-03-05
10  1901  LosAngelesUnitedState         ParisFrance 2001-02-12 2001-03-05
11  1901  LosAngelesUnitedState         ParisFrance 2001-02-13 2001-03-05
12  1901  LosAngelesUnitedState         ParisFrance 2001-02-14 2001-03-05
13  1901  LosAngelesUnitedState         ParisFrance 2001-02-15 2001-03-05
14  1901  LosAngelesUnitedState         ParisFrance 2001-02-16 2001-03-05
15  1901  LosAngelesUnitedState         ParisFrance 2001-02-17 2001-03-05
16  1901  LosAngelesUnitedState         ParisFrance 2001-02-18 2001-03-05
17  1901  LosAngelesUnitedState         ParisFrance 2001-02-19 2001-03-05
18  1901  LosAngelesUnitedState         ParisFrance 2001-02-20 2001-03-05
19  1901  LosAngelesUnitedState         ParisFrance 2001-02-21 2001-03-05
20  1901  LosAngelesUnitedState         ParisFrance 2001-02-22 2001-03-05
21  1901  LosAngelesUnitedState         ParisFrance 2001-02-23 2001-03-05
22  1901  LosAngelesUnitedState         ParisFrance 2001-02-24 2001-03-05
23  1901  LosAngelesUnitedState         ParisFrance 2001-02-25 2001-03-05
24  1901  LosAngelesUnitedState         ParisFrance 2001-02-26 2001-03-05
25  1901  LosAngelesUnitedState         ParisFrance 2001-02-27 2001-03-05
26  1901  LosAngelesUnitedState         ParisFrance 2001-02-28 2001-03-05
27  1901  LosAngelesUnitedState         ParisFrance 2001-03-01 2001-03-05
28  1901  LosAngelesUnitedState         ParisFrance 2001-03-02 2001-03-05
29  1901  LosAngelesUnitedState         ParisFrance 2001-03-03 2001-03-05
30  1901  LosAngelesUnitedState         ParisFrance 2001-03-04 2001-03-05
31  1901  LosAngelesUnitedState         ParisFrance 2001-03-05 2001-03-05
Stefan
  • 41,759
  • 13
  • 76
  • 81
1

This is just to initialize a DataFrame to be what I can see that you have:

cols = ['ID', 'ArCityArCountry', 'DptCityDptCountry', 'EndDate', 'StartDate']
df = pd.DataFrame(dict(ID=[1922, 1002, 1901, 1922, 1002, 1099, 1902],
                       ArCityArCountry=['ParisFrance',
                                      'LosAngelesUnitedStates',
                                      'ParisFrance',
                                      'ParisFrance',
                                      'ParisFrance',
                                      'ParisFrance',
                                      'LosAngelesUnitedStates'],
                       DptCityDptCountry=['NewYorkUnitedStates',
                                          'FloridaUnitedStates',
                                          'LagosNigeria',
                                          'NewYorkUnitedStates',
                                          'CaliforniaUnitedStates',
                                          'BeijingChina',
                                          'ParisFrance'],
                       EndDate=pd.to_datetime(['3/10/08',
                                               '3/10/08',
                                               '3/5/01',
                                               '2/3/11',
                                               '3/4/03',
                                               '2/3/11',
                                               '3/5/01']),
                       StartDate=pd.to_datetime(['2/2/01',
                                                 '12/1/08',
                                                 '2/2/01',
                                                 '12/1/08',
                                                 '3/4/02',
                                                 '2/4/09',
                                                 '2/2/01'])))[cols]

Then I push all but 1 column into the index with set_index. This leaves one column which comes back as a Series. Then use apply and return a series indexed on the expanded set of dates for each row (Series of Series = DataFrame). So for each of the 7 rows in the DataFrame, I get a series indexed on the expanded date range. Then its just clever stacking, naming, and reset_index.

# Use idx to clean up the set_index call
idx = ['ID', 'ArCityArCountry', 'DptCityDptCountry', 'EndDate']

def f(x):
    # x will be an element of a series with the values of the columns specified in idx
    # as the index value which is stored in the name attribute.
    # x.name[-1] is the last element of the name attribute which is the
    # EndDate.  This corresponds to the last element of the idx list above
    date_index = pd.Index(pd.date_range(x.StartDate, x.name[-1])

    # I return a named series so the 'Date' becomes a column name
    return pd.Series(x.StartDate, index=date_index, name='Date'))

temp = df.set_index(idx).apply(f, axis=1)
# I didn't have to wrap temp.stack() in a series but doing so allows me
# to name it and have that show up as a column name
final = pd.Series(temp.stack(), name='StartDate').reset_index()

The result looks like below (I removed StartDate and EndDate for aesthetic purposes)

print final[idx[:-1] + ['Date']]

        ID         ArCityArCountry    DptCityDptCountry       Date
0     1922             ParisFrance  NewYorkUnitedStates 2001-02-02
1     1922             ParisFrance  NewYorkUnitedStates 2001-02-03
2     1922             ParisFrance  NewYorkUnitedStates 2001-02-04
3     1922             ParisFrance  NewYorkUnitedStates 2001-02-05
4     1922             ParisFrance  NewYorkUnitedStates 2001-02-06
5     1922             ParisFrance  NewYorkUnitedStates 2001-02-07
6     1922             ParisFrance  NewYorkUnitedStates 2001-02-08
7     1922             ParisFrance  NewYorkUnitedStates 2001-02-09
8     1922             ParisFrance  NewYorkUnitedStates 2001-02-10
9     1922             ParisFrance  NewYorkUnitedStates 2001-02-11
10    1922             ParisFrance  NewYorkUnitedStates 2001-02-12
11    1922             ParisFrance  NewYorkUnitedStates 2001-02-13
12    1922             ParisFrance  NewYorkUnitedStates 2001-02-14
13    1922             ParisFrance  NewYorkUnitedStates 2001-02-15
14    1922             ParisFrance  NewYorkUnitedStates 2001-02-16
15    1922             ParisFrance  NewYorkUnitedStates 2001-02-17
16    1922             ParisFrance  NewYorkUnitedStates 2001-02-18
17    1922             ParisFrance  NewYorkUnitedStates 2001-02-19
18    1922             ParisFrance  NewYorkUnitedStates 2001-02-20
19    1922             ParisFrance  NewYorkUnitedStates 2001-02-21
20    1922             ParisFrance  NewYorkUnitedStates 2001-02-22
21    1922             ParisFrance  NewYorkUnitedStates 2001-02-23
22    1922             ParisFrance  NewYorkUnitedStates 2001-02-24
23    1922             ParisFrance  NewYorkUnitedStates 2001-02-25
24    1922             ParisFrance  NewYorkUnitedStates 2001-02-26
25    1922             ParisFrance  NewYorkUnitedStates 2001-02-27
26    1922             ParisFrance  NewYorkUnitedStates 2001-02-28
27    1922             ParisFrance  NewYorkUnitedStates 2001-03-01
28    1922             ParisFrance  NewYorkUnitedStates 2001-03-02
29    1922             ParisFrance  NewYorkUnitedStates 2001-03-03
...    ...                     ...                  ...        ...
4519  1901  LosAngelesUnitedStates          ParisFrance 2001-02-04
4520  1901  LosAngelesUnitedStates          ParisFrance 2001-02-05
4521  1901  LosAngelesUnitedStates          ParisFrance 2001-02-06
4522  1901  LosAngelesUnitedStates          ParisFrance 2001-02-07
4523  1901  LosAngelesUnitedStates          ParisFrance 2001-02-08
4524  1901  LosAngelesUnitedStates          ParisFrance 2001-02-09
4525  1901  LosAngelesUnitedStates          ParisFrance 2001-02-10
4526  1901  LosAngelesUnitedStates          ParisFrance 2001-02-11
4527  1901  LosAngelesUnitedStates          ParisFrance 2001-02-12
4528  1901  LosAngelesUnitedStates          ParisFrance 2001-02-13
4529  1901  LosAngelesUnitedStates          ParisFrance 2001-02-14
4530  1901  LosAngelesUnitedStates          ParisFrance 2001-02-15
4531  1901  LosAngelesUnitedStates          ParisFrance 2001-02-16
4532  1901  LosAngelesUnitedStates          ParisFrance 2001-02-17
4533  1901  LosAngelesUnitedStates          ParisFrance 2001-02-18
4534  1901  LosAngelesUnitedStates          ParisFrance 2001-02-19
4535  1901  LosAngelesUnitedStates          ParisFrance 2001-02-20
4536  1901  LosAngelesUnitedStates          ParisFrance 2001-02-21
4537  1901  LosAngelesUnitedStates          ParisFrance 2001-02-22
4538  1901  LosAngelesUnitedStates          ParisFrance 2001-02-23
4539  1901  LosAngelesUnitedStates          ParisFrance 2001-02-24
4540  1901  LosAngelesUnitedStates          ParisFrance 2001-02-25
4541  1901  LosAngelesUnitedStates          ParisFrance 2001-02-26
4542  1901  LosAngelesUnitedStates          ParisFrance 2001-02-27
4543  1901  LosAngelesUnitedStates          ParisFrance 2001-02-28
4544  1901  LosAngelesUnitedStates          ParisFrance 2001-03-01
4545  1901  LosAngelesUnitedStates          ParisFrance 2001-03-02
4546  1901  LosAngelesUnitedStates          ParisFrance 2001-03-03
4547  1901  LosAngelesUnitedStates          ParisFrance 2001-03-04
4548  1901  LosAngelesUnitedStates          ParisFrance 2001-03-05

[4549 rows x 4 columns]
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • It works perfectly but not exactly what I want. I think Stefan has produced the solution. Thanks sir, I do appreciate it – Payne Dec 17 '15 at 02:12
0

Standard Import

import pandas as pd
import numpy as np

Copy table to clipboard

     ID        ArCityArCountry      DptCityDptCountry  EndDate StartDate
0  1922            ParisFrance     NewYorkUnitedState  3/10/08    2/2/01
1  1002  LosAngelesUnitedState      ForidaUnitedState  3/10/08   12/1/08
2  1901            ParisFrance           LagosNigeria   3/5/01    2/2/01
3  1922            ParisFrance     NewYorkUnitedState   2/3/11   12/1/08
4  1002            ParisFrance  CaliforniaUnitedState   3/4/03    3/4/02
5  1099            ParisFrance           BeijingChina   2/3/11    2/4/09
6  1901  LosAngelesUnitedState            ParisFrance   3/5/01    2/2/01

Import Clipboard to DataFrame

df = pd.read_clipboard()

Set Date Columns to datetime64 dtype

df['StartDate'] = pd.to_datetime(df['StartDate'])
df['EndDate'] = pd.to_datetime(df['EndDate'])

Create Key since ID Is not unique in given dataset

df['Unique_ID'] = df.index

Set StartDate as Index for reindex to join data

df.set_index('StartDate', inplace=True)

The core of the data transformation

def reindex_by_date(df):
    dates = pd.date_range(df.index.min(), df['EndDate'].min())
    return df.reindex(dates).ffill()

Futher Clean up the table from here...but this is the desired result

df = df.groupby('Unique_ID').apply(reindex_by_date)

Source:

Pandas reindex dates in Groupby

Michael Gardner
  • 1,693
  • 1
  • 11
  • 13