
While this may at first sound like a statistics question, please bear with me.

I have several calcium concentrations from water samples collected at different sampling locations. Some of the stations are resampled on a monthly, yearly, or every-other-year basis.

I want to measure yearly and decadal changes in calcium concentration for groups of stations using a Wilcoxon-Pratt signed-rank test, as performed by Lindsey and Rupert (http://pubs.usgs.gov/sir/2012/5049/). To conduct the test, I want to create pairs of measurements separated by a year (a time delta of 365 days), or as close to that as possible. The paired measurements should share the same month, just different years, and I only need one pair per month per station. I would prefer not to average the concentrations of samples that share the same station, month, and year.

Here is a sample of my data: https://raw.githubusercontent.com/inkenbrandt/IPython/master/Calcium_Samples.csv

                SampleLocation  CalciumConc_mgL
SampleDate                                     
10/1/1947 0:00   USGS-09382000             66.0
10/15/1947 0:00  USGS-09382000            132.0
1/1/1948 0:00    USGS-09382000            130.0
1/15/1948 0:00   USGS-09382000             98.0
5/1/1948 0:00    USGS-09382000             82.0
5/15/1948 0:00   USGS-09382000             53.0
6/1/1948 0:00    USGS-09382000            142.0
9/1/1948 0:00    USGS-09382000            107.0
9/15/1948 0:00   USGS-09382000             59.0
10/1/1948 0:00   USGS-09382000            106.0
10/15/1948 0:00  USGS-09382000            102.0
5/15/1949 0:00   USGS-09382000             59.0
6/1/1949 0:00    USGS-09382000             50.0
6/15/1949 0:00   USGS-09382000            161.0
9/1/1949 0:00    USGS-09382000             82.0
9/15/1949 0:00   USGS-09382000            376.0
10/1/1949 0:00   USGS-09382000            210.0
10/15/1949 0:00  USGS-09382000            131.0
1/1/1950 0:00    USGS-09382000            132.0
...                        ...              ...
9/20/1947 0:00   USGS-09288500             59.0
9/20/1947 0:00   USGS-09288500             59.0
6/9/1948 0:00    USGS-09288500             51.0
6/9/1948 0:00    USGS-09288500             51.0
9/29/1948 0:00   USGS-09288500             51.0
9/29/1948 0:00   USGS-09288500             51.0
9/10/1949 0:00   USGS-09288500             40.0
5/19/1941 0:00   USGS-09295000             33.0
6/16/1941 0:00   USGS-09295000              3.4
5/11/1947 0:00   USGS-09295000             42.0
6/22/1947 0:00   USGS-09295000             32.0
9/20/1947 0:00   USGS-09295000             97.0
6/9/1948 0:00    USGS-09295000             37.0
9/29/1948 0:00   USGS-09295000            126.0
9/10/1949 0:00   USGS-09295000             93.0

[429 rows x 2 columns]

I want to produce a Pandas dataframe that looks something like this:

SampleLocation   SampleDate1     CaConc1    SampleDate2     CaConc2
USGS-09382000    10/1/1947 0:00     66.0    10/1/1948 0:00    106.0
USGS-09382000    10/15/1947 0:00   132.0    10/15/1948 0:00   102.0
USGS-09382000    5/15/1948 0:00     53.0    5/15/1949 0:00     59.0
...              ...                 ...    ...                 ...
USGS-09288500    9/20/1947 0:00     59.0    9/29/1948 0:00     51.0

I believe that this might be approached using the multi-indexing functionality in Pandas. So far, I have looked at the following Stack Overflow questions for help with matching dates and manipulating indexes:

I think the second link gets pretty close by unstacking multi-indexes, and I might be able to make that work if I were willing to aggregate, but I am trying to avoid aggregation.

This technique would be relevant to others who want to analyze data with seasonal trends, such as comparing stream discharge, cumulative precipitation, or temperature on the same day (or close to the same day) in different years.

  • You can either use a time delta (e.g. 365 days), or else look 'one-year' after the date in which case you'll have to account for leap year, missing data points, etc. Which would you prefer? – Alexander Apr 23 '15 at 22:03
  • @Alexander The time delta sounds easier to implement, so I would prefer that. I have edited the post to reflect my preference. – Inkenbrandt Apr 23 '15 at 22:07
  • Something that would probably get you close to the right answer and also be pretty fast would be to `resample` and then `shift`. – JohnE Apr 23 '15 at 22:41

1 Answer


This method is a little messy, but I am trying to make it more robust to account for missing data.
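
If you're starting from the linked CSV, a minimal setup might look like this (a sketch; it assumes the file's columns match the printout above):

import pandas as pd

# Load the sample data; read_csv accepts a URL or a local path
df = pd.read_csv('https://raw.githubusercontent.com/inkenbrandt/IPython/master/Calcium_Samples.csv')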

First, we'll remove duplicates in the data and then convert the dates to Pandas Timestamps:

df = df.drop_duplicates()
df.SampleDate = pd.to_datetime(df.SampleDate)

Then let's arrange your DataFrame so that it is indexed on a unique set of dates (the columns will be the location IDs):

df2 = df.pivot_table(values='CalciumConc_mgL', 
                     index='SampleDate', 
                     columns='SampleLocation').ffill()

I've filled forward the values to make the results more robust. You may want to limit the number of days that potentially get filled forward (e.g. .ffill(limit=30)).
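
One more caveat, based on an assumption about your data: pivot_table aggregates duplicate (date, location) pairs, averaging them by default. Since you'd rather not average, you could keep the first value instead by setting the aggregation explicitly:

df2 = df.pivot_table(values='CalciumConc_mgL',
                     index='SampleDate',
                     columns='SampleLocation',
                     aggfunc='first').ffill()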

Now we can shift this DataFrame down by 365 rows (i.e., 365 positions in the date index):

df2_lagged = df2.shift(365)
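
Note that shift(365) moves values 365 positions down the index of unique sample dates, not 365 calendar days. If you need a true one-year calendar lag, one alternative (a sketch; it only matches dates exactly 365 days apart, so it may leave more gaps) is to shift the index itself by a frequency:

# Shift the index forward 365 calendar days instead of shifting rows
df2_lagged = df2.shift(365, freq='D')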

Stack the SampleLocation for both df2 and df2_lagged:

df2 = pd.DataFrame(df2.stack('SampleLocation', dropna=False))
df2_lagged = df2_lagged.stack('SampleLocation', dropna=False)

Now merge the lagged data into df2. The two DataFrames have exactly the same structure, so you can just copy the values across:

df2['lagged_val'] = df2_lagged
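
If you only want complete pairs at this point, you could optionally drop the rows that have no lagged partner:

# Keep only rows where both the current and lagged values exist
paired = df2.dropna()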

Finally, swap the location and date index levels and rename the columns:

result = df2.swaplevel(0, 1)
result.columns = ['CalciumConc_mgL', 'CalciumConc_mgL_lagged_12m']
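
Optionally, sort the new index so each location's dates are grouped together:

result = result.sort_index()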

Using a 60-day lag with your sample data:

>>> result.tail(10)
                                 CalciumConc_mgL  CalciumConc_mgL_lagged_12m
SampleLocation       SampleDate                                             
USGS-421548113205301 1950-01-01               59                          59
USGS-422818113225801 1950-01-01               59                         NaN
USGS-423200113472601 1950-01-01               33                          33
USGS-424006113355301 1950-01-01               62                          54
USGS-424142113340901 1950-01-01               54                          54
USGS-424348113242701 1950-01-01               40                         NaN
USGS-424431113412301 1950-01-01               46                         NaN
USGS-424511113291401 1950-01-01               38                          38
USGS-424518113282002 1950-01-01               39                          39
USGS-424659113433701 1950-01-01               39                          39

And to index on just the location ID:

result = result.reset_index().set_index('SampleLocation')

>>> result.loc['USGS-09402500', :]
        CalciumConc_mgL  CalciumConc_mgL_lagged_12m
SampleDate                                             
1941-05-18              NaN                         NaN
1941-05-19              NaN                         NaN
1941-06-16              NaN                         NaN
1941-10-01              102                         NaN
1941-10-12              132                         NaN
1941-10-21              119                         NaN
1943-09-18              110                         NaN
1943-10-01              138                         NaN
1943-10-11              140                         NaN
1943-10-12              140                         NaN
1943-10-14              140                         NaN
1943-10-21              156                         NaN
1944-01-01              116                         NaN
1944-01-11              126                         NaN
1944-01-13              126                         NaN
1944-01-21              133                         NaN
1944-05-01               84                         NaN
1944-05-11               84                         NaN
1944-05-13               66                         NaN
1944-05-15               66                         NaN
1944-05-16               66                         NaN
1944-05-21               57                         NaN
1944-05-22               57                         NaN
1944-06-01               58                         NaN
1944-06-11               57                         NaN
1944-06-21               57                         NaN
1944-09-01              134                         NaN
1944-09-11              122                         NaN
1944-09-15              122                         NaN
1944-09-18              122                         NaN
...                     ...                         ...
1949-05-03               63                          62
1949-05-11               63                          62
1949-05-15               63                          62
1949-05-21               57                          62
1949-06-01               58                         133
1949-06-09               58                         128
1949-06-10               58                         128
1949-06-11               74                         128
1949-06-12               74                         128
1949-06-13               74                         124
1949-06-15               74                         112
1949-06-21               67                         123
1949-06-23               67                         123
1949-06-30               67                         123
1949-09-01              142                         123
1949-09-09              142                         123
1949-09-10              142                         131
1949-09-11              140                         106
1949-09-15              140                         108
1949-09-21              146                         108
1949-09-28              146                         102
1949-10-01              156                         102
1949-10-11              153                         102
1949-10-13              153                          68
1949-10-14              153                          68
1949-10-15              153                          63
1949-10-21              152                          63
1949-10-27              152                          63
1949-10-28              152                          63
1950-01-01              128                          60
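
To close the loop with the test from your question, here is a minimal sketch of running the Wilcoxon-Pratt signed-rank test on the assembled pairs (it assumes scipy is installed; zero_method='pratt' selects the Pratt treatment of zero differences):

from scipy.stats import wilcoxon

# Keep only rows where both members of the pair exist
pairs = result.dropna(subset=['CalciumConc_mgL', 'CalciumConc_mgL_lagged_12m'])

# Paired test for a shift in concentration between years
stat, p = wilcoxon(pairs['CalciumConc_mgL'],
                   pairs['CalciumConc_mgL_lagged_12m'],
                   zero_method='pratt')
print(stat, p)
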
  • I apologize if this is a dumb question, but how and where in your script are you implementing a "60-day lag with your sample data"? – Inkenbrandt Apr 24 '15 at 03:19
  • df2_lagged = df2.shift(365). Replace 365 with 60. The sample data wasn't sufficient for a 365-day lag, so I used a 60-day lag to illustrate the point. – Alexander Apr 24 '15 at 03:21