While this may start out sounding like as statistics question, please bear with me.
I have several calcium concentrations from water samples collected at different sampling locations. The water is resampled at some of the stations on a monthly, yearly, every other year basis.
I want to measure yearly and decadal changes in the calcium concentrations for groups of stations using a Wilcoxon-Pratt signed-rank test, as performed by Lindsey and Rupert (http://pubs.usgs.gov/sir/2012/5049/). To conduct the test, I want to create pairs of data that are separated by a year (time delta of 365 days) or as close to that time frame as possible. The paired measurements should have the same month, just different years. I would only need one pair per month per station. I would prefer not to average sample concentrations of samples that share the same station, month, and year.
Here is a sample of my data: https://raw.githubusercontent.com/inkenbrandt/IPython/master/Calcium_Samples.csv
SampleLocation CalciumConc_mgL
SampleDate
10/1/1947 0:00 USGS-09382000 66.0
10/15/1947 0:00 USGS-09382000 132.0
1/1/1948 0:00 USGS-09382000 130.0
1/15/1948 0:00 USGS-09382000 98.0
5/1/1948 0:00 USGS-09382000 82.0
5/15/1948 0:00 USGS-09382000 53.0
6/1/1948 0:00 USGS-09382000 142.0
9/1/1948 0:00 USGS-09382000 107.0
9/15/1948 0:00 USGS-09382000 59.0
10/1/1948 0:00 USGS-09382000 106.0
10/15/1948 0:00 USGS-09382000 102.0
5/15/1949 0:00 USGS-09382000 59.0
6/1/1949 0:00 USGS-09382000 50.0
6/15/1949 0:00 USGS-09382000 161.0
9/1/1949 0:00 USGS-09382000 82.0
9/15/1949 0:00 USGS-09382000 376.0
10/1/1949 0:00 USGS-09382000 210.0
10/15/1949 0:00 USGS-09382000 131.0
1/1/1950 0:00 USGS-09382000 132.0
... ... ...
9/20/1947 0:00 USGS-09288500 59.0
9/20/1947 0:00 USGS-09288500 59.0
6/9/1948 0:00 USGS-09288500 51.0
6/9/1948 0:00 USGS-09288500 51.0
9/29/1948 0:00 USGS-09288500 51.0
9/29/1948 0:00 USGS-09288500 51.0
9/10/1949 0:00 USGS-09288500 40.0
5/19/1941 0:00 USGS-09295000 33.0
6/16/1941 0:00 USGS-09295000 3.4
5/11/1947 0:00 USGS-09295000 42.0
6/22/1947 0:00 USGS-09295000 32.0
9/20/1947 0:00 USGS-09295000 97.0
6/9/1948 0:00 USGS-09295000 37.0
9/29/1948 0:00 USGS-09295000 126.0
9/10/1949 0:00 USGS-09295000 93.0
[429 rows x 2 columns]
I want to produce a Pandas dataframe that looks something like this:
SampleLocation SampleDate1 CaConc1 SampleDate2 CaConc2
USGS-09382000 10/1/1947 0:00 66.0 10/1/1948 0:00 106.0
USGS-09382000 10/15/1947 0:00 132.0 10/15/1948 0:00 102.0
USGS-09382000 5/15/1948 0:00 53.0 5/15/1949 0:00 59.0
... ... ... ... ...
USGS-09288500 9/20/1947 0:00 59.0 9/29/1948 0:00 51.0
I believe that this might be approached using the multi-indexing functionality in Pandas. So far, I have looked at the following stackoverflow question for help to match dates and manipulate using indexing:
I think the second link gets pretty close using unstacking multi-indexes, and I may be able to perform this if I am willing to aggregate, but I am trying to avoid that.
This technique would be relevant to others who want to analyze data with seasonal trends, such as comparing stream discharge or cumulative precipitation or temperature on the same day or close to the same day.