I have the following data. Note for each combination of FPEDATS and CUSIP, there are multiple STATPERS.
CUSIP,STATPERS,NUMEST,FPEDATS
87482X10,20140417,4,20141231
87482X10,20140515,4,20141231
87482X10,20140619,5,20141231
12345678,20140717,3,20141231
12345678,20140814,5,20141231
87482X10,20140417,1,20151231
87482X10,20140515,2,20151231
87482X10,20140619,3,20151231
12345678,20140717,4,20151231
12345678,20140814,5,20151231
If I use panda.read_csv with columns = ["FPEDATS", "CUSIP"]. I get the following.
STATPERS NUMEST
CUSIP FPEDATS
87482X10 2014-12-31 2014-04-17 4
2014-12-31 2014-05-15 4
2014-12-31 2014-06-19 5
12345678 2014-12-31 2014-07-17 3
2014-12-31 2014-08-14 5
87482X10 2015-12-31 2014-04-17 1
2015-12-31 2014-05-15 2
2015-12-31 2014-06-19 3
12345678 2015-12-31 2014-07-17 4
2015-12-31 2014-08-14 5
For each combination of CUSIP and FPEDATS, I am interested in:
- can we only keep the one with the the largest date stamp and delete the other rows? (Basically filter by largest STATPERS)
- can we calculate an average of the the values of NUMEST for each combination of indices, either save this or replace the current data with it?
- can we, for example, assign a new attribute, for example "price" which is common to each combination of CUSIP and FPEDATS.
For 2, I mean something like:
STATPERS NUMEST
CUSIP FPEDATS
12345678 2014-12-31 2014-08-14 5
2015-12-31 2014-08-14 5
87482X10 2014-12-31 2014-06-19 5
2015-12-31 2014-06-19 3
For 3, I mean I want something like
STATPERS NUMEST Price
CUSIP FPEDATS
87482X10 2014-12-31 2014-04-17 4 2
2014-12-31 2014-05-15 4 2
2014-12-31 2014-06-19 5 2
12345678 2014-12-31 2014-07-17 3 4
2014-12-31 2014-08-14 5 4
87482X10 2015-12-31 2014-04-17 1 5
2015-12-31 2014-05-15 2 5
2015-12-31 2014-06-19 3 5
12345678 2015-12-31 2014-07-17 4 7
2015-12-31 2014-08-14 5 7
when I only want to parse in [("CUSIP", "FPEDATS", "PRICE")] (as a list of tuple or a Panda Series)? Is there a slick way of doing that?
So far, the only way I came up with solving all of these problems is to use csv reader in the csv package to read the csvs, then throw everything back to panda format. I hope there is a simpler way of doing this.
UPDATE: I managed to do 2 by a MultiIndexing approach. basically calculate The MultiIndex by using
c = analyst_file.groupby(["CUSIP", "FPEDATS"])["STATPERS"].max()
Then we write these indices as a list of tuples and use slicing. Still would like do things a bit more slickly.