0

I have the following data. Note for each combination of FPEDATS and CUSIP, there are multiple STATPERS.

CUSIP,STATPERS,NUMEST,FPEDATS
87482X10,20140417,4,20141231
87482X10,20140515,4,20141231
87482X10,20140619,5,20141231
12345678,20140717,3,20141231
12345678,20140814,5,20141231
87482X10,20140417,1,20151231
87482X10,20140515,2,20151231
87482X10,20140619,3,20151231
12345678,20140717,4,20151231
12345678,20140814,5,20151231

If I use panda.read_csv with columns = ["FPEDATS", "CUSIP"]. I get the following.

                      STATPERS    NUMEST
CUSIP    FPEDATS                      
87482X10 2014-12-31 2014-04-17       4
         2014-12-31 2014-05-15       4
         2014-12-31 2014-06-19       5
12345678 2014-12-31 2014-07-17       3
         2014-12-31 2014-08-14       5
87482X10 2015-12-31 2014-04-17       1
         2015-12-31 2014-05-15       2
         2015-12-31 2014-06-19       3
12345678 2015-12-31 2014-07-17       4
         2015-12-31 2014-08-14       5

For each combination of CUSIP and FPEDATS, I am interested in:

  1. can we only keep the one with the the largest date stamp and delete the other rows? (Basically filter by largest STATPERS)
  2. can we calculate an average of the the values of NUMEST for each combination of indices, either save this or replace the current data with it?
  3. can we, for example, assign a new attribute, for example "price" which is common to each combination of CUSIP and FPEDATS.

For 2, I mean something like:

                      STATPERS    NUMEST
CUSIP    FPEDATS
12345678 2014-12-31 2014-08-14      5
         2015-12-31 2014-08-14      5
87482X10 2014-12-31 2014-06-19      5
         2015-12-31 2014-06-19      3

For 3, I mean I want something like

                      STATPERS  NUMEST  Price
CUSIP    FPEDATS                      
87482X10 2014-12-31 2014-04-17       4   2
         2014-12-31 2014-05-15       4   2
         2014-12-31 2014-06-19       5   2
12345678 2014-12-31 2014-07-17       3   4
         2014-12-31 2014-08-14       5   4
87482X10 2015-12-31 2014-04-17       1   5
         2015-12-31 2014-05-15       2   5
         2015-12-31 2014-06-19       3   5
12345678 2015-12-31 2014-07-17       4   7
         2015-12-31 2014-08-14       5   7

when I only want to parse in [("CUSIP", "FPEDATS", "PRICE")] (as a list of tuple or a Panda Series)? Is there a slick way of doing that?

So far, the only way I came up with solving all of these problems is to use csv reader in the csv package to read the csvs, then throw everything back to panda format. I hope there is a simpler way of doing this.

UPDATE: I managed to do 2 by a MultiIndexing approach. basically calculate The MultiIndex by using

c = analyst_file.groupby(["CUSIP", "FPEDATS"])["STATPERS"].max()

Then we write these indices as a list of tuples and use slicing. Still would like do things a bit more slickly.

Lost1
  • 990
  • 1
  • 14
  • 34

1 Answers1

1

assuming we have the following DF:

In [193]: df
Out[193]:
                      STATPERS  NUMEST
CUSIP    FPEDATS
87482X10 2014-12-31 2014-04-17       4
         2014-12-31 2014-05-15       4
         2014-12-31 2014-06-19       5
12345678 2014-12-31 2014-07-17       3
         2014-12-31 2014-08-14       5
87482X10 2015-12-31 2014-04-17       1
         2015-12-31 2014-05-15       2
         2015-12-31 2014-06-19       3
12345678 2015-12-31 2014-07-17       4
         2015-12-31 2014-08-14       5

we can do the following:

In [192]: df.groupby(level=[0,1]).agg({'STATPERS':'max','NUMEST':'mean'})
Out[192]:
                      STATPERS    NUMEST
CUSIP    FPEDATS
12345678 2014-12-31 2014-08-14  4.000000
         2015-12-31 2014-08-14  4.500000
87482X10 2014-12-31 2014-06-19  4.333333
         2015-12-31 2014-06-19  2.000000

PS please elaborate on common to each combination of CUSIP and FPEDATS - it's not clear...

MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
  • thank you for your answer, I have updated my question. I found the groupby().max() but that didn't quite do the trick. – Lost1 Jul 16 '17 at 16:18
  • maybe I should have made it clearer, what you have written is not *quite* what I am looking for though I was also interested in doing what you have written. My question 1 and question 2 were separate questions. – Lost1 Jul 16 '17 at 16:20
  • STATPERS is unique for each combination of CUSIP and FPEDATS, so STATPERS NUMEST CUSIP alone can be used identify the NUMEST uniquely. I wanted to find the max STATPERS for each combination of CUSIP and FPEDATS. Then find the NUMEST corresponding to each CUSIP, FPEDATS and NUMEST. – Lost1 Jul 16 '17 at 16:22
  • @Lost1, [SO encorauges you to ask one question per post](https://meta.stackexchange.com/questions/39223/one-post-with-multiple-questions-or-multiple-posts). Beside that it's highly recommended to post a desired data set when asking Numpy/Pandas/TensorFlow/etc. questions... Please read [how to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and edit your post correspondingly. – MaxU - stand with Ukraine Jul 16 '17 at 16:24
  • noted, I will do that in the future. I have updated the question. thanks a lot for your help – Lost1 Jul 16 '17 at 16:25
  • @Lost1, i don't understand the logic behind the values in the `Price` column in desired data set (`For 3`)... – MaxU - stand with Ukraine Jul 16 '17 at 16:29
  • I supply a series of things which are unique for each "CUSIP FPEDATS" and want to join it to the existing. effectively "broadcasting" it to the dimensions "STATPERS NUMEST". – Lost1 Jul 16 '17 at 16:33