2

I am working with a large data set containing portfolio holdings of clients per date (i.e. in each time period, I have a number of stock investments for each person). My goal is to try and identify 'buys' and 'sells'. A buy happens when a new stock appears in a person's portfolio (compared to the previous period). A sell happens when a stock disappears in a person's portfolio (compared to the previous period). Is there an easy/efficient way to do this in Python? I can only think of a cumbersome way via for-loops.

Suppose we have the following dataframe:

enter image description here

which can be computed with the following code:

df = pd.DataFrame({'Date_ID':[1,1,1,1,2,2,2,2,2,2,3,3,3,3], 'Person':['a', 'a', 'b', 'b', 'a', 'a', 'a', 'a', 'b', 'b', 'a', 'a', 'a', 'b'], 'Stock':['x1', 'x2', 'x2', 'x3', 'x1', 'x2', 'x3', 'x4', 'x2', 'x3', 'x1', 'x2', 'x3', 'x3']})

I would like to create the 'bought' and 'sell' columns which identify stocks that have been added or are going to be removed from the portfolio. The buy column equals 'True' if the stock newly appears in the persons portfolio (compared to the previous date). The Sell column equals True in case the stock disappears from the person's portfolio the next date.

How to accomplish this (or something similar to identify trades efficiently) in Python?

John
  • 555
  • 3
  • 16
  • So should "Bought" of the second row (with index `1`) be `True`? It changed from 'x1' to 'x2' – Vladimir Fokow Aug 16 '22 at 13:23
  • 1
    "Bought" in the second row with should not be True. Person "a" in date 1 has a portfolio containing the x1 and x2 stocks (and person "b" has a portfolio containing x2 and x2 stocks at this time). – John Aug 16 '22 at 13:31
  • It can be the case that the "Bought" and "Sell" columns are equal to True at the same time. This means that the stock has been bought at time t (compared to t-1) and will get sold again at time t+1. – John Aug 16 '22 at 13:33

1 Answers1

3

You can group your dataframe by 'Person' first because people are completely independent from each other.

After that, for each person - group by 'Date_ID', and for each stock in a group determine if it is present in the next group:

def get_person_indicators(df):
    """`df` here contains info for 1 person only."""
    g = df.groupby('Date_ID')['Stock']

    prev_stocks = g.agg(set).shift()
    was_bought = g.transform(lambda s: ~s.isin(prev_stocks[s.name]) 
                             if not pd.isnull(prev_stocks[s.name]) 
                             else False)

    next_stocks = g.agg(set).shift(-1)
    will_sell = g.transform(lambda s: ~s.isin(next_stocks[s.name]) 
                            if not pd.isnull(next_stocks[s.name]) 
                            else False)

    return pd.DataFrame({'was_bought': was_bought, 'will_sell': will_sell})


result = pd.concat([df, df.groupby('Person').apply(get_person_indicators)], 
                   axis=1)

Note:

For better memory usage you can change the dtype of the 'Stock' column from str to Categorical:

df['Stock'] = df['Stock'].astype('category')
Vladimir Fokow
  • 3,728
  • 2
  • 5
  • 27