1

I have a pandas dataframe like as shown below

Company,year                                   
T123 Inc Ltd,1990
T124 PVT ltd,1991
T345 Ltd,1990
T789 Pvt.LTd,2001
ABC Limited,1992
ABCDE Ltd,1994
ABC Ltd,1997
ABFE,1987
Tesla ltd,1995
AMAZON Inc,2001
Apple ltd,2003

tf = pd.read_clipboard(sep=',')
tf['Company_copy'] = tf['Company']

I would like to compare each value from tf['company'] against 5 rows above and 5 rows below from its current position.

For ex: I want T123 Inc Ltd to be compared with T124,T345,T789,ABC,ACDE. As T123 is the top row, there is nothing to compare above, we don't generate any comparison pair.

Similarly, if there is noT enough (5 rows) rows to compare, we compare it with whatever we have.

So, I tried the below with the help of this post here

pd.MultiIndex.from_product([tf['Company'].astype(str),tf['Company_copy'].astype(str)]).to_series()

but it produces m*n comparison across all rows of the dataframe

This is because, we have million records and cannot use multiindex_from product to generate all comparisons (which is not useful for us). Just the top and bottom 5 comparison pair for each row is enough. Is there anyway to apply filter multiindex to generate pair based on above and below 5 rows?

I expect my output to be like as below. I show for only one record T123 Inc Ltd.

Company       Company     
  
T123 Inc Ltd  T124 PVT ltd    (T123 Inc Ltd, T124 PVT ltd)
              T345 Ltd            (T123 Inc Ltd, T345 Ltd)
              T789 Pvt.LTd    (T123 Inc Ltd, T789 Pvt.LTd)
              ABC Limited      (T123 Inc Ltd, ABC Limited)
              ABCDE Ltd          (T123 Inc Ltd, ABCDE Ltd)
The Great
  • 7,215
  • 7
  • 40
  • 128
  • Could you maybe add an example with a company where there is top and bottom 5 (instead of first row)? I'm not sure what the expected output looks like then. – Rabinzel May 26 '22 at 10:20
  • I am responding via mobile. So, if we take 'ABCDE' as input, we want 5 rows above, which is T123, 124, 345, 789, ABC Limited and 5 rows below ABCDE which is ABC, ABFE, Tesla, Amazon, Apple. – The Great May 26 '22 at 10:26

2 Answers2

1

A possible solution using Series.rolling with center=True and window size 11 (=5+1+5) on Company and then exclude the tuples with the middle row:

from itertools import chain, product

idx = pd.MultiIndex.from_tuples(chain(*(product([row], win)
                         for row, win in zip(tf['Company'], tf['Company'].rolling(11, min_periods=1, center=True))))).to_series()
        
idx = idx[idx.index.get_level_values(0) != idx.index.get_level_values(1)]

Example results:

print(idx['T123 Inc Ltd'])
T124 PVT ltd    (T123 Inc Ltd, T124 PVT ltd)
T345 Ltd            (T123 Inc Ltd, T345 Ltd)
T789 Pvt.LTd    (T123 Inc Ltd, T789 Pvt.LTd)
ABC Limited      (T123 Inc Ltd, ABC Limited)
ABCDE Ltd          (T123 Inc Ltd, ABCDE Ltd)
dtype: object
print(idx['ABCDE Ltd'])
T123 Inc Ltd    (ABCDE Ltd, T123 Inc Ltd)
T124 PVT ltd    (ABCDE Ltd, T124 PVT ltd)
T345 Ltd            (ABCDE Ltd, T345 Ltd)
T789 Pvt.LTd    (ABCDE Ltd, T789 Pvt.LTd)
ABC Limited      (ABCDE Ltd, ABC Limited)
ABC Ltd              (ABCDE Ltd, ABC Ltd)
ABFE                    (ABCDE Ltd, ABFE)
Tesla ltd          (ABCDE Ltd, Tesla ltd)
AMAZON Inc        (ABCDE Ltd, AMAZON Inc)
Apple ltd          (ABCDE Ltd, Apple ltd)
dtype: object
print(idx['Apple ltd'])
ABCDE Ltd      (Apple ltd, ABCDE Ltd)
ABC Ltd          (Apple ltd, ABC Ltd)
ABFE                (Apple ltd, ABFE)
Tesla ltd      (Apple ltd, Tesla ltd)
AMAZON Inc    (Apple ltd, AMAZON Inc)
dtype: object

Next five rows:

There are two ways of accomplishing this:

  • Using a custom indexer with forward indexing:
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=6)
idx = pd.MultiIndex.from_tuples(chain(*(product([row], win[1:])
                         for row, win in zip(tf['Company'], 
                                             tf['Company'].rolling(indexer, min_periods=1))))).to_series()
  • Reversing the DataFrame before rolling (I learned this one in mozway's answer here):
idx = pd.MultiIndex.from_tuples(chain(*(product([row], win[::-1])
                         for row, win in zip(tf['Company'][::-1], 
                                             tf['Company'][::-1]
                                             .rolling(5, min_periods=1, closed='left'))))).to_series()

Example:

print(idx['T123 Inc Ltd'])
T124 PVT ltd    (T123 Inc Ltd, T124 PVT ltd)
T345 Ltd            (T123 Inc Ltd, T345 Ltd)
T789 Pvt.LTd    (T123 Inc Ltd, T789 Pvt.LTd)
ABC Limited      (T123 Inc Ltd, ABC Limited)
ABCDE Ltd          (T123 Inc Ltd, ABCDE Ltd)
dtype: object

However, I think here the custom indexer is the cleanest because it is clearer what it does.

user2246849
  • 4,217
  • 1
  • 12
  • 16
  • Nice answer. I was reading rolling function. When we give size of the moving window as 11, what does it do? Which part of your code, tells it to look backward and forward? How does it know to split the window size 11 as backward 5, forward 5 – The Great May 26 '22 at 12:06
  • @TheGreat by setting the parameter `center=True` we ensure 5 rows are before and 5 are after (if they are available). We need 11 as size because it will be 5 (before) + 1 (current row) + 5 (after) – user2246849 May 26 '22 at 12:08
  • But what if we set center = False, it will look at only past 5 points (5 points above from current position)? Let's say if I want to look at only 5 points below from current position? How can we do that? – The Great May 27 '22 at 00:14
  • @TheGreat to look ahead you either have to use a custom indexer or reverse the DataFrame before rolling. I edited my answer with an example, hope that helps. – user2246849 May 27 '22 at 06:06
1

Depending on how really do you want to compare, this may or may not work for you.

I would loop over the dataframe and create new columns. Something like...

for i in range(5):
    # df['company_value'] represents the column that contains representative company values
    df[f'compare below {i}'] = list(df.loc[[i+1:],'company_value']) + ([np.nan] * (i+1))

for i in range(5):
    # df['company_value'] represents the column that contains representative company values
    df[f'compare above {i}'] = ([np.nan] * (i+1)) + list(df.loc[[:-(i+1)],'company_value'])

these 2 loops will give you 2x5 columns that contain values for 5 companies above and below. You can perform and operations on them and then drop the columns

The syntax can definitely be improved (maybe without having to use list), but this should work just fine.

Ahmed
  • 11
  • 2