0

I have 2 DFs that have different sizes, something like:

DF1:

Metric Market BSegment MSegment Category Num Den Period
 KPI1   GA     PS       Comm      BF     20  100 202201
 KPI1   CE     PR       Cons      BF     70  128 202201
 KPI2   NWE    PR       Cons      On     12  389 202203
  .
  .
  .
 KPI20  ISE    PS       Cons      On     80  288 202204
 KPI30  GA     PS       Comm      BF     32  85  202204

DF2:

Metric Market BSegment MSegment Category StartPeriod EndPeriod Goal
 KPI1   *      PS       Comm      BF        202201     202205   5
 KPI1   *      PR       Cons      BF        202201     202205   10
 KPI2   NWE    PR        *        On        202112     202203   60
  .
  .
  .
 KPI2   ISE    PS        *        On        202112     202203   40
 KPI3   GA     PS       Comm      BF        202201     202206   22
 KPI3   CE     PS       Comm      BF        202201     202206   26
 KPI3   NWE    PS       Comm      BF        202201     202206   20

I am trying to get a new column in DF1 where I bring the values of the 'Goal' column in DF2 when the other columns are matchig. The main problem is that in DF2, in order to not repeat the same line multiple times, when a goal is applied, for example, to all markets instead of a line per market it states '*'.

Also there could be metrics in DF1 that are not appearing in DF2 as there are no goals.

The logic would be something like:

(df1['Metric'] == df2['Metric']) &
(df1['Market'] == df2['Market'] | df2['Market'] == '*') & 
(df1['BSegment'] == df2['BSegment'] | df2['BSegment'] == '*') &
(df1['MSegment'] == df2['MSegment'] | df2['MSegment'] == '*') &
(df1['Category'] == df2['Category'] | df2['Category'] == '*') &
(df2['StartPeriod'] <= df1['Period'] <= df2['EndPeriod'] )

I've tried with np.where and np.select as:

df1['Numerator.Goal'] = (np.where((df1['Metric'] == df2['Metric']) &
(df1['Market'] == df2['Market'] | df2['Market'] == '*') & 
(df1['BSegment'] == df2['BSegment'] | df2['BSegment'] == '*') &
(df1['MSegment'] == df2['MSegment'] | df2['MSegment'] == '*') &
(df1['Category'] == df2['Category'] | df2['Category'] == '*') &
(df2['StartPeriod'] <= df1['Period'] <= df2['EndPeriod'] ), df2['Goal'], np.nan))
--------
df1['Numerator.Goal'] = (np.select((df1['Metric'] == df2['Metric']) &
(df1['Market'] == df2['Market'] | df2['Market'] == '*') & 
(df1['BSegment'] == df2['BSegment'] | df2['BSegment'] == '*') &
(df1['MSegment'] == df2['MSegment'] | df2['MSegment'] == '*') &
(df1['Category'] == df2['Category'] | df2['Category'] == '*') &
(df2['StartPeriod'] <= df1['Period'] <= df2['EndPeriod'] ), df2['Goal']))

but it ends up with a "ValueError: Can only compare identically-labeled Series objects"

I've think also on doing a .merge() but I havn't seen how to deal with the '*' symbol so I have not been successful at it.

By any chance, would you know how to accomplish this? Any help is appreciated!

Julius
  • 3
  • 3
  • 1
    please get `.head(10).to_dict()` of your data and add that here. (for both `df`s) – MoRe May 09 '22 at 11:48
  • I guess it would be relatively easy to replace the `*` with what it stands for, i.e., expand the data so all possible values are explicitly there. The `Period` logic is what seems more difficult to me (at least it won't allow a simple `merge` AFAICT). – fsimonjetz May 09 '22 at 11:49
  • Hello @fsimonjetz ! Unfortunatelly I cannot change the `*` – Julius May 09 '22 at 12:12
  • Problem is that df1 and df2 length don't match. – Ynjxsjmh May 09 '22 at 12:33
  • Hello @MoRe this is some organizational data that I could not really share directly this is why I used the df1 and df2 as some pseudo DFs. Also to mention there are many columns in df1 so it would be a huge comment. – Julius May 09 '22 at 12:35

2 Answers2

0

So let's say you have these two dataframes:

import pandas as pd

df = pd.DataFrame({'Market': ['GA', 'CE'], 
                   'Period': [1, 2]})
df2 = pd.DataFrame({'Market': ['CE', 'MWE', 'GA', '*'], 
                    'Period': [2, 3, 4, 1], 
                    'Goal': [21, 22, 23, 24]})

# df
  Market  Period
0     GA       1
1     CE       2

# df2
  Market  Period  Goal
0     CE       2    21
1    MWE       3    22
2     GA       4    23
3      *       1    24

What you want is to add the Goal values to the first 'df'. Merging gives

>>> df.merge(df2)
  Market  Period  Goal
0     CE       2    21

but you want

  Market  Period  Goal
0     GA       1    24
1     CE       2    21

because what the * actually means is:

# explicit meaning of the *-row
3     CE       1    24
3    MWE       1    24
3     GA       1    24

So one way to make it work is to replace the wildcard with the explicit values and merge afterwards. Here is one naive way to do it, there are probably more reasonable/pandaesque strategies with some additional merging, but for the sake of argument:

markets = ['CE', 'MWE', 'GA']

# Expand the wildcard '*' by replacing each occurrence with all markets
df2['Market'] = df2.Market.apply(lambda x: markets if x == '*' else x)

# Explode the lists into individual rows
df2 = df2.explode('Market')

The result of this is this variant of df2:

  Market  Period  Goal
0     CE       2    21
1    MWE       3    22
2     GA       4    23
3     CE       1    24
3    MWE       1    24
3     GA       1    24

Now, df.merge(df2) results in

  Market  Period  Goal
0     GA       1    24
1     CE       2    21

The same applies to other columns containing the wildcard *. The Period merging is really a separate question altogether. Check out this question, for example.

fsimonjetz
  • 5,644
  • 3
  • 5
  • 21
0
indexes = ["Metric" ,'Market',  'BSegment'  ,'MSegment',    "Category"]

def get(df, boolean_indexes):
    for index, boolean_index in zip(df.index, boolean_indexes):
        yield not np.any(df.iloc[index][boolean_index].values != "*")
def find(values, period):
    for value in values:
        if value[-3] <= period <= value[-2]:
            return value[-1]
df1.apply(lambda x: find(df2.loc[list(get(df2[indexes], df2[indexes].values != x[indexes].values))][['StartPeriod', 'EndPeriod' ,'Goal']].values, x["Period"]), axis=1)
MoRe
  • 2,296
  • 2
  • 3
  • 23