1

I have a Data table that has comma separated columns. I want to group by data based on each value in the comma separated column.

Date        Investment Type                                    Medium
1/1/2000    Mutual Fund, Stocks, Fixed Deposit, Real Estate    Own, Online,Through Agent
1/2/2000    Mutual Fund, Stocks, Real Estate                   Own
1/3/2000    Fixed Deposit                                      Online
1/3/2000    Mutual Fund, Fixed Deposit, Real Estate            Through Agent
1/2/2000    Stocks                                             Own, Online,                               Through Agent

I have to group by Medium and Investment type as given below. The Medium is given as input to the software I am writing.

Medium Investment Type Date

Online        Stocks            1/2/2000,1/1/2000
Own           Mutual Fund       1/1/2000,1/3/2000

I have done search using the input I receive and I do get the results. But I am not able to get into the aggregated format that I want. I am new to Python and Pandas. Appreciate your help. Thanks

Rahul Agarwal
  • 4,034
  • 7
  • 27
  • 51
Sid
  • 552
  • 6
  • 21

1 Answers1

1

First extract values in Medium column by list with Series.str.findall and regex word boundary:

L = ['Online','Own']
pat = '|'.join(r"\b{}\b".format(x) for x in L)
df['New_Medium'] = df.pop('Medium').str.findall('('+ pat + ')').str.join(', ')
#remove rows with empty values
df = df[df['New_Medium'].astype(bool)]

print (df)
       Date                                  Investment Type   New_Medium
0  1/1/2000  Mutual Fund, Stocks, Fixed Deposit, Real Estate  Own, Online
1  1/2/2000                 Mutual Fund, Stocks, Real Estate          Own
2  1/3/2000                                    Fixed Deposit       Online
4  1/2/2000                                           Stocks  Own, Online

Last get all combinations with product and last aggregate join:

from  itertools import product
df1 = pd.DataFrame([j for i in df.apply(lambda x: x.str.split(',\s*')).values 
                      for j in product(*i)], columns=df.columns)
df = df1.groupby(['Investment Type','New_Medium'])['Date'].agg(', '.join).reset_index()
print (df)
  Investment Type New_Medium                          Date
0   Fixed Deposit     Online            1/1/2000, 1/3/2000
1   Fixed Deposit        Own                      1/1/2000
2     Mutual Fund     Online                      1/1/2000
3     Mutual Fund        Own            1/1/2000, 1/2/2000
4     Real Estate     Online                      1/1/2000
5     Real Estate        Own            1/1/2000, 1/2/2000
6          Stocks     Online            1/1/2000, 1/2/2000
7          Stocks        Own  1/1/2000, 1/2/2000, 1/2/2000
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Ok, How do I take an input and perform the aggregation based on the input? – Sid Aug 31 '18 at 12:14
  • @Sid - Can you expalin more? Do you think filtering like [boolean indexing](https://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas) ? – jezrael Aug 31 '18 at 12:15
  • @jezrale Assume that the Input from user is "Online,Own" and so the above aggregation has to be done only for "Online" and "Own" options of medium. Hope its clear now. – Sid Aug 31 '18 at 12:18
  • @Sid - Yes, I understand. Is not possible first create all combination like my answer and then filtering? – jezrael Aug 31 '18 at 12:19
  • The Product is creating cartesian product and so it contains combination of each element in each list with the other lists. thats making the tasks of filtering difficult. I am trying to filter the data first, then split if there is a match and aggregate the content. Any help for this is appreciated. Thanks – Sid Sep 03 '18 at 08:21
  • @Sid - oops, I find bug - need `findall` for return all possible values of `Medium`, edited answer. – jezrael Sep 03 '18 at 08:42
  • Your solution worked well. But the requirement is changed. Now if there is a match of Medium, I have to do aggregation based on each Investment type, Also, the Medium level grouping should be at list level and not at Individual item from List. i.e of Medium is [Online, Own]. Group data by [Online, Own] for each investment type that is found when we search in df for the Medium list. – Sid Sep 05 '18 at 12:17
  • I have posted a new question and its here https://stackoverflow.com/questions/52199952/aggregate-based-on-each-item-in-a-special-character-seperated-column-in-pandas – Sid Sep 06 '18 at 08:41
  • @Sid - Please give me some time, because meeting. – jezrael Sep 06 '18 at 08:45