0

I have a Pandas DataFrame which looks like so:

   user_id  item_timestamp                item_cashtags                                       item_sectors                                    item_industries
0   406225      1483229353                          SPY                                          Financial                               Exchange Traded Fund
1   406225      1483229353                          ERO                                          Financial                               Exchange Traded Fund
2   406225      1483229350  CAKE|IWM|SDS|SPY|X|SPLK|QQQ  Services|Financial|Financial|Financial|Basic M...  Restaurants|Exchange Traded Fund|Exchange Trad...
3   619769      1483229422                         AAPL                                         Technology                                 Personal Computers
4   692735      1483229891                         IVOG                                          Financial                               Exchange Traded Fund

I'd like to split the cashtags, sectors and industries columns by |. Each cashtag corresponds to a sector which corresponds to an industry, so they are of equal amounts.

I'd like the output to be such that each cashtag, sector and industry have their own row, with the item_timestamp and user_id copying over, ie:

   user_id  item_timestamp                item_cashtags              item_sectors                                    item_industries
2   406225      1483229350               CAKE|IWM|SDS               Services|Financial|Financial        Restaurants|Exchange Traded Fund|Exchange Traded Fund

would become:

 user_id  item_timestam       item_cashtags         item_sectors              item_industries
406225      1483229350          CAKE                Services                    Restaurants
406225      1483229350          IWM                 Financial                   Exchange Traded Fund
406225      1483229350          SDS                 Financial                   Exchange Traded Fund

My problem is that this is a conditional split which I'm not sure how to do in Pandas

apgsov
  • 794
  • 1
  • 8
  • 30
  • How does that work for `item_industries`? – Roelant Mar 05 '19 at 15:02
  • @Roelant I'm not sure what you mean, `item_industries` has the same format as `item_sectors` and `item_cashtags`, each item separated by a pipe corresponds to each other. – apgsov Mar 05 '19 at 15:08
  • you could have more lines with pipe? – Frenchy Mar 05 '19 at 15:21
  • 1
    The following answer might help you: https://stackoverflow.com/questions/12680754/split-explode-pandas-dataframe-string-entry-to-separate-rows/40449726#40449726 – sibbl Mar 05 '19 at 15:27
  • @sibbl This worked perfectly, thanks. All I did was apply a lambda conditional split to my rows containing pipes then passed it through the explode function in this link. If I could accept your comment as the accepted answer, I would. – apgsov Mar 05 '19 at 16:10

2 Answers2

0

If the frame is not to large, one easy option is to just loop through the rows. But I agree it is not the most pandamic way to do it, and definitely not the most performing one.

from copy import copy
result = []  
for idx, row in df.iterrows():
    d = dict(row)
    for cat1, cat2 in zip(d['cat1'].split('|'), d['cat2'].split('|')): 
        # here you can add an if to filter on certain categories 
        dd = copy(d)
        dd['cat1'] = cat1
        dd['cat2'] = cat2
        result.append(dd)
pd.DataFrame(result)  # convert back 
Roelant
  • 4,508
  • 1
  • 32
  • 62
0

Okay, I don't know how performant this will be, but here's another approach

# test_data
df_dict = {
    "user_id": [406225, 406225],
    "item_timestamp": [1483229350, 1483229353],
    "item_cashtags": ["CAKE|IWM|SDS", "SPY"],
    "item_sectors": ["Services|Financial|Financial", "Financial"],
    "item_industries": [
        "Restaurants|Exchange Traded Fund|Exchange Traded Fund",
        "Exchange Traded Fund"
    ]
}
df = pd.DataFrame(df_dict)

# which columns to split; all others should be "copied" over
split_cols = ["item_cashtags", "item_sectors", "item_industries"]
copy_cols = [col for col in df.columns if col not in split_cols]

# for each column, split on |. This gives a list, so values is an array of lists
# summing values concatenates these into one long list
new_df_dict = {col: df[col].str.split("|").values.sum() for col in split_cols}

# n_splits tells us how many times to replicate the values from the copied columns
# so that they'll match with the new number of rows from splitting the other columns
n_splits = df.item_cashtags.str.count("\|") + 1
# we turn each value into a list so that we can easily replicate them the proper
# number of times, then concatenate these lists like with the split columns
for col in copy_cols:
    new_df_dict[col] = (df[col].map(lambda x: [x]) * n_splits).values.sum()

# now make a df back from the dict of columns
new_df = pd.DataFrame(new_df_dict)

# new_df
#   item_cashtags item_sectors item_industries      user_id item_timestamp
# 0 CAKE          Services     Restaurants          406225  1483229350
# 1 IWM           Financial    Exchange Traded Fund 406225  1483229350
# 2 SDS           Financial    Exchange Traded Fund 406225  1483229350
# 3 SPY           Financial    Exchange Traded Fund 406225  1483229353
Nathan
  • 9,651
  • 4
  • 45
  • 65