
I am trying to deduplicate the data and also perform an operation on each of the grouped rows by iterating over them.

I group by the label field using pandas' groupby function and then pivot each label into its own column. The value for each label comes from the tokens field, split on "|". I am able to do this successfully, but performance on a large DataFrame is quite slow.

Iterating over the grouped rows in a for loop runs at about 200 it/s, which doesn't scale to large data. Is there a way I can do it faster?

I have tried iterating over the groupby values, but it is quite slow. I also tried np.vectorize, but found that it essentially loops over the data as well.

E.g., the following is dummy data:

    categories = ["DEF", "NAME", "ADD"]
    id  text    label   tokens           id  text   DEF         NAME        ADD
    1   "abc"   DEF     X1 | X2     =>   1   "abc"  [X1,X2]     [Y1,Y2]     [Z1,Z2]
    1   "abc"   NAME    Y1 | Y2          2   "xyz"  [P1, P2]    [M1, M2]    []
    1   "abc"   ADD     Z1 | Z2
    2   "xyz"   DEF     P1 | P2
    2   "xyz"   NAME     M1 | M2
"Code for deduplicating and mapping to columns"
    from typing import List

    import pandas as pd
    from tqdm import tqdm

    def deduplicate_data(
            df: pd.DataFrame,
            categories: List[str],
            category_column: str,
            token_column: str,
    ) -> pd.DataFrame:
        # Output columns: id, text, then one column per category.
        new_columns = ["id", "text"] + list(categories)
        acc = []
        grouped = df.groupby(["id", "text"])
        for (item_id, div_text), rows_idx in tqdm(grouped.groups.items(),
                                                  total=len(grouped)):
            rows = df.loc[rows_idx]  # select the rows in this (id, text) group
            # Collapse the group into one token list per category.
            row = categories_to_list(rows, categories, category_column, token_column)
            row.insert(0, div_text)
            row.insert(0, item_id)
            acc.append(row)
        return pd.DataFrame(acc, columns=new_columns)
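
A usage sketch, assuming `df` and the column names from the dummy example above:

    result = deduplicate_data(
        df,
        categories=["DEF", "NAME", "ADD"],
        category_column="label",
        token_column="tokens",
    )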

The `categories_to_list` function converts the selected tokens for each label into a list. I have included only the main function for simplicity.
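
Since the real body is not shown, here is a purely hypothetical sketch of what `categories_to_list` might look like, based on the description above:

    # Hypothetical reconstruction; the question does not show the actual implementation.
    def categories_to_list(rows, categories, category_column, token_column):
        out = []
        for cat in categories:
            match = rows.loc[rows[category_column] == cat, token_column]
            # Split "X1 | X2" into ["X1", "X2"]; use [] when the label is absent.
            out.append([t.strip() for t in match.iloc[0].split("|")] if len(match) else [])
        return out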

I am expecting it to perform significantly faster.

EDIT: The data might contain duplicate entries for the {id, text, label} index.

    categories = ["DEF", "NAME", "ADD"]
    id  text    label   tokens           id  text   DEF             NAME     ADD
    1   "abc"   DEF     X1 | X2     =>   1   "abc"  [X1,X2]         [Y1,Y2]  [Z1,Z2]
    1   "abc"   NAME    Y1 | Y2          2   "xyz"  [P1,P2,M1,M2]   []       []
    1   "abc"   ADD     Z1 | Z2
    2   "xyz"   DEF     P1 | P2
    2   "xyz"   DEF     M1 | M2

### EDIT 2
I need to make sure the output returns `[]` and not None/NaN values for the newly mapped fields.
  • for `xyz` in the text column, you have 2 `DEF`; why does the expected output have an entry under `NAME`? What do you intend to do when you have duplicates, extend the list or something? – anky Aug 31 '19 at 04:21
  • Sorry, I updated it; yes, I intend to extend the list when there are duplicates... – xavlock Aug 31 '19 at 18:04

1 Answer


From what I understand, you can split the `tokens` column on `|`, then `set_index` and `unstack()`:

    final = (df.assign(tokens=df.tokens.str.split('|'))
               .set_index(['id', 'text', 'label'])
               .unstack())
    print(final)

              tokens                        
label            ADD         DEF        NAME
id text                                     
1  "abc"  [Z1 ,  Z2]  [X1 ,  X2]  [Y1 ,  Y2]
2  "xyz"         NaN  [P1 ,  P2]  [M1 ,  M2]

After the edit, use:

    import itertools

    final = (df.assign(tokens=df.tokens.str.split('|'))
               .groupby(['id', 'text', 'label']).tokens
               .apply(lambda x: [*itertools.chain.from_iterable(x)])
               .unstack()
               .reset_index())
print(final)

label  id   text         ADD                   DEF        NAME
0       1  "abc"  [Z1 ,  Z2]            [X1 ,  X2]  [Y1 ,  Y2]
1       2  "xyz"         NaN  [P1 ,  P2, M1 ,  M2]         NaN
  • thanks for your answer, but the data might contain duplicate entries... when I tried it, it threw a ValueError for the duplicate entries. I updated the post with a duplicates example. – xavlock Aug 30 '19 at 20:46
  • what does melt do in this case? does it flatten the indexes? Also, does `*itertools.chain.from_iterable(x)` flatten the data if there are multiple lists? – xavlock Sep 01 '19 at 22:32
  • @akshayuppal my bad, melt is not required here; I updated the answer. And yes, since we group on the tokens, which have duplicate entries, we use [`itertools.chain.from_iterable`](https://docs.python.org/2/library/itertools.html#itertools.chain.from_iterable) to extend the two lists into a single list; the `*` unpacks the generator into a list (see the short demo after these comments). Check [here](https://docs.python.org/3/tutorial/controlflow.html#unpacking-argument-lists). – anky Sep 02 '19 at 05:58
  • is there an alternative to itertools? Can we supply a function that does the same thing, probably a simpler approach? – xavlock Sep 04 '19 at 16:35
  • @akshayuppal there is, but don't you want a faster approach? This is surely the fastest way to flatten a list of lists; check [this](https://stackoverflow.com/a/40813764/9840637) answer. – anky Sep 04 '19 at 16:36
  • thanks for the quick response; I agree it is the fastest approach. One slight change: if there are no duplicates and we use the second groupby logic, do we still need itertools? I tried a simple apply, `df.groupby(['id','text','label']).tokens.apply(lambda x: x).unstack().reset_index()`, but that doesn't seem to work. – xavlock Sep 04 '19 at 16:47
  • @akshayuppal I don't understand; the question you initially asked, with the sample dataframes, has been covered. It would be best to create a fresh question with all the details covered :) – anky Sep 04 '19 at 16:49
  • can we make sure that the values are `[]` and not None? I just added this as EDIT 2. – xavlock Sep 04 '19 at 17:06
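
For reference, a minimal demonstration of the flattening step discussed in the comments above:

    import itertools

    lists = [["P1", "P2"], ["M1", "M2"]]            # token lists from duplicate labels
    flat = [*itertools.chain.from_iterable(lists)]  # -> ['P1', 'P2', 'M1', 'M2']
    print(flat)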