I am trying to deduplicate a DataFrame and, while doing so, perform an operation on each group of rows by iterating over them.
I group the rows with the pandas groupby function and then turn each label value into its own column. The value for each label comes from the tokens field, obtained by splitting the string on "|". This works, but performance on a large DataFrame is quite slow.
Iterating over the grouped rows in a for loop runs at about 200 it/s, which doesn't scale to large data. Is there a way to do this faster?
I have tried iterating over the groupby values, which is slow, and I also tried np.vectorize, but found that it essentially loops over the data as well.
For example, here is some dummy data together with the expected result:

    categories = ["DEF", "NAME", "ADD"]

Input:

    id  text   label  tokens
    1   "abc"  DEF    X1 | X2
    1   "abc"  NAME   Y1 | Y2
    1   "abc"  ADD    Z1 | Z2
    2   "xyz"  DEF    P1 | P2
    2   "xyz"  NAME   M1 | M2

Expected output:

    id  text   DEF       NAME      ADD
    1   "abc"  [X1, X2]  [Y1, Y2]  [Z1, Z2]
    2   "xyz"  [P1, P2]  [M1, M2]  []
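For anyone who wants to reproduce this, the dummy input above can be built as follows. This is a minimal sketch: the column names are taken from the table, the categories are written as three separate strings, and the exact spacing around "|" is an assumption.

```python
import pandas as pd

# Dummy input from the table above: one row per (id, text, label).
df = pd.DataFrame(
    {
        "id": [1, 1, 1, 2, 2],
        "text": ["abc", "abc", "abc", "xyz", "xyz"],
        "label": ["DEF", "NAME", "ADD", "DEF", "NAME"],
        "tokens": ["X1 | X2", "Y1 | Y2", "Z1 | Z2", "P1 | P2", "M1 | M2"],
    }
)

# One output column is expected per category.
categories = ["DEF", "NAME", "ADD"]

print(df)
```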
"Code for deduplicating and mapping to columns"
    from typing import List

    import pandas as pd
    from tqdm import tqdm


    def deduplicate_data(
        df: pd.DataFrame,
        categories: List[str],
        category_column: str,
        token_column: str,
    ) -> pd.DataFrame:
        # Output columns: id, text, then one column per category.
        new_columns = ["id", "text"] + list(categories)
        acc = []
        # groupby needs the keys as a list; .groups maps each key to its row labels.
        groups = df.groupby(["id", "text"]).groups
        for (item_id, div_text), rows_idx in tqdm(groups.items(), total=len(groups)):
            # Select the grouped rows (unique labels, as the original set() did).
            rows = df.loc[rows_idx.unique()]
            # categories_to_list is defined elsewhere (not shown here).
            rows = categories_to_list(rows, categories, category_column, token_column)
            rows.insert(0, div_text)
            rows.insert(0, item_id)
            acc.append(rows)
        dataset = pd.DataFrame(acc, columns=new_columns)
        return dataset
The categories_to_list function converts the selected tokens for each label into a list. I have included only the main function for simplicity.
I would like this to run considerably faster.
EDIT: The data might contain duplicate entries for the same combination of id, text and label. In that case the tokens should end up in a single list, for example:

    categories = ["DEF", "NAME", "ADD"]

Input:

    id  text   label  tokens
    1   "abc"  DEF    X1 | X2
    1   "abc"  NAME   Y1 | Y2
    1   "abc"  ADD    Z1 | Z2
    2   "xyz"  DEF    P1 | P2
    2   "xyz"  DEF    M1 | M2

Expected output:

    id  text   DEF               NAME      ADD
    1   "abc"  [X1, X2]          [Y1, Y2]  [Z1, Z2]
    2   "xyz"  [P1, P2, M1, M2]  []        []
### EDIT 2
The output needs to contain [] rather than None for mapped fields that have no tokens.
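
To make the requirements concrete, one loop-free direction that fits the expected output is to split and explode the tokens, then group and unstack. This is only a sketch under assumptions: the function name pivot_tokens and its default arguments are made up for illustration, tokens are assumed to be non-null strings separated by "|", and it has not been benchmarked against the loop version.

```python
import pandas as pd


def pivot_tokens(df, categories, category_column="label", token_column="tokens"):
    # Hypothetical helper: split "X1 | X2" into one row per token, then strip spaces.
    long = df.assign(**{token_column: df[token_column].str.split("|")})
    long = long.explode(token_column, ignore_index=True)
    long[token_column] = long[token_column].str.strip()

    # Collect tokens per (id, text, label); duplicate labels are merged here.
    wide = (
        long.groupby(["id", "text", category_column])[token_column]
        .agg(list)
        .unstack(category_column)
        .reindex(columns=categories)
    )

    # Labels missing for a group come out as NaN after unstack; use [] instead.
    wide = wide.apply(lambda col: col.map(lambda v: v if isinstance(v, list) else []))
    wide.columns.name = None
    return wide.reset_index()


# Example input with a duplicated (id, text, label) combination.
df = pd.DataFrame(
    {
        "id": [1, 1, 1, 2, 2],
        "text": ["abc", "abc", "abc", "xyz", "xyz"],
        "label": ["DEF", "NAME", "ADD", "DEF", "DEF"],
        "tokens": ["X1 | X2", "Y1 | Y2", "Z1 | Z2", "P1 | P2", "M1 | M2"],
    }
)
result = pivot_tokens(df, ["DEF", "NAME", "ADD"])
print(result)
```

For id 2 this gives DEF = [P1, P2, M1, M2] and [] for both NAME and ADD, so no None or NaN values remain in the mapped columns.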