I have the following dataframe:

dict_df = {'sent_id': {11: 3,
  12: 3,
  24: 7,
  25: 7,
  26: 7,
  27: 7,
  28: 7,
  29: 8,
  124: 15,
  125: 15,
  126: 15,
  133: 15,
  134: 15,
  135: 15,
  357: 26,
  358: 26,
  359: 26},
 'entity': {11: 'Zhao',
  12: 'Li',
  24: 'Beijing',
  25: 'PKU',
  26: 'Chinafront',
  27: 'Technology',
  28: 'Co.,',
  29: 'Ltd.',
  124: 'January',
  125: '1,',
  126: '2006',
  133: 'December',
  134: '31,',
  135: '2006.',
  357: 'RMB',
  358: '37,560',
  359: 'Yuan'},
 'label': {11: 'B-lessor',
  12: 'I-lessor',
  24: 'B-lessee',
  25: 'I-lessee',
  26: 'I-lessee',
  27: 'I-lessee',
  28: 'I-lessee',
  29: 'I-lessee',
  124: 'B-start_date',
  125: 'I-start_date',
  126: 'I-start_date',
  133: 'B-end_date',
  134: 'I-end_date',
  135: 'I-end_date',
  357: 'B-lease_payment',
  358: 'I-lease_payment',
  359: 'I-lease_payment'}}

I want to concatenate the entities back into full sentences, grouped by sent_id and joined with a " " separator.

I have tried with:

import pandas as pd

df = pd.DataFrame(dict_df)

df.groupby("sent_id").agg(" ".join)

but I need the aggregated entity for a sentence to look like:

January 1, 2006

What should I change to get the separator right? Or is there a simpler way to do this that also keeps only the unique values in the label column?

SteveS

1 Answer

Building on the answers to "Concatenate strings from several rows using Pandas groupby", you can do:

' '.join(df.groupby('sent_id')['entity'].transform(lambda row: ' '.join(row)).drop_duplicates())

which will give you:

Zhao Li Beijing PKU Chinafront Technology Co., Ltd. January 1, 2006 December 31, 2006. RMB 37,560 Yuan
zabop
  • I am not sure if this is what you want, could you point out what I misunderstand about your question, if I do miss something? – zabop Oct 17 '20 at 18:59
  • Thanks for your answer, dear @zabop! Unfortunately, this is not exactly what I want. I need my final dataframe to look like: ```sent_id entity label 1 January 1, 2006 here are the unique labels``` – SteveS Oct 17 '20 at 19:22
  • I mean I need to aggregate by id and concatenate the entities into one sentence and collect the labels into unique set. I have solved it with 2 steps: ```from nltk.tokenize.treebank import TreebankWordDetokenizer test_df[test_df.label != "O"].loc[:,["sent_id", "entity"]].groupby(["sent_id"]).agg(TreebankWordDetokenizer().detokenize) test_df[test_df.label != "O"].loc[:,["sent_id", "entity"]].groupby(["sent_id"]).agg(TreebankWordDetokenizer().detokenize)``` Then join by id. – SteveS Oct 17 '20 at 19:24
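
The result described in the comments (one row per sent_id, entities joined with a space, labels reduced to their unique values) might also be reachable in a single groupby using pandas named aggregation. The following is only a sketch under some assumptions: it uses a small subset of the question's data, it joins with a plain " " rather than the TreebankWordDetokenizer mentioned above, and it stores the unique labels as a comma-separated string, which is an illustrative choice and not something the question specifies.

```python
import pandas as pd

# Small subset of the question's data (sent_id 3 and 15 only), for illustration.
df = pd.DataFrame({
    "sent_id": [3, 3, 15, 15, 15, 15, 15, 15],
    "entity": ["Zhao", "Li", "January", "1,", "2006",
               "December", "31,", "2006."],
    "label": ["B-lessor", "I-lessor",
              "B-start_date", "I-start_date", "I-start_date",
              "B-end_date", "I-end_date", "I-end_date"],
})


def unique_labels(labels):
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return ", ".join(dict.fromkeys(labels))


result = (
    df.groupby("sent_id")
      .agg(
          entity=("entity", " ".join),      # tokens back into one string
          label=("label", unique_labels),   # unique labels per sentence
      )
      .reset_index()
)

print(result)
```

Because both columns are aggregated in the same call, no second groupby or join by id is needed. If the rows labelled "O" should be excluded, as in the comment above, the frame can be filtered with `df[df.label != "O"]` before grouping.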