I have a df like this:

    text        labels
0   2083        [CARDINAL (0.8677)]
1   2085        [CARDINAL (0.5846)]
2   1822        [DATE (0.9581)]
3   DHAKA.      [GPE (0.6306)]
4   BANGLADESH  [GPE (0.6535)]
5   2085        [CARDINAL (0.7502)]
6   Manlkganj   [GPE (0.8888)]
7   Bangladesh  [GPE (0.9916)]

What I want is:

    text                  labels
0   2083, 2085            CARDINAL
1   1822                  DATE
2   DHAKA. BANGLADESH     GPE
3   2085                  CARDINAL
4   Manlkganj Bangladesh  GPE

I want to merge contiguous rows that share the same label into a single row, and drop every row whose label is not in ls = ['GPE', 'ORG', 'CARDINAL'].

I did this in a non-Pythonic way, looping over the df with df.iterrows() and checking df['labels'].str.split('(')[0] in ls. It takes a long time and still does not give the desired result. Is there a more efficient way to do it, using string operations and row manipulation?

The df in to_dict() 'dict' format, to recreate it:

{'text': {0: '2083',
  1: '2085',
  2: '1822',
  3: 'DHAKA.',
  4: 'BANGLADESH',
  5: '2085',
  6: 'Manlkganj',
  7: 'Bangladesh',
  8: 'DHAKA',
  9: 'BANGLADESH'},
 'start_pos': {0: 49,
  1: 54,
  2: 107,
  3: 236,
  4: 243,
  5: 355,
  6: 396,
  7: 414,
  8: 540,
  9: 547},
 'end_pos': {0: 53,
  1: 58,
  2: 111,
  3: 242,
  4: 253,
  5: 359,
  6: 405,
  7: 424,
  8: 545,
  9: 557},
 'labels': {0: ['CARDINAL (0.8677)'],
  1: ['CARDINAL (0.5846)'],
  2: ['DATE (0.9581)'],
  3: ['GPE (0.6306)'],
  4: ['GPE (0.6535)'],
  5: ['CARDINAL (0.7502)'],
  6: ['GPE (0.8888)'],
  7: ['GPE (0.9916)'],
  8: ['GPE (0.5669)'],
  9: ['GPE (0.878)']}}

Thanks in advance.

Strayhorn

1 Answer

Start with

df.labels = df.labels.apply(lambda x: x[0].split()[0])
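
As a rough illustration of that extraction step, here is a minimal sketch on three rows. It assumes each cell of labels is a one-element list of plain strings such as 'CARDINAL (0.8677)'; if the cells hold tagger objects instead, they would need converting with str() first. The names via_apply and via_explode are only for this example.

```python
import pandas as pd

# Assumption: each `labels` cell is a one-element list of strings.
df = pd.DataFrame({
    "text": ["2083", "2085", "1822"],
    "labels": [["CARDINAL (0.8677)"], ["CARDINAL (0.5846)"], ["DATE (0.9581)"]],
})

# Row-wise version, equivalent to the line above:
# take the single list element and keep the word before the score.
via_apply = df["labels"].apply(lambda x: x[0].split()[0])

# Vectorised variant: explode the lists into plain strings,
# then keep the first whitespace-separated token.
via_explode = df.explode("labels")["labels"].str.split().str[0]

print(via_apply.tolist())
print(via_explode.tolist())
```

Both variants should give the same bare label names; the explode route avoids calling a Python lambda for every row.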

Note: on a large DataFrame, df.explode('labels') is likely a better fit than apply.

Then use

# g increases by one each time the label differs from the previous row
g = (df.labels != df.labels.shift()).cumsum().rename('group')
df.groupby(['labels', g]).agg({'text': ', '.join})

Reworked from here: Python/Pandas: Merging Consecutive Rows Only if Matching Columns

And here: Concatenate strings from several rows using Pandas groupby
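
Putting the pieces together, the following is a minimal end-to-end sketch rather than a definitive solution. It assumes the labels column has already been reduced to bare names, keeps the rows in their original order by grouping on the run id alone with sort=False, and applies the ls filter from the question as a final step. The names run, merged and kept are illustrative.

```python
import pandas as pd

# Assumption: labels were already reduced to bare names (see the first step).
df = pd.DataFrame({
    "text": ["2083", "2085", "1822", "DHAKA.", "BANGLADESH",
             "2085", "Manlkganj", "Bangladesh"],
    "labels": ["CARDINAL", "CARDINAL", "DATE", "GPE", "GPE",
               "CARDINAL", "GPE", "GPE"],
})
ls = ["GPE", "ORG", "CARDINAL"]

# A new run id starts whenever the label differs from the previous row.
run = (df["labels"] != df["labels"].shift()).cumsum().rename("run")

# Join the text of each run; sort=False keeps the original row order.
merged = (
    df.groupby(run, sort=False)
      .agg(text=("text", ", ".join), labels=("labels", "first"))
      .reset_index(drop=True)
)

# Filter after merging, so a dropped row still separates its neighbours.
kept = merged[merged["labels"].isin(ls)].reset_index(drop=True)

print(merged)
print(kept)
```

Filtering before the merge would instead join runs that were only separated by a dropped label, so the order of the two steps depends on which result is wanted.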

Joshua Voskamp