Merge the rows of a df based on contiguous values of a particular column Pandas

Question

I have a df like this:

    text        labels
0   2083        [CARDINAL (0.8677)]
1   2085        [CARDINAL (0.5846)]
2   1822        [DATE (0.9581)]
3   DHAKA.      [GPE (0.6306)]
4   BANGLADESH  [GPE (0.6535)]
5   2085        [CARDINAL (0.7502)]
6   Manlkganj   [GPE (0.8888)]
7   Bangladesh  [GPE (0.9916)]

What I want is:

    text                    labels
0   2083, 2085              CARDINAL
1   1822                    DATE
2   DHAKA. BANGLADESH       GPE
3   2085                    CARDINAL
4   Manlkganj Bangladesh    GPE

Merge each run of contiguous rows that share the same label into a single row, and drop every row whose label is not in ls = ['GPE', 'ORG', 'CARDINAL'].

I have done it in a non-Pythonic way: looping over the df with df.iterrows() and checking whether df['labels'].str.split('(')[0] is in ls. This takes a lot of time and does not produce the desired result either. Is there a more efficient way to do this with string operations and row manipulation?

The df in to_dict('dict') format, to recreate it:

{'text': {0: '2083',
  1: '2085',
  2: '1822',
  3: 'DHAKA.',
  4: 'BANGLADESH',
  5: '2085',
  6: 'Manlkganj',
  7: 'Bangladesh',
  8: 'DHAKA',
  9: 'BANGLADESH'},
 'start_pos': {0: 49,
  1: 54,
  2: 107,
  3: 236,
  4: 243,
  5: 355,
  6: 396,
  7: 414,
  8: 540,
  9: 547},
 'end_pos': {0: 53,
  1: 58,
  2: 111,
  3: 242,
  4: 253,
  5: 359,
  6: 405,
  7: 424,
  8: 545,
  9: 557},
 'labels': {0: [CARDINAL (0.8677)],
  1: [CARDINAL (0.5846)],
  2: [DATE (0.9581)],
  3: [GPE (0.6306)],
  4: [GPE (0.6535)],
  5: [CARDINAL (0.7502)],
  6: [GPE (0.8888)],
  7: [GPE (0.9916)],
  8: [GPE (0.5669)],
  9: [GPE (0.878)]}}

Thanks in advance.

You could start with `df.labels = df.labels.apply(lambda x: x.split(' (')[0][1:])` to transform the labels column. — Joshua Voskamp, Oct 27 '21 at 16:53
Your `labels` column seems to contain lists of (one) object. What are `CARDINAL/DATE/GPE`? — Quang Hoang, Oct 27 '21 at 17:00

Joshua Voskamp · Answer 1 · 2021-10-27T17:40:58.157

Start with

df.labels = df.labels.apply(lambda x: x[0].split()[0])

Note: if you have a large DataFrame, you would be better off figuring out how to use df.explode('labels') instead.

Then use

g = (df.labels != df.labels.shift()).cumsum().rename('group')
df.groupby(['labels',g]).agg({'text': ', '.join})

Reworked from here: Python/Pandas: Merging Consecutive Rows Only if Matching Columns

And here: Concatenate strings from several rows using Pandas groupby
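
For reference, the shift/cumsum idea can be sketched end to end as below. This is only a sketch under stated assumptions: each cell of labels is taken to be a one-element list whose item prints like "GPE (0.6306)" (plain strings stand in for those objects here), the ", " separator is an arbitrary choice, and the names tag, run_id, merged and filtered are illustrative. Grouping on the run id alone with sort=False keeps the rows in their original order.

```python
import pandas as pd

# Assumption: plain strings stand in for the label objects from the question.
df = pd.DataFrame({
    "text": ["2083", "2085", "1822", "DHAKA.", "BANGLADESH",
             "2085", "Manlkganj", "Bangladesh"],
    "labels": [["CARDINAL (0.8677)"], ["CARDINAL (0.5846)"], ["DATE (0.9581)"],
               ["GPE (0.6306)"], ["GPE (0.6535)"], ["CARDINAL (0.7502)"],
               ["GPE (0.8888)"], ["GPE (0.9916)"]],
})

# Take the single list item, then keep only the tag name ("GPE (0.6306)" -> "GPE").
tag = df["labels"].str[0].astype(str).str.split().str[0]

# A new run starts wherever the tag differs from the previous row's tag.
run_id = (tag != tag.shift()).cumsum().rename("run")

# Collapse each run into one row, preserving the original row order.
merged = (
    df.assign(labels=tag)
      .groupby(run_id, sort=False)
      .agg(text=("text", ", ".join), labels=("labels", "first"))
      .reset_index(drop=True)
)

# Whitelist taken from the question; applied after merging so that dropped
# rows do not make otherwise separate runs look contiguous.
ls = ["GPE", "ORG", "CARDINAL"]
filtered = merged[merged["labels"].isin(ls)].reset_index(drop=True)
```

Note that ls as written in the question does not contain DATE, so the filtered frame drops the 1822 row even though the desired output in the question shows it; add 'DATE' to ls if that row should stay.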

edited Oct 27 '21 at 17:40

answered Oct 27 '21 at 16:55


This comes up with: AttributeError: 'list' object has no attribute 'split' – Strayhorn Oct 27 '21 at 16:58
What is the `type` in the labels column? Is it a `list`, or a `string`? – Joshua Voskamp Oct 27 '21 at 16:59
pandas.core.series.Series – Strayhorn Oct 27 '21 at 17:00
It is not producing the end dataframe which I'm trying to construct. Thank you for your help. – Strayhorn Oct 28 '21 at 07:22
