
I have a column with annotations of sentences in IOB format. A row looks roughly like this:

data['labels'][0] = "['O', 'O', 'O', 'B-l1', 'O', 'B-l1', 'I-l2', 'I-l2', 'O', 'I-l2']"

I want to get the unique labels: 'O', 'B-l1', and 'I-l2'. The idea is to remove all rows that are not annotated, meaning the only label in the list is 'O'.

This is my current code:

list(set(data['labels'][0]))

But it returns each character as a separate element:

'O',
'B',
'-',
'l',
'1',
'I',
'2',
','

which is not what I am looking for.

I would appreciate some help here. Thanks.

Yana

2 Answers


To filter your rows, you can use set operations:

S = {'O'}

data[[not S.issuperset(l) for l in data['labels']]]

Example input:

import pandas as pd

data = pd.DataFrame({'labels': [['O'], ['O', 'B-l1'], []]})

Output:

      labels
1  [O, B-l1]
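
As a rough illustration of why both the `['O']` row and the empty row are dropped: `S.issuperset(l)` is true whenever every element of `l` is in `S`, which also holds for an empty list. The helper name `is_annotated` below is made up for this sketch and is not part of pandas.

```python
S = {'O'}

def is_annotated(labels):
    # True when the row holds at least one label other than 'O'.
    # An empty list counts as "not annotated", since {'O'} is a superset of it.
    return not S.issuperset(labels)

rows = [['O'], ['O', 'B-l1'], []]
checks = [is_annotated(l) for l in rows]
print(checks)
```

The resulting list of booleans is what gets passed to `data[...]` as a row mask.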

Converting from strings

If you have string representations of lists:

import ast

data['labels'] = [list(set(ast.literal_eval(l))) for l in data['labels']]
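
A minimal end-to-end sketch, assuming the column holds strings such as "['O', 'B-l1']" rather than real lists. The sample rows and the `unique_labels` column name are invented for illustration. It also shows why calling `set` directly on such a string yields single characters.

```python
import ast

import pandas as pd

# Hypothetical sample: each cell is a *string* that looks like a list.
data = pd.DataFrame({'labels': [
    "['O', 'O', 'B-l1', 'I-l2']",
    "['O', 'O', 'O']",
]})

# set() over a string iterates over its characters, not over the labels.
chars = set(data['labels'][1])

# Parse each string into a real Python list first.
parsed = [ast.literal_eval(s) for s in data['labels']]

# Unique labels per row (sorted only to make the order deterministic).
data['unique_labels'] = [sorted(set(l)) for l in parsed]

# Keep only rows that contain something other than 'O'.
annotated = data[[not {'O'}.issuperset(l) for l in parsed]]
print(annotated)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for this kind of parsing.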
mozway
  • I need the unique labels per row. And for some reason, I don't get the unique labels with this code even though I copy-pasted it. – Yana Oct 04 '22 at 10:08
  • I thought you wanted to filter the rows. To get unique values: `data['labels'] = [list(set(l)) for l in data['labels']]`. You can then perform filtering if you want both. – mozway Oct 04 '22 at 10:11
  • Yes, this is what I did but the return result is: `'O', 'B', 'l', '1', '2', '-', ','` And when I get specific row it is returned as a list in single quotes. for example, `data['labels'][0]` returns the list in single quotes like this: `'['O', 'O', 'O', 'B-l1', 'O', 'B-l1', 'I-l2', 'I-l2', 'O', 'I-l2']'` – Yana Oct 04 '22 at 10:14
  • Then your real data is not what you showed. This means that you have a string. You can convert using `ast.literal_eval`. `data['labels'] = [list(set(ast.literal_eval(l))) for l in data['labels']]` – mozway Oct 04 '22 at 10:15
  • lifesaver :) Thank you. I will edit my question, to make it clearer for the other people. Can you add your solution to the answer above? – Yana Oct 04 '22 at 10:19

Another possible solution, based on numpy.unique:

import numpy as np

lst = ['O', 'O', 'O', 'B-l1', 'O', 'B-l1', 'I-l2', 'I-l2', 'O', 'I-l2']

np.unique(lst).tolist()

Output:

['B-l1', 'I-l2', 'O']
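
To apply the same idea to every row, one option is `Series.apply`. This is only a sketch and assumes the column already holds real lists (not strings); the sample frame is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with real lists in each cell.
data = pd.DataFrame({'labels': [['O', 'B-l1', 'O', 'I-l2'], ['O', 'O']]})

# np.unique returns the sorted unique values; tolist() converts back to a plain list.
unique_per_row = data['labels'].apply(lambda l: np.unique(l).tolist())
print(unique_per_row.tolist())
```

Unlike `set`, `np.unique` returns the labels in sorted order.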
PaulS