
I have the first lines of all English language Wikipedia articles in a Pandas dataframe and I would like to extract the languages mentioned in brackets into a distinct list.

For example:

text
A cat (Afrikaans: kat, German: katze) is an animal.
This line does not contain anything.
A dog (Afrikaans: hond, German: hund, Some language: dog) is an animal.

I would like a list with ['Afrikaans', 'German', 'Some language'].

I am also not sure how to specify a Unicode-aware regex for something like df.text.str.extract(r'(\w+):').

Does anyone have any ideas on how to do this?

Superdooperhero

1 Answer


Here is my suggestion:

  1. Extract the text in parentheses into a new column.
  2. Use extractall to pull the capitalized words out of that column, grouped into one list per row.
  3. Flatten those lists and keep the distinct values.

Here goes:

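import pandas as pd

text = [
    "A cat (Afrikaans: kat, German: katze) is an animal.",
    "A dog (Afrikaans: hond, German: hund, Some language: dog) is an animal.",
]

df = pd.DataFrame(text, columns=['text'])
# 1. text inside the first pair of parentheses
df['in_parentheses'] = df['text'].str.extract(r"\(([^)]+)\)", expand=False)
# 2. capitalized words, collected into one list per row
df['languages'] = df['in_parentheses'].str.extractall(r"([A-Z]\w+)").groupby(level=0)[0].apply(list)

# 3. flatten the lists and keep the distinct values
set(sum(df['languages'], []))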

This returns:

{'Afrikaans', 'German', 'Some'}
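
Note that the capitalized-word pattern only picks up single words, so 'Some language' comes back as 'Some'. A possible variant is sketched below, assuming a language name is whatever sits between the start of the parentheses (or a comma) and the next colon. The variable names and the pattern are my own illustration and have not been tested against real Wikipedia text. In Python 3, str patterns are Unicode-aware by default, so no extra flag should be needed for non-ASCII names.

```python
import pandas as pd

text = [
    "A cat (Afrikaans: kat, German: katze) is an animal.",
    "This line does not contain anything.",
    "A dog (Afrikaans: hond, German: hund, Some language: dog) is an animal.",
]
df = pd.DataFrame(text, columns=["text"])

# Text inside the first pair of parentheses; NaN for rows without any.
in_parentheses = df["text"].str.extract(r"\(([^)]+)\)", expand=False)

# Assumption: a name is everything between the start (or a comma) and a colon.
# [^,:()]+? allows spaces, so multi-word names are kept whole.
matches = in_parentheses.dropna().str.extractall(r"(?:^|,)\s*([^,:()]+?)\s*:")

# Column 0 holds the single capture group; keep the distinct values.
languages = sorted(matches[0].unique())
print(languages)
# ['Afrikaans', 'German', 'Some language']
```

Dropping the NaN rows before extractall also avoids the sum(..., []) step, which would fail on rows that have no parentheses.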
Ezer K