I have the first lines of all English-language Wikipedia articles in a pandas DataFrame, and I would like to extract the languages mentioned in parentheses into a list of distinct values.
For example:
text
A cat (Afrikaans: kat, German: katze) is an animal.
This line does not contain anything.
A dog (Afrikaans: hond, German: hund, Some language: dog) is an animal.
I would like to end up with the list ['Afrikaans', 'German', 'Some language'].
I'm also not sure how to specify a Unicode-aware regex for something like df.text.str.extract(r'(\w+):').
Does anyone have any ideas on how to do this?
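
For reference, here is a minimal sketch of one possible approach, not a definitive solution. It assumes Python 3 and a pandas version that provides Series.explode (0.25 or later), and that every language name sits directly before a colon inside a parenthesised group. The variable names groups, names and languages are made up for illustration. In Python 3, \w in a str pattern already matches Unicode word characters, but it does not match spaces, so r'(\w+):' would return only "language" for "Some language"; the sketch uses a character class that excludes commas and colons instead.

```python
import pandas as pd

# Sample data copied from the example above.
df = pd.DataFrame({
    "text": [
        "A cat (Afrikaans: kat, German: katze) is an animal.",
        "This line does not contain anything.",
        "A dog (Afrikaans: hond, German: hund, Some language: dog) is an animal.",
    ]
})

# Step 1: pull out the contents of every (...) group, one row per group.
# Rows without parentheses yield an empty list, which explode() turns
# into NaN, so they are dropped here.
groups = df["text"].str.findall(r"\(([^()]*)\)").explode().dropna()

# Step 2: inside each group, capture whatever precedes a colon.
# [^,:]+ permits spaces, so multi-word names such as "Some language" survive.
names = groups.str.findall(r"([^,:]+):").explode().dropna().str.strip()

# Step 3: keep each name once, in order of first appearance.
languages = names.drop_duplicates().tolist()
print(languages)
```

Restricting the second pattern to text found inside parentheses is a deliberate choice: it avoids picking up colons that appear elsewhere in the sentence, at the cost of missing any language label that is not parenthesised.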