Pandas - Distinct list of values from Pandas column's regex groups

Question

I have the first lines of all English language Wikipedia articles in a Pandas dataframe and I would like to extract the languages mentioned in brackets into a distinct list.

For example:

text
A cat (Afrikaans: kat, German: katze) is an animal.
This line does not contain anything.
A dog (Afrikaans: hond, German: hund, Some language: dog) is an animal.

I would like a list with ['Afrikaans', 'German', 'Some language'].

I'm also not sure how to specify a Unicode-aware regex for something like df.text.str.extract(r'(\w+):').

Does anyone have any ideas on how to do this?

Do you need to support unicode or do you just need the output you've listed from your input? Your example doesn't make it clear. — Nick Becker, Mar 01 '20 at 20:16
I'm assuming lots of language names would have funny characters in them, so basically grab anything before the `:` including unicode characters and spaces. — Superdooperhero, Mar 01 '20 at 20:18
Python3 supports unicode, so you could be explicit and enumerate the possibilities. Alternatively, you could implement your logic of "grab anything before the colon", after a comma, and between the parentheses — Nick Becker, Mar 01 '20 at 20:23
With regex you normally have to say something like `re.UNICODE`, I'm saying I'm not sure how to do that with Pandas — Superdooperhero, Mar 01 '20 at 20:26
Does the `flags` argument in `Series.str.extract` not work? https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html — Nick Becker, Mar 01 '20 at 20:29

Ezer K · Accepted Answer · 2020-03-02T11:22:58.383

Here is my suggestion:

1. Extract the text in parentheses as a column.
2. extractall capitalized words from the column in 1, grouped to a list.
3. Flatten the lists from the column in 2 and get the distinct values.

Here goes:

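import pandas as pd

text = [
    "A cat (Afrikaans: kat, German: katze) is an animal.",
    "A dog (Afrikaans: hond, German: hund, Some language: dog) is an animal.",
]

df = pd.DataFrame(text, columns=['text'])
# 1. Text inside the first pair of parentheses
df['in_parentheses'] = df['text'].str.extract(r"\(([^)]+)\)", expand=False)
# 2. Capitalized words, collected into one list per row
df['languages'] = df['in_parentheses'].str.extractall(r"([A-Z]\w+)").groupby(level=0)[0].apply(list)

# 3. Flatten the per-row lists and keep the distinct values
set(sum(df['languages'], []))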

got:

{'Afrikaans', 'German', 'Some'}
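
The pattern above only keeps single capitalized words, which is why 'Some language' comes back as 'Some'. One possible variation, offered as a sketch rather than something tested on the full Wikipedia data, is to capture everything between the start of the parenthesized text (or a comma) and the next colon. It assumes the entries are comma-separated and that the language name always comes right before the colon. The fourth row is made up for illustration, to show a name with a non-ASCII character and a space. In Python 3, str patterns are Unicode-aware by default, so no re.UNICODE flag should be needed here.

```python
import pandas as pd

text = [
    "A cat (Afrikaans: kat, German: katze) is an animal.",
    "This line does not contain anything.",
    "A dog (Afrikaans: hond, German: hund, Some language: dog) is an animal.",
    "Norway (Norwegian Bokmål: Norge) is a country.",  # made-up row, illustration only
]
df = pd.DataFrame(text, columns=["text"])

# Text inside the first pair of parentheses; NaN for rows without any.
in_parentheses = df["text"].str.extract(r"\(([^)]+)\)", expand=False)

# Anything that starts the string or follows a comma, up to the next colon.
# [^,:] also matches spaces and non-ASCII letters, so multi-word names survive.
names = in_parentheses.str.extractall(r"(?:^|,)\s*([^,:]+?)\s*:")[0]

# Distinct values as a sorted list; rows that were NaN contribute no matches.
languages = sorted(names.unique())
print(languages)
# ['Afrikaans', 'German', 'Norwegian Bokmål', 'Some language']
```

Because extractall already returns one row per match, there is no need to build per-row lists and flatten them afterwards.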
