How to pull a list of regex matched references from a column in pandas

Question

I have a dataframe that contains a column with references I'd like to extract to a list.

The references in the rows are mixed with other reference types.

stage, source, references,
open, phone, CAZ101AH.BA6642 2.266.548.55 20-44-255,
open, email, L.22.365.211.3 CAZ125AH.BA0842,
open, phone, 01-02-225 CAZ141AH.BA1342,
open, email, CAP101AH.BA2142,
open, email, 01-56-655,
open, phone, 03-07-555 2.655.585.55
open, email, L.22.365.566.5 CAZ101AH.BA6642,

I'd like to extract a certain type of reference (specifically the ones that start with "CA...")

The regex match for the reference type is '\w+[.]\w{6}'

The closest I've got so far is:

df['references'].str.findall(r'\w+[.]\w{6}').tolist()

which outputs

 ['CAZ101AH.BA6642'],
 ['CAZ125AH.BA0842'],
 ['CAZ141AH.BA1342'],
 ['CAP101AH.BA2142'],
 [],
 [],
 ['CAZ101AH.BA6642'],

How would i extract a list without the brackets, and empty colums for non matched rows etc?

ideally I'd like to generate a list with similar output to .value_counts() with

 CAZ101AH.BA6642 2
 CAZ125AH.BA0842 1
 CAZ141AH.BA1342 1
 CAP101AH.BA2142 1

Would flattening the list, then using `itertools.groupby` be sufficient? — bicarlsen, Feb 17 '21 at 10:28
`[l for x in df['references'].str.findall(r'\w+\.\w{6}').tolist() for l in x]`? — Wiktor Stribiżew, Feb 17 '21 at 10:30
@bicarlsen - I've not been able to flatten the list - would you mind explaining how? the suggest Stack overflow answer (mine has been marked as duplicate) doesnt make sense to me. thanks. — linctuspastilles, Feb 18 '21 at 17:17
@bicarlsen It actually did after I took a break and came back to it. I was able to loop through and append string items to a flattened list then convert to a dataframe for my desired view_counts() output. thanks! — linctuspastilles, Feb 19 '21 at 11:48

How to pull a list of regex matched references from a column in pandas

0 Answers0