0

I have a dataframe that contains a column with references I'd like to extract to a list.

The references in the rows are mixed with other reference types.

stage, source, references,
open, phone, CAZ101AH.BA6642 2.266.548.55 20-44-255,
open, email, L.22.365.211.3 CAZ125AH.BA0842,
open, phone, 01-02-225 CAZ141AH.BA1342,
open, email, CAP101AH.BA2142,
open, email, 01-56-655,
open, phone, 03-07-555 2.655.585.55
open, email, L.22.365.566.5 CAZ101AH.BA6642,

I'd like to extract a certain type of reference (specifically the ones that start with "CA...")

The regex match for the reference type is '\w+[.]\w{6}'

The closest I've got so far is:

df['references'].str.findall(r'\w+[.]\w{6}').tolist()

which outputs

 ['CAZ101AH.BA6642'],
 ['CAZ125AH.BA0842'],
 ['CAZ141AH.BA1342'],
 ['CAP101AH.BA2142'],
 [],
 [],
 ['CAZ101AH.BA6642'],

How would i extract a list without the brackets, and empty colums for non matched rows etc?

ideally I'd like to generate a list with similar output to .value_counts() with

 CAZ101AH.BA6642 2
 CAZ125AH.BA0842 1
 CAZ141AH.BA1342 1
 CAP101AH.BA2142 1
 
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563

0 Answers0