Extract a part of a string using Regex in Python Pandas

Question

I'm a student working on a data science project and I need to extract part of each string in one column of my dataframe.

I want to extract the part HOT0501005107 from a string like "HOTHOTVIDEOHOT0501005107FilmVidéoClub".

So I wrote this instruction using a regex: `facturation['annotation'] = facturation['annotation'].str.findall(r'([A-Z0-9]{3}\d+)').apply(''.join)`

It extracts everything correctly, except when I have strings like "CTVCANALVODCTV0200052670CTV0200052670". For those it returns CTV0200052670CTV0200052670, but I only want the first occurrence: CTV0200052670.

Can someone help me fix this issue, please? :)

There was a [similar question here](https://stackoverflow.com/questions/2503413/regular-expression-to-stop-at-first-match) — okpython, Feb 18 '21 at 14:25
It’s not related, though, @okpython . The reason for that problem is the regex pattern itself. The reason for this one is the work done on that pattern. — Arya McCarthy, Feb 18 '21 at 14:32
Why do you use `findall` then? Use `extract`, `.str.extract(r'([A-Z0-9]{3}\d+)')` — Wiktor Stribiżew, Feb 18 '21 at 15:07
I already tried using `extract`. It fixes the problem but leads to another one: it only extracts **MFE05** from strings like MFEMETROPOLITAN**MFE05**UH622455AlaskaHD. That's why I used `findall`, because it returns all the matches. :( — IronBorn, Feb 18 '21 at 15:26
What about `str.extract(r'([A-Z]{1,3}\d{3,})')`? Or `str.extract(r'([A-Z]{2,3}\d{3,})')`? — Wiktor Stribiżew, Feb 18 '21 at 19:44
Please clarify your requirements to make the question answerable. — Wiktor Stribiżew, Feb 18 '21 at 20:20
Thanks @WiktorStribiżew I found the answer to my question. :) — IronBorn, Feb 19 '21 at 12:38

score 0 · Answer 1 · answered Feb 18 '21 at 16:00

I think the problem is in the combination of `findall` with `apply` + `join`: the pattern matches twice in your data, and you then join both matches together. `findall` returns a list, and from that list you need only the first item, not all of them.
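
A minimal sketch of that idea, assuming the two sample strings from the question (the small DataFrame here is illustrative, not the asker's real data). It keeps `findall` but takes only the first element of each list with `.str[0]`. Note that, as mentioned in the comments, keeping only the first match would drop the second piece for strings like MFEMETROPOLITANMFE05UH622455AlaskaHD, so this may not cover every case.

```python
import pandas as pd

# Sample strings taken from the question; the DataFrame itself is illustrative.
facturation = pd.DataFrame({
    "annotation": [
        "HOTHOTVIDEOHOT0501005107FilmVidéoClub",
        "CTVCANALVODCTV0200052670CTV0200052670",
    ]
})

pattern = r"([A-Z0-9]{3}\d+)"

# findall returns, for each row, a list containing every match of the pattern.
all_matches = facturation["annotation"].str.findall(pattern)

# .str[0] picks the first element of each list (NaN when a list is empty).
first_match = all_matches.str[0]

print(first_match.tolist())
```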

matt91t

score 0 · Accepted Answer · answered Feb 19 '21 at 12:37

Well, thanks to everyone who helped me :) I found the answer:

facturation['annotation'] = facturation['annotation'].str.findall(r'([A-Z0-9]{3}\d+)').apply(''.join)

facturation['annotation'] = facturation['annotation'].str.extract('(.{0,13})')
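
For readers who want to try this, here is a runnable sketch of the same two steps on the three sample strings that appear in this thread. It seems to rely on the wanted code never being longer than 13 characters, which holds for these samples but is an assumption about the rest of the data. The `samples` Series and the `expand=False` argument (used so that `extract` returns a Series rather than a one-column DataFrame) are additions for illustration.

```python
import pandas as pd

# The three example strings quoted in the question and comments.
samples = pd.Series([
    "HOTHOTVIDEOHOT0501005107FilmVidéoClub",
    "CTVCANALVODCTV0200052670CTV0200052670",
    "MFEMETROPOLITANMFE05UH622455AlaskaHD",
])

pattern = r"([A-Z0-9]{3}\d+)"

# Step 1: collect every match and glue the pieces together,
# e.g. ["MFE05", "UH622455"] becomes "MFE05UH622455".
joined = samples.str.findall(pattern).apply("".join)

# Step 2: keep at most the first 13 characters, which cuts off a repeated code.
# Assumption: a complete code is never longer than 13 characters.
codes = joined.str.extract(r"(.{0,13})", expand=False)

print(codes.tolist())
```

If the codes can vary in length beyond 13 characters, the truncation step would need a different rule.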

IronBorn
