I'm a student working on a data science project, and I need to extract part of the strings in one column of my dataframe.

I want to extract the part HOTHOTVIDEO from a string like "HOTHOTVIDEOHOT0501005107FilmVidéoClub".

So I wrote this statement using a regex: `facturation['annotation']=facturation['annotation'].str.findall('([A-Z0-9]{3}\d+)').apply(''.join)`

It extracts everything correctly, except when I have strings like "CTVCANALVODCTV0200052670CTV0200052670": it returns CTV0200052670CTV0200052670, but I only want the first occurrence, i.e. CTV0200052670.

Can someone help me fix this issue, please? :)

IronBorn
  • There was a [similar question here](https://stackoverflow.com/questions/2503413/regular-expression-to-stop-at-first-match) – okpython Feb 18 '21 at 14:25
  • It’s not related, though, @okpython. The reason for that problem is the regex pattern itself. The reason for this one is the work done on that pattern. – Arya McCarthy Feb 18 '21 at 14:32
  • Why do you use `findall` then? Use `extract`, `.str.extract(r'([A-Z0-9]{3}\d+)')` – Wiktor Stribiżew Feb 18 '21 at 15:07
  • I already tried using `extract`. It fixes the problem, but it leads to another one: it can only extract **MFE05** from strings like MFEMETROPOLITAN**MFE05**UH622455AlaskaHD. That's why I used `findall`, because it returns all the matches. :( – IronBorn Feb 18 '21 at 15:26
  • What about `str.extract(r'([A-Z]{1,3}\d{3,})')`? Or `str.extract(r'([A-Z]{2,3}\d{3,})')`? – Wiktor Stribiżew Feb 18 '21 at 19:44
  • Please clarify your requirements to make the question answerable. – Wiktor Stribiżew Feb 18 '21 at 20:20
  • Thanks @WiktorStribiżew I found the answer to my question. :) – IronBorn Feb 19 '21 at 12:38

2 Answers

I think the problem is the combination of `findall` with `apply(''.join)`: the pattern matches twice in your data, and you then join both matches. `findall` returns a list, and you only need its first item, not all of them.
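
A minimal sketch of this point, using the standard-library `re.findall` (pandas documents `Series.str.findall` as equivalent to applying `re.findall` to each element). The helper name `find_codes` is made up for illustration; the sample strings are the ones from the question and its comments. Note that taking only the first item reproduces the `MFE05` limitation mentioned in the comments, so it may not be enough on its own.

```python
import re

# Same pattern as in the question, written as a raw string.
PATTERN = r'[A-Z0-9]{3}\d+'


def find_codes(text):
    """Hypothetical helper: return every match of PATTERN in text as a list."""
    return re.findall(PATTERN, text)


matches = find_codes("CTVCANALVODCTV0200052670CTV0200052670")
print(matches)           # ['CTV0200052670', 'CTV0200052670']
print(''.join(matches))  # CTV0200052670CTV0200052670 (the unwanted result)
print(matches[0])        # CTV0200052670 (first match only)

# Taking only the first match loses the second part of this string.
print(find_codes("MFEMETROPOLITANMFE05UH622455AlaskaHD"))  # ['MFE05', 'UH622455']
```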

matt91t

Well, thanks to everyone who helped me :) I found the answer:

facturation['annotation'] = facturation['annotation'].str.findall('([A-Z0-9]{3}\d+)').apply(''.join)

facturation['annotation'] = facturation['annotation'].str.extract('(.{0,13})')
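
Here is a small runnable sketch of why this two-step approach works, under the assumption (inferred only from the sample strings in this thread) that the wanted code is always at most 13 characters long. The toy dataframe and the `code` column name are made up for illustration. The first step joins all matches, the second keeps the first 13 characters; `.str[:13]` is a plain slice that should give the same result as `.str.extract('(.{0,13})')`.

```python
import pandas as pd

# Toy data built from the strings quoted in the question and comments.
facturation = pd.DataFrame({
    "annotation": [
        "HOTHOTVIDEOHOT0501005107FilmVidéoClub",
        "CTVCANALVODCTV0200052670CTV0200052670",
        "MFEMETROPOLITANMFE05UH622455AlaskaHD",
    ]
})

PATTERN = r'([A-Z0-9]{3}\d+)'

# Step 1: collect every match per row and glue the pieces together.
joined = facturation["annotation"].str.findall(PATTERN).apply("".join)

# Step 2: keep only the first 13 characters (assumed maximum code length).
facturation["code"] = joined.str[:13]  # 'code' is a hypothetical column name

# The regex-based truncation from the answer, returned as a Series.
same = joined.str.extract(r'(.{0,13})', expand=False)

print(facturation["code"].tolist())
# ['HOT0501005107', 'CTV0200052670', 'MFE05UH622455']
```

If codes can be longer than 13 characters, or if the column contains missing values, this sketch would need adjusting.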

IronBorn