The right regular expression in Python

Question

I am having trouble extracting the words that are in bold:

Médoc, Rouge
2ème Vin, Margaux, Rosé
2ème vin, Pessac-Léognan, Blanc

To clarify my question: I'm trying to extract some information from web pages. Each page contains a line like the ones above, but I'm only interested in the part shown in bold. Here are the addresses of the three web pages:

(http://www.nicolas.com/page.php/fr/18_409_9829_tourprignacgrandereserve.htm)
(http://www.nicolas.com/page.php/fr/18_409_8236_relaisdedurfortvivens.htm)
(http://www.nicolas.com/page.php/fr/18_409_9068_leshautsdesmith.htm)

re(r'\s*\w+-\w+-\w+|\w+-\w+|\w+[^Rouge,Blanc,Rosé]')

Any ideas?

What's the criteria here? – Ashwini Chaudhary Sep 03 '13 at 12:55

alecxe · Answer 1 · 2013-09-03T13:11:25.477

2

You can use a positive lookahead to check whether `Rouge`, `Blanc`, or `Rosé` follows the word we are looking for:

>>> import re
>>> l = [u"Médoc, Rouge", u"2ème Vin, Margaux, Rosé", u"2ème vin, Pessac-Léognan, Blanc"]
>>> for s in l:
...     print re.search(ur'([\w-]+)(?=\W+(Rouge|Blanc|Rosé))', s, re.UNICODE).group(0)
... 
Médoc
Margaux
Pessac-Léognan
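
The session above uses Python 2 syntax (the `print` statement and `ur''` literals). What follows is a minimal Python 3 sketch of the same lookahead, assuming the colour is always one of `Rouge`, `Blanc`, or `Rosé`. The names `COLOUR_AHEAD` and `extract_appellation` are illustrative and not part of the original answer.

```python
import re

# A run of word characters and hyphens that is followed, after one or more
# non-word characters, by one of the three colours. The lookahead does not
# consume the colour, so group(0) is only the word itself.
COLOUR_AHEAD = re.compile(r'[\w-]+(?=\W+(?:Rouge|Blanc|Rosé))')

def extract_appellation(text):
    """Return the word that precedes the colour, or None if nothing matches."""
    match = COLOUR_AHEAD.search(text)
    return match.group(0) if match else None

lines = ["Médoc, Rouge", "2ème Vin, Margaux, Rosé", "2ème vin, Pessac-Léognan, Blanc"]
results = [extract_appellation(line) for line in lines]
print(results)
```

In Python 3, `str` patterns are Unicode-aware by default, so `re.UNICODE` is not needed for `\w` to match `é`. Note that a name containing spaces would only yield its last word with this pattern.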

edited Sep 03 '13 at 13:11

answered Sep 03 '13 at 13:01

alecxe


Last output should be `Pessac-Léognan` not `Léognan`. – Ashwini Chaudhary Sep 03 '13 at 13:10
@MartijnPieters yeah, I think you caught him. – alecxe Sep 03 '13 at 13:12
@alecxe The -1 on your answer was from me (my connection got yanked, so I was not able to comment). +1 now ;) – Ashwini Chaudhary Sep 03 '13 at 13:13
@alecxe: You mean the serial downvote I just received? *shrug*, that'll be reverted tonight anyway. I am pretty sure who did that in any case. – Martijn Pieters Sep 03 '13 at 13:14
@MartijnPieters yeah, he was very upset because of `eval()`. – alecxe Sep 03 '13 at 13:15
@alecxe: if it is him, it's a pity he took it personally. I'll just avoid giving feedback next time, he'll just have to figure out on his own why a post got downvoted. – Martijn Pieters Sep 03 '13 at 13:16
@MartijnPieters exactly, being explicit just doesn't work for some people. I actually cannot think of a valid reason to downvote your answers. – alecxe Sep 03 '13 at 13:23
Thank you for the answer, but I need to clarify my question: I'm trying to extract some information from web pages. Each page contains a line like these, but I'm only interested in the part shown in bold. Here are the addresses of the three web pages: - [link](http://www.nicolas.com/page.php/fr/18_409_9829_tourprignacgrandereserve.htm) - [link](http://www.nicolas.com/page.php/fr/18_409_8236_relaisdedurfortvivens.htm) - [link](http://www.nicolas.com/page.php/fr/18_409_9068_leshautsdesmith.htm) – xeroxSO Sep 03 '13 at 13:23
@xeroxSO well, that actually changes the question completely. If the data is inside a web page, you need an HTML parser like `BeautifulSoup` or `lxml`; doing it with regex is not best practice: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags. – alecxe Sep 03 '13 at 13:26
Thanks, I think it's fixed with the expression `([\w-]+)(?=\W+(Rouge|Blanc|Rosé))`. – xeroxSO Sep 03 '13 at 13:31
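
As a rough sketch of the parser-first route suggested in the comments above, the example below extracts text with an HTML parser and only then applies the lookahead regex. It uses the standard-library `html.parser` module rather than `BeautifulSoup` or `lxml`, purely to stay self-contained. The markup, the choice of `<h2>` as the target tag, and the class name `HeadingCollector` are assumptions for illustration; the real pages may be structured quite differently.

```python
import re
from html.parser import HTMLParser

class HeadingCollector(HTMLParser):
    """Collect the text of every <h2> element (hypothetical target tag)."""

    def __init__(self):
        super().__init__()
        self._in_target = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h2':
            self._in_target = True
            self.texts.append('')

    def handle_endtag(self, tag):
        if tag == 'h2':
            self._in_target = False

    def handle_data(self, data):
        # Only accumulate text while inside the target element.
        if self._in_target:
            self.texts[-1] += data

# Invented markup for illustration only.
page = ('<html><body><h1>Example wine</h1>'
        '<h2>2ème vin, Pessac-Léognan, Blanc</h2></body></html>')

collector = HeadingCollector()
collector.feed(page)

# Same lookahead as in the answer, applied to the extracted text only.
pattern = re.compile(r'[\w-]+(?=\W+(?:Rouge|Blanc|Rosé))')
found = []
for text in collector.texts:
    match = pattern.search(text)
    if match:
        found.append(match.group(0))
print(found)
```

Separating the two steps means the regex never has to deal with tags or attributes, only with the plain text of the element of interest.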

score 1 · Answer 2 · edited Sep 03 '13 at 13:20

1

It seems the target is always the second-to-last term in the comma-separated list. If so, you can split the string and select the second-to-last element, for example:

>>> myStr = '2ème vin, Pessac-Léognan, Blanc'
>>> res = myStr.split(', ')[-2]

Otherwise, if you want a regex-only solution, I'd suggest this:

>>> import re
>>> res = re.search(r'([^,]+),[^,]+$', myStr).group(1)

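Then strip any surrounding whitespace if necessary (for example with `res.strip()`), since the captured group can start with a space.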
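
Both approaches can be put side by side in a short Python 3 sketch with the whitespace stripping included. The function names are illustrative, and the `None` return for input without a comma is an added assumption, not something the answer specifies.

```python
import re

def second_to_last_by_split(text):
    """Split on commas and return the second-to-last field, stripped."""
    fields = [field.strip() for field in text.split(',')]
    return fields[-2] if len(fields) >= 2 else None

def second_to_last_by_regex(text):
    """Capture the field before the final comma, then strip it."""
    match = re.search(r'([^,]+),[^,]+$', text)
    return match.group(1).strip() if match else None

sample = '2ème vin, Pessac-Léognan, Blanc'
print(second_to_last_by_split(sample))
print(second_to_last_by_regex(sample))
```

Splitting on `','` and stripping each field avoids depending on exactly one space after every comma, which `split(', ')` would require.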

edited Sep 03 '13 at 13:20

Ashwini Chaudhary


answered Sep 03 '13 at 12:56

Jerry


Don't use `str` as a variable name. – Ashwini Chaudhary Sep 03 '13 at 13:10
@AshwiniChaudhary Okay, are there perhaps functions having names containing `str`? – Jerry Sep 03 '13 at 13:11
`str` is a built-in type in Python. – Ashwini Chaudhary Sep 03 '13 at 13:15
Thank you for the answer, but I need to clarify my question: I'm trying to extract some information from web pages. Each page contains a line like these, but I'm only interested in the part shown in bold. Here are the addresses of the three web pages: - [link](http://www.nicolas.com/page.php/fr/18_409_9829_tourprignacgrandereserve.htm) - [link](http://www.nicolas.com/page.php/fr/18_409_8236_relaisdedurfortvivens.htm) - [link](http://www.nicolas.com/page.php/fr/18_409_9068_leshautsdesmith.htm) – xeroxSO Sep 03 '13 at 13:25
@xeroxSO Oh, but that changes everything... You might try this one, which looks for the specific title you're looking for: `res = re.search(r'
.*?\s([^,]+),[^,]+
', myPage).group(1)`. – Jerry Sep 03 '13 at 14:39
