0

I'm having trouble extracting the words that are in bold:

**Médoc**, Rouge
2ème Vin, **Margaux**, Rosé
2ème vin, **Pessac-Léognan**, Blanc

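To clarify my question: I'm trying to extract some information from web pages. Each page contains a sentence like the ones above, but I'm only interested in the part in bold. Here are the addresses of the three web pages:

http://www.nicolas.com/page.php/fr/18_409_9829_tourprignacgrandereserve.htm
http://www.nicolas.com/page.php/fr/18_409_8236_relaisdedurfortvivens.htm
http://www.nicolas.com/page.php/fr/18_409_9068_leshautsdesmith.htm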

Any ideas?

xeroxSO

2 Answers

2

You can use a positive lookahead to check whether Rouge, Blanc, or Rosé follows the word we are looking for:

>>> import re
>>> l = [u"Médoc, Rouge", u"2ème Vin, Margaux, Rosé", u"2ème vin, Pessac-Léognan, Blanc"]
>>> for s in l:
...     print re.search(ur'([\w-]+)(?=\W+(Rouge|Blanc|Rosé))', s, re.UNICODE).group(0)
... 
Médoc
Margaux
Pessac-Léognan
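
The session above uses Python 2 syntax (the `print` statement and the `ur''` prefix). Here is a minimal sketch of the same lookahead under Python 3, where `str` patterns are Unicode-aware by default, so `re.UNICODE` is not needed. The helper name `extract_appellation` is made up for illustration, and the inner group is made non-capturing, which does not change what is matched:

```python
import re

# A run of word characters and hyphens that is followed by a colour name.
# The lookahead checks for the colour without consuming it.
PATTERN = re.compile(r'[\w-]+(?=\W+(?:Rouge|Blanc|Rosé))')

def extract_appellation(text):
    """Return the term that precedes the colour, or None if nothing matches."""
    match = PATTERN.search(text)
    return match.group(0) if match else None

samples = [
    "Médoc, Rouge",
    "2ème Vin, Margaux, Rosé",
    "2ème vin, Pessac-Léognan, Blanc",
]
results = [extract_appellation(s) for s in samples]
print(results)
```

Note that the pattern captures a single word (hyphens included), so a name made of several space-separated words would be cut down to its last word.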
alecxe
  • Last output should be `Pessac-Léognan` not `Léognan`. – Ashwini Chaudhary Sep 03 '13 at 13:10
  • @MartijnPieters yeah, I think you caught him. – alecxe Sep 03 '13 at 13:12
  • @alecxe The -1 on your answer was from me (my connection got yanked, so I was not able to comment). +1 now ;) – Ashwini Chaudhary Sep 03 '13 at 13:13
  • @alecxe: You mean the serial downvote I just received? *shrug*, that'll be reverted tonight anyway. I am pretty sure who did that in any case. – Martijn Pieters Sep 03 '13 at 13:14
  • @MartijnPieters yeah, he was very upset because of `eval()`. – alecxe Sep 03 '13 at 13:15
  • @alecxe: if it is him, it's a pity he took it personally. I'll just avoid giving feedback next time, he'll just have to figure out on his own why a post got downvoted. – Martijn Pieters Sep 03 '13 at 13:16
  • @MartijnPieters exactly, being explicit just doesn't work for some people. Actually, I cannot think of a valid reason to downvote your answers. – alecxe Sep 03 '13 at 13:23
  • Thank you for the answer, but I need to clarify my question: I'm trying to extract some information from web pages. Each page contains a sentence of this kind, but I'm only interested in the part in bold. Here are the addresses of the three web pages: - [link](http://www.nicolas.com/page.php/fr/18_409_9829_tourprignacgrandereserve.htm) - [link](http://www.nicolas.com/page.php/fr/18_409_8236_relaisdedurfortvivens.htm) - [link](http://www.nicolas.com/page.php/fr/18_409_9068_leshautsdesmith.htm) – xeroxSO Sep 03 '13 at 13:23
  • @xeroxSO well, that actually changes the question completely. If the data is inside a web page, you need an HTML parser like `BeautifulSoup` or `lxml`; doing it with regex is not a best practice: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags. – alecxe Sep 03 '13 at 13:26
  • Thanks, I think it's fixed with the expression `([\w-]+)(?=\W+(Rouge|Blanc|Rosé))`. – xeroxSO Sep 03 '13 at 13:31
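
To illustrate the parser-based approach suggested in the comments, here is a rough sketch that collects the text inside bold tags. It uses the standard library's `html.parser` as a dependency-free stand-in for `BeautifulSoup` or `lxml`. The class name `BoldTextExtractor` and the sample markup are assumptions made for illustration; the real pages may be structured differently.

```python
from html.parser import HTMLParser

class BoldTextExtractor(HTMLParser):
    """Collect the text found inside <b> and <strong> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0        # number of bold tags currently open
        self.bold_texts = []   # one entry per outermost bold element

    def handle_starttag(self, tag, attrs):
        if tag in ("b", "strong"):
            self._depth += 1
            if self._depth == 1:
                self.bold_texts.append("")  # start a new bold run

    def handle_endtag(self, tag):
        if tag in ("b", "strong") and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.bold_texts[-1] += data

# Hypothetical markup, not taken from the actual pages.
page = "<html><body><h1>2ème vin, <b>Pessac-Léognan</b>, Blanc</h1></body></html>"

parser = BoldTextExtractor()
parser.feed(page)
parser.close()
bold = [text.strip() for text in parser.bold_texts]
print(bold)
```

If the pages mark the emphasis with CSS classes rather than `<b>` or `<strong>`, the tag check would have to be adapted accordingly.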
1

It seems the target is always the second-to-last term in the comma-separated list. If so, you can split the string and select the second-to-last element, for example:

>>> myStr = '2ème vin, Pessac-Léognan, Blanc'
>>> res = myStr.split(', ')[-2]

Otherwise, if you want a regex-only solution, I'd suggest this:

>>> import re
>>> res = re.search(r'([^,]+),[^,]+$', myStr).group(1)

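Then strip any surrounding whitespace if necessary, for example with `.strip()`; the regex version keeps the space that follows the preceding comma.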
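
Both variants can be wrapped with the whitespace handling included. This is a small sketch rather than a definitive implementation; the function names `second_to_last_term` and `second_to_last_term_re` are invented for illustration, and both return `None` when the input has no comma:

```python
import re

def second_to_last_term(text):
    """Split on commas and return the second-to-last term, stripped."""
    parts = [part.strip() for part in text.split(",")]
    return parts[-2] if len(parts) >= 2 else None

def second_to_last_term_re(text):
    """Same result using the regex: capture the field before the last comma."""
    match = re.search(r'([^,]+),[^,]+$', text)
    return match.group(1).strip() if match else None

samples = [
    "Médoc, Rouge",
    "2ème Vin, Margaux, Rosé",
    "2ème vin, Pessac-Léognan, Blanc",
]
by_split = [second_to_last_term(s) for s in samples]
by_regex = [second_to_last_term_re(s) for s in samples]
print(by_split)
print(by_regex)
```

Splitting on `","` and stripping each part avoids depending on exactly one space after every comma, which `split(', ')` assumes.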

Ashwini Chaudhary
Jerry
  • Don't use `str` as a variable name. – Ashwini Chaudhary Sep 03 '13 at 13:10
  • @AshwiniChaudhary Okay, are there perhaps functions having names containing `str`? – Jerry Sep 03 '13 at 13:11
  • `str` is a built-in type in python. – Ashwini Chaudhary Sep 03 '13 at 13:15
  • Thank you for the answer, but I need to clarify my question: I'm trying to extract some information from web pages. Each page contains a sentence of this kind, but I'm only interested in the part in bold. Here are the addresses of the three web pages: - [link](http://www.nicolas.com/page.php/fr/18_409_9829_tourprignacgrandereserve.htm) - [link](http://www.nicolas.com/page.php/fr/18_409_8236_relaisdedurfortvivens.htm) - [link](http://www.nicolas.com/page.php/fr/18_409_9068_leshautsdesmith.htm) – xeroxSO Sep 03 '13 at 13:25
  • @xeroxSO Oh, but that changes everything... You might try this one, which looks for the specific title you're looking for: `res = re.search(r'
    .*?\s([^,]+),[^,]+
    ', myPage).group(1)`.
    – Jerry Sep 03 '13 at 14:39