I'm supposed to extract groups of text (name, rank, and so on) for each entry in a file that contains a top-ten list. You can see the file and the regex here: https://regex101.com/r/fXK5YV/1. The pattern works there, and you can see the capturing groups.

import re

pattern = '''
    (?P<list><li\sclass="regular-search-result">(.|\n)*?(?<=\<span class=\"indexed-biz-name\"\>)
    (?P<rank>\d{1,2})
    (.|\n)*?\<span\>
    (?P<name>.+)
    \<\/span\>(.|\n)*?alt=\"
    (?P<stars>\d\.\d)
    \sstar\srating\"(.|\n)*?\<span class=\"review-count rating-qualifier\"\>(\s|\t|\n)*?
    (?P<numrevs>\d{1,7})(.|\n)*?\<span\sclass=\"business-attribute\sprice-range\">
    (?P<price>\${1,6})
    \<\/span\>(.|\n)*?<\/li>)  
'''

pattern_matcher = re.compile(pattern, re.VERBOSE)

matches = pattern_matcher.match(yelp_html)

This prints None.

There is definitely text inside of yelp_html.

What am I doing wrong?

Ramin Melikov
  • 1
    "What am I doing wrong?" — [Parsing HTML with regular expressions](https://stackoverflow.com/a/1732454/240443). Use tools appropriate to the job, for example [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). – Amadan Feb 10 '20 at 04:50

2 Answers

0

I see two issues:

  1. You're not using a raw string (prefix the string literal with `r`), so Python interprets escape sequences such as `\n` itself before the regex engine ever sees them, instead of leaving the backslashes in the pattern.

  2. I believe your multiline string will also feed the newlines between lines and the leading spaces on each line into the regex, which you don't want, given that the regex in your link is not formatted that way.
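The first point can be checked in isolation. The snippet below is a minimal sketch, not the asker's pattern: the names `plain`, `raw`, and `text` are made up for illustration. It shows that, under `re.VERBOSE`, a `\n` written in a non-raw string becomes a real newline and is then discarded as insignificant whitespace, while the raw version still matches a newline.

```python
import re

plain = 'a\nb'    # Python converts \n to a real newline before re sees it
raw = r'a\nb'     # the regex engine receives backslash + n and interprets it

text = 'a\nb'     # an "a", a newline, then a "b"

# In VERBOSE mode the real newline in `plain` is ignored, leaving just "ab".
plain_hit = re.compile(plain, re.VERBOSE).fullmatch(text)
raw_hit = re.compile(raw, re.VERBOSE).fullmatch(text)

print(plain_hit)  # None
print(raw_hit)    # a match object covering all of `text`
```

The same effect would apply to every `(.|\n)` group in a non-raw verbose pattern.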

Oliver.R
  • I'm using the `re.VERBOSE` flag. I did another exercise the same way and it worked fine. I tried adding `r` as you suggested anyway, but it still didn't work. – Ramin Melikov Feb 10 '20 at 03:08
0
import re

pattern = r'''
     (?P<list><li\sclass=\"regular-search-result\">(.|\n)*?(?<=\<span\sclass=\"indexed-biz-name\"\>)
     (?P<rank>\d{1,2})
     (.|\n)*?\<span\>
     (?P<name>.+)
     \<\/span\>(.|\n)*?alt=\"
     (?P<stars>\d\.\d)
     \sstar\srating\"(.|\n)*?\<span\sclass=\"review-count\srating-qualifier\"\>(\s|\t|\n)*?
     (?P<numrevs>\d{1,7})
     (.|\n)*?\<span\sclass=\"business-attribute\sprice-range\">
     (?P<price>\${1,6})
     \<\/span\>(.|\n)*?<\/li>)
'''

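# re.VERBOSE ignores unescaped whitespace in the pattern, so every literal space is written as \s
pattern_matcher = re.compile(pattern, re.VERBOSE)

# finditer scans the whole string for matches; match only tries at position 0
matches = pattern_matcher.finditer(yelp_html)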

for item in matches:
    print(item.group('rank', 'name', 'stars', 'numrevs', 'price'))
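Compared with the code in the question, this version is a raw string, writes the spaces inside tags as `\s`, and calls `finditer` instead of `match`. The sketch below tries to show the last two differences on a tiny, made-up HTML string (the `html` value and the `class="x"` attribute are hypothetical, not taken from the real page).

```python
import re

# Hypothetical snippet, far smaller than the real page.
html = '<ul><li class="x">1</li><li class="x">2</li></ul>'

# Literal space: VERBOSE drops it, so this effectively looks for <liclass="x">.
spaced = re.compile(r'<li class="x">(?P<rank>\d+)</li>', re.VERBOSE)
# Escaped whitespace survives VERBOSE mode.
escaped = re.compile(r'<li\sclass="x">(?P<rank>\d+)</li>', re.VERBOSE)

spaced_hit = spaced.search(html)
anchored_hit = escaped.match(html)   # only tries at position 0, where <ul> is
ranks = [m.group('rank') for m in escaped.finditer(html)]

print(spaced_hit)    # None
print(anchored_hit)  # None
print(ranks)         # ['1', '2']
```

If only the first occurrence were needed, `search` would be the usual alternative to `match` here.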
Ramin Melikov