
This passed on https://regex101.com/ without any issues. Did I miss anything? The entire string is in one line.

import re

def get_title_and_content(html):
  # Hard-coded sample input; the entire string is on one line.
  html = """<!DOCTYPE html>     <html>       <head>       <title>Change delivery date with Deliv</title>       </head>       <body>       <div class="gkms web">The delivery date can be changed up until the package is assigned to a driver.</div>       </body>     </html>  """
  title_pattern = re.compile(r'<title>(.*?)</title>(.*)')
  match = title_pattern.match(html)
  if match:
    print('successfully extract title and answer')
    return match.groups()[0].strip(), match.groups()[1].strip()
  else:
    print('unable to extract title or answer')
Yang
  • 1
    Replace 'match = title_pattern.match(html)' with 'match = title_pattern.search(html)' Maybe? – Tom May 30 '18 at 22:45
  • Interesting! It works. Why does that matter? – Yang May 30 '18 at 22:48
  • 2
    I would not recommend parsing *HTML* with *regex*; maybe take a look at [THIS](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). On a small scale it might not be too bad, but for anything more than what you have above I would recommend finding another tool for HTML. – PixelEinstein May 30 '18 at 22:51
  • 3
    @Yang Because [`re.match`](https://docs.python.org/2/library/re.html#re.match) only matches from the beginning of the provided string. If you want to get matches anywhere on the text (such as your specific use case), you need to use [`re.search`](https://docs.python.org/2/library/re.html#re.search). Refer to [search() vs. match()](https://docs.python.org/2/library/re.html#search-vs-match) for more information. – Matias Cicero May 30 '18 at 22:53
  • 1
    Or change your regexp to `r'.*<title>(.*?)</title>(.*)'` – Gelineau May 30 '18 at 22:55
  • You could also use .findall(), which I think returns a list – Tom May 30 '18 at 22:58
  • 2
    To follow up on what @PixelEinstein said: using `BeautifulSoup`, this whole thing (including parsing the answer out of that mess of tags, which includes part of the head and all of the body, and which I assume you were planning to get to once you finished this part?) is `return soup.title.text, soup.div.text`. – abarnert May 30 '18 at 23:20

2 Answers

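To summarize the comments:

Use title_pattern.search(html) instead of title_pattern.match(html).

match() only tries to match at the beginning of the string, whereas search() scans the string for the first position where the pattern matches. Here the string starts with <!DOCTYPE html>, not <title>, so match() returns None. title_pattern.findall(html) would also work, but it returns a list of matches (a list of tuples when the pattern has more than one group) instead of a single match object.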
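As a small sketch of the difference, the snippet below runs the question's pattern through match(), search(), and findall(). The sample string is a shortened, single-line version of the HTML from the question, and the variable names are only illustrative.

```python
import re

# Shortened single-line sample modelled on the HTML in the question.
html = ('<!DOCTYPE html> <html> <head> <title>Change delivery date with Deliv</title> '
        '</head> <body> <div class="gkms web">The delivery date can be changed.</div> '
        '</body> </html>')

title_pattern = re.compile(r'<title>(.*?)</title>(.*)')

# match() only tries position 0, where the string starts with "<!DOCTYPE",
# so the pattern cannot match and the result is None.
anchored = title_pattern.match(html)

# search() scans forward and returns the first place the pattern fits.
found = title_pattern.search(html)
title = found.group(1).strip()

# findall() with two groups returns a list of (group1, group2) tuples.
all_matches = title_pattern.findall(html)

print(anchored)
print(title)
print(len(all_matches))
```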

Also, as mentioned, BeautifulSoup would pay off more in the long run, since regular expressions are not well suited to parsing HTML.
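BeautifulSoup is a third-party package, so as a dependency-free illustration of the same parser-based idea, here is a rough sketch using the standard library's html.parser instead. This is a stand-in, not BeautifulSoup itself: the class name TitleAndDivParser is hypothetical, and it assumes there is one title and no nested div elements, as in the question's sample.

```python
from html.parser import HTMLParser


class TitleAndDivParser(HTMLParser):
    """Collects the text inside <title> and <div> elements (no nesting support)."""

    def __init__(self):
        super().__init__()
        self._current = None  # tag whose text is currently being collected
        self.title = ''
        self.div_text = ''

    def handle_starttag(self, tag, attrs):
        if tag in ('title', 'div'):
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == 'title':
            self.title += data
        elif self._current == 'div':
            self.div_text += data


def get_title_and_content(html):
    parser = TitleAndDivParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.div_text.strip()


html = ('<!DOCTYPE html> <html> <head> <title>Change delivery date with Deliv</title> '
        '</head> <body> <div class="gkms web">The delivery date can be changed up until '
        'the package is assigned to a driver.</div> </body> </html>')

title, content = get_title_and_content(html)
print(title)
print(content)
```

With BeautifulSoup installed, the comment above suggests the equivalent is simply soup.title.text and soup.div.text.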

Tom

The comments are correct: re.match() only matches at the beginning of the string. That being said, you can keep using match() by prepending .* to your regex, so that the pattern consumes everything before <title>:

title_pattern = re.compile(r'.*<title>(.*?)</title>(.*)') 
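One caveat worth sketching: by default "." does not match a newline, so the leading .* only helps when everything before <title> is on the same line, as it is in the question. The sample strings and variable names below are illustrative assumptions, not taken from the question.

```python
import re

title_pattern = re.compile(r'.*<title>(.*?)</title>(.*)')

one_line = '<html> <head> <title>Change delivery date with Deliv</title> </head> </html>'
multi_line = '<html>\n<head>\n<title>Change delivery date with Deliv</title>\n</head>\n</html>'

# The leading .* absorbs everything before <title>, so match() succeeds.
single = title_pattern.match(one_line)

# "." stops at the first newline, so the leading .* cannot reach <title> here.
broken = title_pattern.match(multi_line)

# re.DOTALL makes "." match newlines too, which restores the match.
dotall_pattern = re.compile(r'.*<title>(.*?)</title>(.*)', re.DOTALL)
fixed = dotall_pattern.match(multi_line)

print(single.group(1))
print(broken)
print(fixed.group(1))
```

If the HTML may span several lines, search() or an HTML parser is probably the simpler choice.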
Boergler