I've written a script in Python in combination with the re module to get the titles of different questions from a webpage. My intention here is not to use BeautifulSoup and still to be able to parse the titles. The way I've used a pattern can do it. However, the output doesn't look so nice. How can I get only the question titles and nothing else?

Here is my try (using re.search()):

import requests
import re

link = "https://stackoverflow.com/questions/tagged/web-scraping"

res = requests.get(link).text
for item in res.splitlines():
    matchitem = re.search(r'hyperlink">(How.+)</a>',item)
    if matchitem:
        print(matchitem.group())

The output I'm getting looks like this (one of several lines):

hyperlink">How to use Selenium check the checkbox lists?</a>

What I want to get is:

How to use Selenium check the checkbox lists?

I'm very new to regex, so I apologize in advance if this isn't a suitable question.

SIM
  • Using regex to parse HTML is a bad idea from the start. Why don't you want to use BeautifulSoup? You might also look at what is, IMHO, a better option: [lxml.html](https://lxml.de/lxmlhtml.html) – Andersson Jul 08 '18 at 09:24

1 Answer

You just need to use group(1), which gets the first captured subgroup, instead of group(), which gets the entire match.

From the docs:

Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned).

So:

>>> item = 'blah blah hyperlink">How to use Selenium check the checkbox lists?</a> stuff'
>>> matchitem = re.search(r'hyperlink">(How.+)</a>',item)
>>> matchitem
<_sre.SRE_Match object; span=(10, 70), match='hyperlink">How to use Selenium check the checkbox>
>>> matchitem.group()
'hyperlink">How to use Selenium check the checkbox lists?</a>'
>>> matchitem.group(1)
'How to use Selenium check the checkbox lists?'

As a side note:

My intention here is not to use BeautifulSoup and still to be able to parse the titles. The way I've used a pattern can do it.

Really? I can easily construct examples where your regex will do the wrong thing. Even without semi-pathological data: if they push a new minor release of the website on Tuesday that doesn't even touch this part of the code, the attributes of that `<a>` tag could show up in a different order, because attribute order carries no meaning in HTML, and suddenly your search fails, while a trivial BeautifulSoup search still works.

If you're doing this for the purpose of learning regular expressions, that may be fine (although really, HTML is not a great example to use for that), but if you're trying to get actual work done, you're better off using a parser.
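
To make the attribute-order point concrete, here is a small sketch. It uses the standard library's `html.parser` in place of BeautifulSoup so that it runs without third-party packages. The two sample tags, the `question-hyperlink` class name, and the `TitleParser` helper are assumptions made up for the illustration:

```python
import re
from html.parser import HTMLParser

# Two hypothetical renderings of the same link; only the attribute order differs
class_last = '<a href="/q/1" class="question-hyperlink">How to parse HTML?</a>'
class_first = '<a class="question-hyperlink" href="/q/1">How to parse HTML?</a>'

pattern = re.compile(r'hyperlink">(How.+)</a>')


class TitleParser(HTMLParser):
    """Collects the text of every <a> whose class list contains question-hyperlink."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, in whatever order the page used
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "question-hyperlink" in classes:
            self._in_title = True
            self.titles.append("")

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_title = False


def regex_titles(html):
    return [m.group(1) for m in pattern.finditer(html)]


def parser_titles(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.titles


for html in (class_last, class_first):
    print(regex_titles(html), parser_titles(html))
```

The regex only matches when `class` happens to be the last attribute before `>`, whereas the parser looks the attribute up by name and so finds the title in both versions.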

abarnert
  • I've to wait 7 minutes to accept your answer @abarnert. – SIM Jul 08 '18 at 09:29
  • This was my first try using any pattern to parse titles. However, could you tell me why it failed (using negative lookbehind and positive lookahead) `'(?<!hyperlink">)How.+(?=)'`? – SIM Jul 08 '18 at 09:45
  • @asmitu `(?<!hyperlink">)` matches a location that isn't immediately preceded by `hyperlink">`. You wanted a positive lookbehind, `(?<=hyperlink">)`. But you do not need it, as it is easy to capture part of a string and get its value via `.group(n)`. – Wiktor Stribiżew Jul 08 '18 at 09:51
  • Thanks a lot @Wiktor Stribiżew. You just saved me from creating another post by providing that answer. – SIM Jul 08 '18 at 09:59
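
The difference between the two lookbehinds discussed in the comments can be checked on the sample string from the answer. This is a sketch only: the lookahead in the comment above appears as an empty `(?=)`, so the `(?=</a>)` used here is an assumption about what was intended.

```python
import re

item = 'blah blah hyperlink">How to use Selenium check the checkbox lists?</a> stuff'

# Negative lookbehind: "How" must NOT be preceded by hyperlink">.
# The only "How" in this string is preceded by it, so nothing matches.
negative = re.search(r'(?<!hyperlink">)How.+(?=</a>)', item)

# Positive lookbehind: "How" must be preceded by hyperlink">.
# Lookarounds consume no characters, so group() is already just the title.
positive = re.search(r'(?<=hyperlink">)How.+(?=</a>)', item)

print(negative)
print(positive.group())
```

Python's `re` only accepts fixed-width lookbehinds, which `hyperlink">` satisfies. The capture-group version, `hyperlink">(How.+)</a>` with `.group(1)`, has no such restriction and is usually easier to read.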