I am trying to match only the <a>...</a> tag in the string below whose text is "Services Team Members - Ryde".

<a href="/cmp/_/job?jk=3711c253b2f3ccef&amp;tk=1a1dof">Services Team Members - Ryde</a>

The challenge is to EXCLUDE the random string after "...p/_/job?". Currently my solution includes the random string in the result:

<a href="/cmp/_/job\?(.*)>(.*)</a>  

I have looked into lookarounds but could not get them to work

http://www.regular-expressions.info/lookaround.html

Julian Wise

1 Answer

Don't (ever) parse HTML with regular expressions. Use a parser.

There is a nice HTML parser available for Python called PyQuery and another one called BeautifulSoup. Use one of them.

from pyquery import PyQuery as pq

doc = pq(url="http://your_url/")
link = doc("a:contains('Services Team Members - Ryde')")

print(link.attr("href"))

prints

'/cmp/_/job?jk=3711c253b2f3ccef&tk=1a1dof'
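
If installing a third-party library is not an option, roughly the same lookup can be sketched with Python 3's standard-library html.parser. This is not one of the two libraries named above, only a stand-in; the class name LinkFinder and the inline html string are made up for illustration.

```python
from html.parser import HTMLParser


class LinkFinder(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Attribute values arrive with entities decoded (&amp; -> &).
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Sample markup taken from the question.
html = '<a href="/cmp/_/job?jk=3711c253b2f3ccef&amp;tk=1a1dof">Services Team Members - Ryde</a>'

finder = LinkFinder()
finder.feed(html)
finder.close()

# Keep only the links whose text matches exactly.
hrefs = [href for href, text in finder.links
         if text == "Services Team Members - Ryde"]
print(hrefs)
```

Anchors without an href attribute are skipped in this sketch, and nested markup inside the link is flattened to plain text.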

And before you are tempted, don't parse a URL with regular expressions either. Use a parser.

from urllib.parse import urlparse, parse_qs  # Python 2: from urlparse import urlparse, parse_qs

url = urlparse('/cmp/_/job?jk=3711c253b2f3ccef&tk=1a1dof')
params = parse_qs(url.query)

print(params)

prints

{'jk': ['3711c253b2f3ccef'], 'tk': ['1a1dof']}
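
Since the question is about excluding the random part after "job?", the same parser can also drop the query string altogether. A small sketch, assuming Python 3 (urllib.parse); the variable names are illustrative only.

```python
from urllib.parse import urlparse, urlunparse

href = '/cmp/_/job?jk=3711c253b2f3ccef&tk=1a1dof'
parts = urlparse(href)

# Simplest case: only the path is wanted.
print(parts.path)  # /cmp/_/job

# More general: rebuild the URL with the query and fragment removed,
# keeping scheme and host if the link happens to be absolute.
clean = urlunparse(parts._replace(query='', fragment=''))
print(clean)  # /cmp/_/job
```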
Tomalak
  • While this is a respected view within the community, I find there are inherent pros and cons to using HTML parsers on HTML as compared to regular expressions. – Julian Wise Apr 02 '16 at 09:55
  • No, there are only cons. And a whole bunch of people who deny it because they think that they are so good with regular expressions that *they* are the exception from the rule. But I am not here to discuss this. If you must use regular expressions, it's your funeral. I'm only here to tell you that regular expressions are not the right tool for the job - and that insisting on using a wrong tool for a job when the proper tool is available and easy to use is silly. – Tomalak Apr 02 '16 at 10:02
  • Disagree. For matching subsections of patterns within a body of text on the page, regex can be easier to use, e.g. accessing "...Find out more about jobs at ..." on a page. With BS4 you'd access the enclosing element, split it into tokens and cycle through the tokens, which is more tedious than regexing anywhere on the page mid-text. The main purpose of this post is to understand how to implement 'lookarounds' in Python. – Julian Wise Apr 02 '16 at 10:19
  • Not buying that, sorry. The main purpose of this post is you searching for a regex to parse HTML with. That lookaround thing is a red herring. See "XY-problem" - you decided that "lookarounds" must be the solution before even asking the question, and your question of course asks how to do $TASK with lookarounds, while completely ignoring that what you *actually* want is to do $TASK, not "lookarounds". You can disagree with me all you like. I even gave you working sample code that is unarguably simpler (easier to maintain) than your approach (and it works!). – Tomalak Apr 02 '16 at 10:26
  • Let me sum this up: *"I want to hang a picture. So I have this nail, and a beer bottle, but it keeps breaking when I hit the nail with it. How can I get the nail in the wall with that beer bottle?"* - "Beer bottles are the wrong tool for the job, use a hammer. It will be easier." - *"Hm. But I read there is a bit of controversy here. Some people say that beer bottles work well to drive nails into walls. And since all sides in an argument are of course always equal, I insist on doing it with a beer bottle."* - "Do it, if you must. It's still wrong." - *"I disagree"*. – Tomalak Apr 02 '16 at 10:34