
I have a very large string and would like to find a small string or value inside it (14 in my example). The catch is that 78 is dynamic, and I get its value from a dict (someDict). A snippet of the string looks like this:

str1='dnas  ANYTHING Here <td class="tr js-name"><a href="/myportal/report/78/abc/xyz/14" title="balh">blah</a></td>'

str2="/myportal/report/"+str(someDict["Id"])+"/abc/xyz/"

p = re.compile(r'str2\s*(.*?)\"')
match = p.search(str1)
if match:
    print(match.group(1))
else:
    print("cant find it")

I know there is something wrong with `p = re.compile(r'str2\s*(.*?)\"')`, since I can't just put str2 inside the pattern string. How do I use re.compile with a variable like this?

alecxe
Ghost

1 Answer


The string you are parsing looks like HTML, and regular expressions are not exactly the best tool for that job. I would use a more specialized tool, an HTML parser like BeautifulSoup:

from urllib.parse import urlparse

from bs4 import BeautifulSoup


data = 'dnas  ANYTHING Here <td class="tr js-name"><a href="/myportal/report/78/abc/xyz/14" title="balh">blah</a></td>'

soup = BeautifulSoup(data, "html.parser")
href = soup.select_one("td.tr.js-name > a")["href"]

parsed_url = urlparse(href)
print(parsed_url.path.split("/")[-1])

Prints 14.

Note that td.tr.js-name > a here is a CSS selector, which is one of the techniques you could use to locate elements in the HTML:

  • > denotes a direct parent->child relationship
  • td.tr.js-name would match a td element having tr and js-name class values
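
If you still want to stay with a regular expression, the pattern in the question fails because the text str2 inside the raw string is matched literally rather than replaced by the variable's value. A minimal sketch of one way around that, assuming someDict["Id"] is 78 (a sample value made up here to match the snippet), is to build the pattern by concatenation and pass the dynamic part through re.escape():

```python
import re

# Assumed sample value; in the question this comes from the real someDict.
someDict = {"Id": 78}

str1 = 'dnas  ANYTHING Here <td class="tr js-name"><a href="/myportal/report/78/abc/xyz/14" title="balh">blah</a></td>'

str2 = "/myportal/report/" + str(someDict["Id"]) + "/abc/xyz/"

# re.escape() neutralizes any regex metacharacters in the dynamic prefix.
# The group then captures everything up to the closing double quote.
p = re.compile(re.escape(str2) + r'([^"]*)"')

match = p.search(str1)
result = match.group(1) if match else None
print(result)
```

This only handles the exact shape shown in the question. An HTML parser remains the more robust choice if the markup varies.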
alecxe
  • This is really cool thanks alecxe, so there is only one: td.tr.js-name > a in html string right? – Ghost Dec 20 '18 at 16:53
  • @user3556956 yeah, `.select_one()` would locate a single element only. `.select()` would locate multiple elements. BeautifulSoup is pretty flexible overall and there are tons of different ways to get to the desired elements. – alecxe Dec 20 '18 at 16:55
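
To illustrate the difference mentioned in the comment above, here is a small sketch using a made-up HTML fragment with two matching cells (the second link and the table wrapper are assumptions for illustration, not part of the original string). `.select_one()` returns the first match only, while `.select()` returns a list of all matches:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML with two matching cells, invented for this example.
data = (
    '<table><tr>'
    '<td class="tr js-name"><a href="/myportal/report/78/abc/xyz/14">first</a></td>'
    '<td class="tr js-name"><a href="/myportal/report/78/abc/xyz/15">second</a></td>'
    '</tr></table>'
)

soup = BeautifulSoup(data, "html.parser")

# select_one() gives the first matching element (or None if nothing matches).
first = soup.select_one("td.tr.js-name > a")
first_id = first["href"].split("/")[-1]

# select() gives a list with every matching element.
all_ids = [a["href"].split("/")[-1] for a in soup.select("td.tr.js-name > a")]

print(first_id)
print(all_ids)
```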