Python Re.Search: How to find a substring between two strings, that must also contain a specific substring

Question

I am writing a little script to get my F@H user data from a basic HTML page.

I want to locate my username on that page and the numbers before and after it.

All the data I want is between two HTML <tr> and </tr> tags.

I am currently using this:

re.search(r'<tr>(.*?)</tr>', htmlstring)

I know this works for any substring, as every Google result for my question shows. The difference here is that I need a match only when that substring also contains a specific word.

However, that only returns the first string between those two delimiters, not all of them.

This pattern occurs hundreds of times on the page. I suspect it doesn't get them all because I'm not handling all the newline characters correctly but I'm not sure.

If it returned all of them, I could at least go through each result.group() to find the one that contains my username, but I can't even do that.

I have been fiddling with different regular expressions for ages now but, to my frustration, can't figure out which one I need.

TL;DR - I need a re.search() pattern that finds a substring between two words, that also contains a specific word.

https://stackoverflow.com/questions/58124584/python-find-a-substring-between-two-strings-based-on-the-last-occurence-of-the — Je Je, Jun 04 '20 at 00:04
It may not be the best way to proceed. Can you share the webpage URL? — Je Je, Jun 04 '20 at 00:06
Depending on the case, you might want to do this: https://stackoverflow.com/questions/57578730/find-a-tag-using-text-it-contains-using-beautifulsoup — Je Je, Jun 04 '20 at 00:07
@NonoLondon Thanks for responding, but what you linked me to first is what I get everywhere when I google my problem. I know how to get any substring between two points, as my code is already doing that. I also need to limit it to cases where that substring contains a specific word. — spekofthedevil, Jun 04 '20 at 00:10
OK, and what about the second example? Can you share the webpage so that I can have a think? — Je Je, Jun 04 '20 at 00:11
@NonoLondon That second suggestion may be what I'm looking for, all right. It is a different approach, using BeautifulSoup's CSS tag extraction. That might make more sense here instead of me reinventing the wheel. Thank you. — spekofthedevil, Jun 04 '20 at 00:12
@NonoLondon webpage URL is https://apps.foldingathome.org/teamstats/team3446.html — spekofthedevil, Jun 04 '20 at 00:13

score 0 · Answer 1 · answered Jun 04 '20 at 00:28

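If I understand correctly, something like this might work:
<tr>(?:(?:(?:(?!<\/tr>).)*?)\bWORD\b(?:.*?))<\/tr>

<tr> matches the literal "<tr>"
(?:(?:(?!<\/tr>).)*?) matches any characters, as few as possible, without ever crossing a "</tr>"
\bWORD\b matches WORD as a whole word
(?:.*?) matches any characters, as few as possible (the extra closing parenthesis after it in the pattern ends the outer group)
<\/tr> matches the literal "</tr>"

Note that "." does not match newlines by default, so in Python pass re.DOTALL (re.S) if a row spans several lines.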
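
The breakdown above can be tried out in Python on a small inline string. This is only a sketch under assumptions: the sample markup and the helper name find_row are made up for illustration, and the real page layout may differ. It uses re.escape so the word is matched literally, and re.DOTALL so a row that spans several lines is still found.

```python
import re

# Hypothetical sample markup; the real page may be laid out differently.
html = """
<table>
<tr><td>1</td><td>Alice</td><td>100</td></tr>
<tr>
  <td>2</td><td>SteveMoody</td><td>250</td>
</tr>
<tr><td>3</td><td>Bob</td><td>75</td></tr>
</table>
"""


def find_row(html, word):
    """Return the first <tr>...</tr> block containing `word` as a whole word, or None."""
    pattern = (
        r"<tr>"
        r"(?:(?!</tr>).)*?"        # any characters, but never step over a </tr>
        rf"\b{re.escape(word)}\b"  # the word itself, escaped so it is matched literally
        r"(?:(?!</tr>).)*?"        # the rest of the row
        r"</tr>"
    )
    # re.DOTALL lets "." match newlines, so multi-line rows are handled too
    match = re.search(pattern, html, flags=re.DOTALL)
    return match.group(0) if match else None


row = find_row(html, "SteveMoody")
print(row)
```

Because the first part of the pattern cannot cross a "</tr>", a match that starts in Alice's row cannot run on into the next row, so only the row that really contains the word is returned.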


Je Je · Answer 2 · 2020-06-04T01:54:47.810

There are a few ways to do it but I prefer the pandas way:


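from io import BytesIO
from urllib import request

import pandas as pd # you need to install pandas (read_html also needs lxml, or bs4 with html5lib)

base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'

web_request = request.urlopen(url=base_url).read()

# read_html returns a list of DataFrames, one per matching <table>;
# wrapping the bytes in BytesIO avoids passing literal HTML, which newer pandas versions deprecate
tables = pd.read_html(BytesIO(web_request), attrs={'class': 'members'})
web_df = tables[0].set_index(keys=['Name'])
# print(web_df)

user_name_to_find_in_table = 'SteveMoody'
user_name_df = web_df.loc[user_name_to_find_in_table]
print(user_name_df)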

Then there are plenty of other ways to do this: BeautifulSoup's find method, CSS selectors, or re, as Peter suggests.

Using BeautifulSoup's find method together with re, you can do it the following way:

import re
from urllib import request

from bs4 import BeautifulSoup as bs # you need to install beautifulsoup4

base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'

web_request = request.urlopen(url=base_url).read()

page_soup = bs(web_request, 'lxml') # you also need to install lxml

user_name_to_find_in_table = 'SteveMoody'

# find the first <td> whose text contains the user name (case-insensitive),
# then climb up to its parent <tr>
row_tag = page_soup.find(
    lambda t: t.name == "td"
              and re.search(re.escape(user_name_to_find_in_table), t.text, flags=re.I) is not None
).find_parent(name="tr")

# get_text() returns only the cell text, so there are no tags left to strip
print(row_tag.get_text(separator=' ', strip=True))

Using BeautifulSoup and CSS selectors (no re, only BeautifulSoup):

from bs4 import BeautifulSoup as bs # you need to install beautifulsoup
from urllib import request


base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'

web_request = request.urlopen(url=base_url).read()

page_soup = bs(web_request, 'lxml') # need to install lxml and bs4(beautifulsoup for Python 3+)

user_name_to_find_in_table = 'SteveMoody'

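# :-soup-contains replaces the deprecated :contains pseudo-class (soupsieve 2.1+)
row_tag = page_soup.select_one(f'tr:has(> td:-soup-contains("{user_name_to_find_in_table}"))')

# get_text() returns only the cell text, so there are no tags left to strip
print(row_tag.get_text(separator=' ', strip=True))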

In your case I would favor the pandas example, as you keep the headers and can easily get other stats, and it runs very quickly.
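
If installing pandas or BeautifulSoup is not an option, the same parser-based idea can be sketched with the standard library's html.parser. This is a swapped-in technique, not one of the approaches above, and everything in it is an assumption made for illustration: the class name RowFinder, the sample markup, and the column names. It collects the cell texts of each row and keeps the rows in which one cell equals the user name.

```python
from html.parser import HTMLParser


class RowFinder(HTMLParser):
    """Collect the cell texts of every <tr> that has a cell equal to `name`."""

    def __init__(self, name):
        super().__init__()
        self.name = name
        self.in_cell = False
        self.current_row = []
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.current_row = []  # start collecting a new row
        elif tag in ("td", "th"):
            self.in_cell = True
            self.current_row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.in_cell:
            self.in_cell = False
            self.current_row[-1] = self.current_row[-1].strip()
        elif tag == "tr" and self.name in self.current_row:
            self.matches.append(self.current_row)

    def handle_data(self, data):
        if self.in_cell:
            self.current_row[-1] += data  # text inside the current cell


# Hypothetical sample markup; the real page may use different columns.
html = (
    "<table>"
    "<tr><th>Rank</th><th>Name</th><th>Credit</th></tr>"
    "<tr><td>1</td><td>Alice</td><td>100</td></tr>"
    "<tr><td>2</td><td> SteveMoody </td><td>250</td></tr>"
    "</table>"
)

finder = RowFinder("SteveMoody")
finder.feed(html)
print(finder.matches)  # [['2', 'SteveMoody', '250']]
```

Unlike the pandas version, this sketch does not keep the headers attached to the values, so it is mainly useful when no third-party library can be installed.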

Using Re:

So far, the best input is Peter's answer, so I just adapted it to Python code (happy for it to be edited), as this solution doesn't need any extra libraries installed.

import re
from urllib import request




base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'

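# decode the bytes (assuming the page is UTF-8) instead of calling str() on them
web_request = request.urlopen(url=base_url).read().decode('utf-8')
user_name_to_find_in_table = 'SteveMoody'
# \b must come before the name; re.escape keeps any special characters in it literal
re_pattern = rf'<tr>(?:(?:(?:(?!</tr>).)*?)\b{re.escape(user_name_to_find_in_table)}\b(?:.*?))</tr>'
# re.S lets "." match newlines, so rows spanning several lines are found too
res = re.search(pattern=re_pattern, string=web_request, flags=re.S)

print(res.group(0) if res else 'no matching row found')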

Helpful link on using variables in a regex: Stack Overflow
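
The question also mentions collecting every row first and then picking out the one with the user name. A minimal sketch of that idea follows, assuming made-up sample markup and a hypothetical helper name rows_containing. re.search stops at the first match, whereas re.findall returns all of them, and re.DOTALL is what lets a row spread over several lines match.

```python
import re


def rows_containing(html, word):
    """Return the inner text of every <tr>...</tr> block that contains `word`."""
    # findall returns every match; DOTALL lets "." cross newlines inside a row
    rows = re.findall(r"<tr>(.*?)</tr>", html, flags=re.DOTALL)
    return [row for row in rows if word in row]


# Hypothetical sample markup; the real page may be laid out differently.
html = (
    "<tr><td>1</td><td>Alice</td><td>100</td></tr>\n"
    "<tr>\n<td>2</td>\n<td>SteveMoody</td>\n<td>250</td>\n</tr>\n"
    "<tr><td>3</td><td>Bob</td><td>75</td></tr>"
)

matches = rows_containing(html, "SteveMoody")
# split the matching row into its cells to get the numbers before and after the name
cells = re.findall(r"<td>(.*?)</td>", matches[0], flags=re.DOTALL)
print(cells)  # ['2', 'SteveMoody', '250']
```

This is easier to read than the single large pattern, at the cost of scanning every row, which should not matter for a page with a few hundred rows.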
