Python Regex Match Line If Ends With?

Question

This is what I'm trying to scrape:

        <p>Some.Title.html<br />
<a href="https://www.somelink.com/yep.html" rel="nofollow">https://www.somelink.com/yep.html</a><br />
Some.Title.txt<br />
<a href="https://www.somelink.com/yeppers.txt" rel="nofollow">https://www.somelink.com/yeppers.txt</a><br />

I have tried several variations of the following:

match = re.compile('^(.+?)<br \/><a href="https://www.somelink.com(.+?)">',re.DOTALL).findall(html)

I am looking to match lines both with and without the "p" tag; the "p" tag only occurs on the first instance. I'm terrible at Python and pretty rusty, and I have searched through here and Google, but nothing seemed to be quite the same. Thanks for any help. I really do appreciate the help I get here when I am stuck.

Desired output is an index:

<a href="Some.Title.html">http://www.SomeLink.com/yep.html</a>
<a href="Some.Title.txt">http://www.SomeLink.com/yeppers.txt</a>

A tip: Don't use `regex` to parse html, use something built for that, like BeautifulSoup. — Vinícius Figueiredo, Jul 31 '17 at 02:43
I have no clue how to use Beautiful Soup. It is so rare that I get into anything like this. I appreciate your advice; I really should learn it for these silly moments, however. — Bobby Peters, Jul 31 '17 at 02:50
It's just that if you really need to dig into html parsing, it's really recommended to use something written for that, because `regex` can't handle nested patterns. What would be the desired output? — Vinícius Figueiredo, Jul 31 '17 at 02:55
Desired output is an index. It won't come up right here, so I'll put it in the description. — Bobby Peters, Jul 31 '17 at 03:02
*It is so rare I get into anything like this* => Rare opportunity to learn! Beautiful Soup is a better solution for this use case, as @ViníciusAguiar mentions — karthikr, Jul 31 '17 at 03:13
I appreciate the advice; I will put some time into BeautifulSoup. Thanks, folks — Bobby Peters, Jul 31 '17 at 03:18
https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 This might be helpful :) As the other comments above have mentioned, do try out BeautifulSoup. — Julian Chan, Jul 31 '17 at 03:21

score 3 · Answer 1 · answered Jul 31 '17 at 03:20

Using the Beautiful Soup and requests modules would be perfect for something like this instead of regex, as the commenters noted above.

import requests
import bs4

html_site = 'https://www.google.com'  # or whatever site you need scraped; requests needs the scheme (http:// or https://)
site_data = requests.get(html_site)  # downloads the page and returns a Response object
site_parsed = bs4.BeautifulSoup(site_data.text, 'html.parser')  # parses the page text into a BeautifulSoup object
a_tags = site_parsed.select('a')  # selects every 'a' tag and returns them as a list

This is just a simple snippet that selects all the 'a' tags from the page and stores them in a list. From each tag you can read the link target with tag['href'] and the link text with tag.get_text(), which is what you need to build the index you illustrated above. I'd advise checking a bs4 tutorial and the actual docs.
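
To tie this back to the markup in the question, here is a minimal sketch that pairs each link with the title text just before it. It assumes the title is always the nearest non-blank text node preceding the 'a' tag, as in the sample; the function name build_index and the output layout (title in href, URL as link text, copied from the desired output in the question) are illustrative choices, not something bs4 provides.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup copied from the question.
html = """<p>Some.Title.html<br />
<a href="https://www.somelink.com/yep.html" rel="nofollow">https://www.somelink.com/yep.html</a><br />
Some.Title.txt<br />
<a href="https://www.somelink.com/yeppers.txt" rel="nofollow">https://www.somelink.com/yeppers.txt</a><br />
"""


def build_index(markup):
    """Return (title, url) pairs; hypothetical helper for illustration."""
    soup = BeautifulSoup(markup, "html.parser")
    pairs = []
    for link in soup.find_all("a", href=True):
        title = ""
        # Walk backwards through the document until we hit non-blank text.
        # Assumption: that text is the title belonging to this link.
        for node in link.previous_elements:
            if isinstance(node, NavigableString) and node.strip():
                title = node.strip()
                break
        pairs.append((title, link["href"]))
    return pairs


if __name__ == "__main__":
    for title, url in build_index(html):
        # Layout mirrors the desired output shown in the question.
        print(f'<a href="{title}">{url}</a>')
```

Because the parser handles the tags, it makes no difference whether a title line starts with the "p" tag or not, which was the part the regex struggled with.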
