This is what I'm trying to scrape:

        <p>Some.Title.html<br />
<a href="https://www.somelink.com/yep.html" rel="nofollow">https://www.somelink.com/yep.html</a><br />
Some.Title.txt<br />
<a href="https://www.somelink.com/yeppers.txt" rel="nofollow">https://www.somelink.com/yeppers.txt</a><br />

I have tried several variations of the following:

match = re.compile('^(.+?)<br \/><a href="https://www.somelink.com(.+?)">',re.DOTALL).findall(html)

I am looking to match lines both with and without the "p" tag; the "p" tag only occurs on the first instance. I am pretty rusty at Python, and I have searched here and on Google, but nothing seemed to be quite the same. Thanks for any help; I really do appreciate the help I get here when I am stuck.

Desired output is an index:

<a href="Some.Title.html">http://www.SomeLink.com/yep.html</a>
<a href="Some.Title.txt">http://www.SomeLink.com/yeppers.txt</a>
Bobby Peters
  • A tip: Don't use `regex` to parse HTML; use something built for that, like BeautifulSoup. – Vinícius Figueiredo Jul 31 '17 at 02:43
  • I have no clue how to use Beautiful Soup. It is so rare that I get into anything like this. I appreciate your advice; I really should learn it for these silly moments, however. – Bobby Peters Jul 31 '17 at 02:50
  • It's just that if you really need to dig into HTML parsing, it's recommended to use something written for that, because `regex` can't handle nested patterns. What would be the desired output? – Vinícius Figueiredo Jul 31 '17 at 02:55
  • The desired output is an index. It won't come up right here, so I'll put it in the description. – Bobby Peters Jul 31 '17 at 03:02
  • *It is so rare I get into anything like this* => Rare opportunity to learn! Beautiful Soup is a better solution for this use case, as @ViníciusAguiar mentions. – karthikr Jul 31 '17 at 03:13
  • I appreciate the advice; I will put some time into BeautifulSoup. Thanks, folks. – Bobby Peters Jul 31 '17 at 03:18
  • https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 This might be helpful :) As the other comments above have mentioned, do try out BeautifulSoup. – Julian Chan Jul 31 '17 at 03:21

1 Answer

Using the Beautiful Soup and requests modules would be perfect for something like this instead of regex, as the commenters noted above.

import requests
import bs4

html_site = 'https://www.google.com'  # or whatever site you need scraped; requests needs the scheme (http:// or https://)
site_data = requests.get(html_site)  # downloads the page into a requests Response object
site_parsed = bs4.BeautifulSoup(site_data.text, 'html.parser')  # parses the page text into a bs4 object
a_tags = site_parsed.select('a')  # selects all 'a' tags and returns a list of them

This is just simple code that selects all the 'a' tags from the HTML page and stores them in a list; from that list you can build output in the format you illustrated above. I'd advise checking here for a nice tutorial on bs4 and here for the actual docs.
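As a rough sketch of how the selected tags could be paired with the titles that precede them, the snippet below parses the sample HTML from the question directly instead of downloading a page. The names `SAMPLE_HTML`, `title_before`, and `extract_index` are made up for illustration, and the sketch assumes each title is the nearest non-blank text before its link, which holds for the sample but may not for the real page.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup copied from the question (the <p> tag is never closed there).
SAMPLE_HTML = """        <p>Some.Title.html<br />
<a href="https://www.somelink.com/yep.html" rel="nofollow">https://www.somelink.com/yep.html</a><br />
Some.Title.txt<br />
<a href="https://www.somelink.com/yeppers.txt" rel="nofollow">https://www.somelink.com/yeppers.txt</a><br />
"""


def title_before(link):
    """Return the nearest non-blank text that appears before `link`, or None."""
    for node in link.previous_elements:
        # Skip tags (such as <br/>) and whitespace-only strings.
        if isinstance(node, NavigableString) and node.strip():
            return node.strip()
    return None


def extract_index(html):
    """Return a list of (title, url) pairs, one per <a href=...> tag."""
    soup = BeautifulSoup(html, "html.parser")
    return [(title_before(a), a["href"]) for a in soup.find_all("a", href=True)]


if __name__ == "__main__":
    for title, url in extract_index(SAMPLE_HTML):
        print(title, url)
```

Each pair can then be formatted into whatever index layout is needed. For a live page, `site_data.text` from the code above could be passed to `extract_index` in place of `SAMPLE_HTML`.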

Matthew Barlowe