A bit of backstory: I am trying to scrape Pastebin's archive page and get only the IDs of the pastes. IDs are 8 characters long, and an example link to a paste looks like this: "https://pastebin.com/A8XGWYBu"
The code I have written is currently able to grab all the data from the <a> tags, but it also retrieves information that I do not need.
import requests
import re
from bs4 import BeautifulSoup

def get_recent_id():
    URL = requests.get('https://pastebin.com/archive', verify=False)
    href_regex = r"<a href=\"\/(.*?)\">(.*?)<\/a>"
    soup = BeautifulSoup(URL.content, 'html.parser')
    pastes = soup.find_all('a')
    # Works good here
    # prints the necessary things using the regex above
    pastes_findall = re.findall(href_regex, str(pastes))
    try:
        for id, t in pastes_findall:
            output = f"{t} -> {id}"
            get_valid = r'(.*?) \-\> ([A-Za-z\d+]{8})'
            final = re.findall(get_valid, output)
            print(final)
    except IndexError:
        pass

get_recent_id()
Where it breaks is the regex in the try statement. It does not return the information that I am expecting; instead it returns empty [] brackets.
Example output using the regex in the try statement:
[]
[]
[]
[]
...
I have tested the regex on regex101 and it works just fine against the value of the output variable.
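To narrow this down without hitting the site, here is a minimal offline sketch of the same two regex steps. The two patterns are copied from my function; the sample_links string and the check_links helper are hypothetical stand-ins made up for illustration, not real Pastebin output. It suggests, though I have not confirmed this against the live page, that at least some of the empty lists may come from links whose href is not an 8-character ID.

```python
import re

# Both patterns are copied unchanged from get_recent_id().
href_regex = r"<a href=\"\/(.*?)\">(.*?)<\/a>"
get_valid = r'(.*?) \-\> ([A-Za-z\d+]{8})'

# Hypothetical stand-in for str(pastes): two navigation-style links
# followed by two paste-style links (assumed shapes, not real page data).
sample_links = (
    '<a href="/archive">archive</a>, '
    '<a href="/login">login</a>, '
    '<a href="/eRJY9YAb">lab2</a>, '
    '<a href="/akrSbCyT">Untitled</a>'
)

def check_links(html):
    """Return (output, final) pairs, mirroring the loop in get_recent_id()."""
    results = []
    for paste_id, title in re.findall(href_regex, html):
        output = f"{title} -> {paste_id}"
        # re.findall returns an empty list when the pattern does not match.
        results.append((output, re.findall(get_valid, output)))
    return results

for output, final in check_links(sample_links):
    print(f"{output!r}: {final}")
```

With this made-up input, the two navigation-style links give an empty list, while the two paste-style links give a single (title, id) tuple each.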
The output I am trying to achieve should contain only the title and paste ID, and should look as follows:
blood sword v1.0 -> cvWdRuaV
lab2 -> eRJY9YAb
example 210526a -> A2sv2shx
2021-05-26_stats.json -> wjsmucFF
2021-05-25_stats.json -> TsXrW7ex
Flake#5595 (466999758096039936) RD -> q8tHsgMz
Untitled -> akrSbCyT
...
I am not sure why I get nothing back when regex101 clearly shows matches in 2 groups. If anyone is able to help, I would appreciate it!
Thanks!