Some background: I am trying to scrape Pastebin's archive page and extract only the paste IDs. IDs are 8 characters long; an example paste link is "https://pastebin.com/A8XGWYBu".

My current code grabs everything from the <a> tags, but it also retrieves information I don't need.

import requests
import re
from bs4 import BeautifulSoup

def get_recent_id():
    
    URL = requests.get('https://pastebin.com/archive', verify=False)

    href_regex = r"<a href=\"\/(.*?)\">(.*?)<\/a>"

    soup = BeautifulSoup(URL.content, 'html.parser')
    pastes = soup.find_all('a')

    # Works well up to here:
    # the regex above extracts the (href, text) pairs I need
    pastes_findall = re.findall(href_regex, str(pastes))

    try:
        for id, t in pastes_findall:
            output = f"{t} -> {id}"
            get_valid = r'(.*?) \-\> ([A-Za-z\d+]{8})'

            final = re.findall(get_valid, output)
            print(final)
    except IndexError:
        pass

get_recent_id()

It breaks at the regex inside the try statement. That regex does not return the information I expect; instead, every iteration prints an empty list ([]).

Example output using the regex in the try statement.

[]
[]
[]
[]
...

I have tested the regex in regex101 and it works just fine against the output of the output variable.

Example in regex101: Regex101 Example Detection

The output I am trying to achieve should contain only the title and paste ID, and should look as follows:

blood sword v1.0 -> cvWdRuaV
lab2 -> eRJY9YAb
example 210526a -> A2sv2shx
2021-05-26_stats.json -> wjsmucFF
2021-05-25_stats.json -> TsXrW7ex
Flake#5595 (466999758096039936) RD -> q8tHsgMz
Untitled -> akrSbCyT
...

I am not sure why I get no matches when regex101 clearly shows matches in 2 groups. If anyone is able to help, I would appreciate it!

Thanks!

Kr0ff
  • It is considered bad form to parse HTML with regexes; use an HTML parser instead: [THE PONY HE COMES](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Patrick Artner May 27 '21 at 11:24
  • You mean to just rely on BeautifulSoup? – Kr0ff May 27 '21 at 11:44
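
For what it's worth, "use an HTML parser" does not have to mean BeautifulSoup specifically. Below is a minimal sketch using only the standard library's html.parser module. The AnchorCollector class and the sample_html string are made up for illustration (they are not from the posts above), and the real archive page markup may differ.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...>
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical markup, standing in for the downloaded page
sample_html = (
    '<table><tr><td><a href="/A8XGWYBu">Untitled</a></td></tr></table>'
    '<a href="/archive">Archive</a>'
    '<a name="top">no href</a>'
)

collector = AnchorCollector()
collector.feed(sample_html)
links = collector.links
print(links)
```

BeautifulSoup is still the more convenient choice here; the sketch only shows that the href and text can be read from parser events without running a regex over the serialized HTML.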

3 Answers

You can get the desired output in fewer lines of code. Make sure your bs4 version is up to date, or at least >= 4.7.0, so that it supports the CSS pseudo-class selector (:has) that I've used in the script.

import requests
from bs4 import BeautifulSoup

link = 'https://pastebin.com/archive'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select("table.maintable tr:has(> td > a[href]) > td:nth-of-type(1) > a"):
        title = item.text
        _id = item.get("href").lstrip("/")
        print(title," -> ",_id)

Output at this moment (truncated):

new_meta_format  ->  JjMxWDzh
Paste Ping  ->  bH54QCb9
Untitled  ->  EEMQigvX
free checked credit cards  ->  b6LE4e78
Untitled  ->  wJA8Axbb
Untitled  ->  fFFrEJnv
Untitled  ->  A8XGWYBu
Ejercicio01  ->  CqP4grhP
Ejercicio01  ->  nhxM8Tca
Untitled  ->  8Y485jwG
f_get_product_balance_stock_exclude_reserved  ->  hc64MsgH
in_product_balance_stock_reserved  ->  ZGXgRWKQ
My Log File  ->  24TnZK2F
Untitled  ->  tvbwuWkL
MITHU

I think you do not need a regex. You can get the href value of each paste link, strip the / char(s), and then build the output by appending -> and the text of the a element:

[i["href"].strip('/') + " -> " + i.get_text() for i in pastes]

The whole method will look like

import requests
from bs4 import BeautifulSoup

def get_recent_id():
    URL = requests.get('https://pastebin.com/archive', verify=False)
    soup = BeautifulSoup(URL.content, 'html.parser')
    # href=True skips <a> tags without an href, which would raise KeyError below
    pastes = soup.find_all('a', href=True)
    return [i["href"].strip('/') + " -> " + i.get_text() for i in pastes]
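
Since this returns every link on the page, the navigation links come along too. One regex-free way to narrow it down is to keep only hrefs shaped like a paste ID. The sketch below assumes an ID is exactly 8 ASCII letters or digits, as described in the question; is_paste_id and sample_links are hypothetical names and data, not part of the code above.

```python
def is_paste_id(candidate):
    """Return True for strings shaped like a paste ID: 8 ASCII letters or digits."""
    return len(candidate) == 8 and candidate.isascii() and candidate.isalnum()


# Hypothetical (href, text) pairs, standing in for what find_all('a') yields
sample_links = [
    ("/A8XGWYBu", "Untitled"),
    ("/archive", "Archive"),
    ("/doc_api", "API"),
    ("/languages", "Syntax languages"),
    ("/eRJY9YAb", "lab2"),
]

rows = []
for href, text in sample_links:
    paste_id = href.strip("/")
    if is_paste_id(paste_id):
        rows.append(f"{text} -> {paste_id}")

for row in rows:
    print(row)
```

A shape check like this is only a heuristic: any other 8-character alphanumeric path (for example "/settings") would pass it as well, so restricting the search to the archive table is likely the more robust option.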
Wiktor Stribiżew

After playing around a little, I was able to find an answer to my own question.

@Wiktor, your answer was good, but it still returned some results that I don't need.

The final code looks like this:

import re
import requests
from bs4 import BeautifulSoup

def get_recent_id():
    # verify=False disables TLS certificate checks, as in the original post
    response = requests.get('https://pastebin.com/archive', verify=False)

    href_regex = r"<a href=\"\/(.*?)\">(.*?)<\/a>"
    # Title, then " -> ", then 8 ID characters
    get_valid = r'(.*?) \-\> ([A-Za-z\d+]{8})'

    soup = BeautifulSoup(response.content, 'html.parser')
    pastes = soup.find_all('a')

    # Extract (href, text) pairs from the serialized <a> tags
    pastes_findall = re.findall(href_regex, str(pastes))

    for paste_id, title in pastes_findall:
        output = f"{title} -> {paste_id}"
        final = re.search(get_valid, output)

        # re.search returns None when nothing matches, so skip those links
        if final is not None:
            print(final.group(0))

get_recent_id()

Essentially, I had a couple of other things in the output variable locally that I did not show in my post. After removing those, the code I originally posted worked (I should have tried that earlier...).

I was then getting a "NoneType" error, because re.search returns None when there is no match, but a simple if statement resolved this as well.
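
To illustrate the None check in isolation, here is a small sketch that runs on made-up strings instead of the live page. The samples list is hypothetical. The pattern is a slightly tightened variant of mine, not the exact one above: it drops the "+" from the character class (inside [...] a "+" is a literal plus sign, not a quantifier) and adds a "$" anchor so that longer paths such as "languages" are not cut down to their first 8 characters.

```python
import re

# Assumed variant of the pattern: title, " -> ", then exactly 8 letters/digits
pattern = re.compile(r"(.*?) -> ([A-Za-z\d]{8})$")

# Hypothetical "title -> href" lines, similar to the output variable
samples = [
    "Untitled -> A8XGWYBu",
    "Archive -> archive",
    "Syntax languages -> languages",
]

matched = []
for line in samples:
    m = pattern.search(line)
    if m is None:
        # No match: calling m.group() here would raise AttributeError
        continue
    title, paste_id = m.groups()
    matched.append((title, paste_id))

print(matched)
```

Using m.groups() gives the title and the ID separately, whereas group(0) returns the whole matched string.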

In the end, I am now getting the output I need:

$ ./tool.py

Paste Ping -> bH54QCb9
Untitled -> EEMQigvX
free checked credit cards -> b6LE4e78
Untitled -> wJA8Axbb
Untitled -> fFFrEJnv
Untitled -> A8XGWYBu
Ejercicio01 -> CqP4grhP
Ejercicio01 -> nhxM8Tca
Untitled -> 8Y485jwG
f_get_product_balance_stock_exclude_reserved -> hc64MsgH
in_product_balance_stock_reserved -> ZGXgRWKQ
My Log File -> 24TnZK2F
Untitled -> tvbwuWkL
Woocommerce Minimum Order Amount -> j35Hg0Ci
...

Thanks for the answer!

Kr0ff