I want to get the raw data of a password-locked Pastebin link using Python, but I can't figure out how to do it.

Is it impossible to get Pastebin raw data using Python's requests module and the POST method? I tried the code below, but it raises an error.

import requests

url = "https://pastebin.com/URL"
headers = {"User-Agent": "Mozilla/5.0"}  # placeholder; the actual headers were not shown
pass_data = {"PostPasswordVerificationForm[password]": "password"}
res = requests.post(url, headers=headers, data=pass_data)
print(res.text)

It raises the following error:

raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='pastebin.com', port=443): 
Max retries exceeded with url: /URL (Caused by SSLError(SSLCertVerificationError
(1, '[SSL: CERTIFICATE_VERIFY_FAILED]certificate verify failed: 
self signed certificate in certificate chain (_ssl.c:1123)')))

Can someone please tell me which approach I can use?

martineau
vantabeam

1 Answer

Note: Consider using Pastebin's API and Pastebin's scraping API.

Your certificate verification failed. Typical causes are a proxy, Tor, a VPN, a site without a valid certificate, or a misconfigured server. If you still want to proceed, pass verify=False to requests.post(), keeping in mind that this turns TLS certificate checking off entirely:

requests.post(url="...", verify=False)

If you are using a VPN, perhaps you've been provided with a root certificate for your machine; you can point requests at it with verify="path to CA bundle". (cert=("path to cert", "path to key") is for client-side certificates, which is a different thing.)
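A minimal sketch of keeping verification on while trusting an extra root certificate, for the case where a VPN, proxy or antivirus product re-signs HTTPS traffic. The helper name `post_with_ca_bundle` and the idea of exporting the root certificate to a PEM file are assumptions for illustration; only `requests.post` and its `verify` and `timeout` parameters come from the library itself.

```python
import os

import requests


def post_with_ca_bundle(url, data, ca_bundle_path, timeout=10):
    """POST form data while verifying TLS against a specific CA bundle.

    `post_with_ca_bundle` is a hypothetical helper, not part of requests.
    """
    # Fail early with a clear message instead of an opaque TLS error.
    if not os.path.isfile(ca_bundle_path):
        raise FileNotFoundError(f"CA bundle not found: {ca_bundle_path}")
    # `verify` takes the path of a PEM file holding the trusted root certificate(s).
    return requests.post(url, data=data, verify=ca_bundle_path, timeout=timeout)
```

It would be called like `post_with_ca_bundle("https://pastebin.com/<your paste>", pass_data, "<path to exported root>.pem")`. Passing the path per request, rather than setting it once on a session, avoids surprises when a CA bundle environment variable is also set.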

If you are using Tor, it's better to drop that circuit and create a new one.

With a proxy it's complicated: it can be either a certificate issue or the proxy simply being misconfigured or broken.

You can verify that no proxy is used by checking your network settings (OS-specific) and the environment variables that the requests package honors:

  • http_proxy
  • HTTP_PROXY
  • https_proxy
  • HTTPS_PROXY
  • REQUESTS_CA_BUNDLE
  • CURL_CA_BUNDLE
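The environment check can be sketched with the standard library alone. The function name `find_network_overrides` is made up for this example; the variable names are the ones listed above plus `REQUESTS_CA_BUNDLE`, which requests also reads.

```python
import os

# Proxy variables from the list above, plus the CA bundle overrides.
PROXY_VARS = ("http_proxy", "HTTP_PROXY", "https_proxy", "HTTPS_PROXY")
CA_VARS = ("REQUESTS_CA_BUNDLE", "CURL_CA_BUNDLE")


def find_network_overrides(environ=os.environ):
    """Return the proxy/CA variables that are set to a non-empty value."""
    return {
        name: environ[name]
        for name in PROXY_VARS + CA_VARS
        if environ.get(name)
    }


if __name__ == "__main__":
    overrides = find_network_overrides()
    if not overrides:
        print("No proxy or CA bundle variables are set.")
    for name, value in overrides.items():
        print(f"{name}={value}")
```

An empty result only rules out the environment; a proxy configured at the OS level still has to be checked in the network settings.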

Edit: I've just re-checked Pastebin: the RAW text option is only available for unprotected pastes. However, you can get to the HTML version by inspecting the traffic in the browser's network tab and then reproducing it in code: keep the session and replicate the cookies and headers. You should end up with something like this:

import requests as r
ses = r.Session()
cookie = ses.get("https://pastebin.com").cookies["_csrf-frontend"]
# The missing step here is transforming the provided CSRF value the way the
# client-side JS does; that code is "hidden" in the minified jquery.min.js
# (or at least the `POST` is issued by it). Once you have the value, put it
# into the data field.
print(ses.post(
    url='https://pastebin.com/<your paste>',
    headers={
        'User-Agent': "<user agent to hide that it's sent via Requests>",
        'Accept': (
            'text/html'
            ',application/xhtml+xml'
            ',application/xml'
            ';q=0.9,image/webp,*/*;q=0.8'
        ),
        'Accept-Language': 'en-US,en;q=0.5',
        'Content-Type': 'application/x-www-form-urlencoded'
    },
    data=(
        '_csrf-frontend=<JS-manipulated CSRF value>'
        '&is_burn=1'
        '&PostPasswordVerificationForm%5Bpassword%5D=<pass>'
    )
).text)
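One possible way to fill in the CSRF step, under an assumption that has not been verified against Pastebin: that the password page embeds the token in a hidden form input named `_csrf-frontend`, so it can be read from the HTML instead of being reconstructed from JS. The names `HiddenInputFinder`, `find_hidden_input` and `build_form_body` are invented for this sketch, and the sample HTML is made up.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenInputFinder(HTMLParser):
    """Remember the value of the first <input> with a given name."""

    def __init__(self, field_name):
        super().__init__()
        self.field_name = field_name
        self.value = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (tag == "input" and self.value is None
                and attributes.get("name") == self.field_name):
            self.value = attributes.get("value")


def find_hidden_input(html, field_name):
    """Return the input's value, or None when the field is absent."""
    finder = HiddenInputFinder(field_name)
    finder.feed(html)
    return finder.value


def build_form_body(csrf_token, password):
    """URL-encode the same three fields as the captured request above."""
    return urlencode({
        "_csrf-frontend": csrf_token,
        "is_burn": "1",  # copied from the captured request, meaning not verified
        "PostPasswordVerificationForm[password]": password,
    })


if __name__ == "__main__":
    # Made-up markup standing in for the page returned by ses.get(...).
    sample = '<form><input type="hidden" name="_csrf-frontend" value="abc123=="></form>'
    token = find_hidden_input(sample, "_csrf-frontend")
    print(build_form_body(token, "<pass>"))
```

If `find_hidden_input` returns None on the real page, the assumption does not hold and the token has to be tracked down in the JS after all.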

Once the POST succeeds, look for the tag with RAW in it and parse it, either with a quick regex (obligatory "it's a stupid idea" post) or with a less error-prone solution such as BeautifulSoup.
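A dependency-free sketch of the parsing step, using the standard library's `html.parser` in place of BeautifulSoup. It assumes the paste body sits inside a `<textarea>`, which is a guess about the markup and not something checked against Pastebin; `TextareaExtractor` and `extract_paste_text` are names invented here.

```python
from html.parser import HTMLParser


class TextareaExtractor(HTMLParser):
    """Collect the text of the first <textarea> in a document."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "textarea" and not self._done:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "textarea" and self._inside:
            self._inside = False
            self._done = True  # ignore any later textareas

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    @property
    def text(self):
        return "".join(self._chunks)


def extract_paste_text(html):
    """Return the first textarea's text, or '' when there is none."""
    parser = TextareaExtractor()
    parser.feed(html)
    parser.close()
    return parser.text


if __name__ == "__main__":
    # Made-up markup standing in for the response body of the POST.
    sample = '<div><textarea class="textarea">line one\nline two</textarea></div>'
    print(extract_paste_text(sample))
```

If the real page keeps the text in a different element, only the tag name checked in `handle_starttag` and `handle_endtag` needs to change.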

Nevertheless, captchas, IP blacklisting, "clever" CSRF handling and similar measures will eventually stop this kind of scraping. Even if they don't, it's easy to build an application that dynamically changes its class names, tag names, etc. in Angular just to mess with your scraping for the lulz (Google Docs loves this stuff, personal experience). So if you intend to do something serious with it, just use the API.

Edit2: Minor FAQ for scraping / why to use the API

  • If the website doesn't allow scraping or forbids it in its ToS, you should not be doing it. Although people mostly ignore this, it's not smart to do it from a non-anonymous device/IP, especially if there's an idea of making money out of it, because then people start looking (even legally).
  • No, Tor will not work, especially because it's full of captchas once in there.
  • Yes, anyone who is at least a bit capable of reading server logs can figure out what you're doing and block you by IP or User-Agent, or just mess with you by serving random data (did that, was quite fun to see the traffic logs later on :D ).
  • Yes, even VPNs and proxies can be blocked, just like Tor, only they'll log your activity and make you pay.
  • Once Pastebin changes any part of the scraped flow, you get to re-invent it from scratch.
Peter Badida
  • Thank you. I tried with ```verify=False``` and it returns ```Bad Request (#400)``` and ```Unable to verify your data submission.``` Maybe I should try another method. – vantabeam Jul 10 '21 at 21:10
  • @vantabeam That's actually fine, 400 means you can connect to the server and the server simply said back that your body (or url, or headers, or all of them) is not correct. Perhaps try to check the API docs for `Content-Type` being `application/json` (in that case change `data=` to `json=`). – Peter Badida Jul 10 '21 at 21:12
  • @vantabeam [Pastebin's API documentation](https://pastebin.com/doc_api). – Peter Badida Jul 10 '21 at 21:14
  • Actually, I don't know about API now.. so I have to learn about API things first. haha Thank you – vantabeam Jul 10 '21 at 21:17
  • @vantabeam not really, see this example: `requests.get("https://pastebin.com/raw/kmySM61Y").text` - works fine and the content can be retrieved. In fact, if you can retrieve it via the browser, you can retrieve it with Python (or other lang). Just inspect the traffic for headers, cookies and other modifiers of the plain request. – Peter Badida Jul 10 '21 at 21:43
  • Oh, I deleted one of my comments to add an edited comment without knowing you commented. And yes, that works fine! Now I have to think about how to do it in ```requests.post```. Actually, I will be running this code on a "heroku" server in the future. Would that be a problem? – vantabeam Jul 10 '21 at 21:58
  • @vantabeam Check the edit. No clue where the JS code is, but if you're persistent enough, it'll be easy to find. Perhaps just throw all of the JS code to some de-minifier and grab a cup of coffee. Nevertheless, once they pick your IP or if they expect (or suddenly start expecting) captcha on that web you'll be screwed anyway. :-) – Peter Badida Jul 10 '21 at 22:44
  • Thank you so much for your detailed answer! But sadly, I got error again as ```raise SSLError(e, request=request) requests.exceptions.SSLError: HTTPSConnectionPool(host='pastebin.com', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1123)')))``` – vantabeam Jul 10 '21 at 23:41
  • And that's a good idea, getting the HTML version instead of the raw data. But the recurring error keeps getting in my way :( – vantabeam Jul 10 '21 at 23:44
  • That error was due to Kaspersky! I turned it off and tried again. But again, it returned ```
    Bad Request (#400)
    Unable to verify your data submission.
    ```
    – vantabeam Jul 11 '21 at 00:07