
Does Python 3 have a scraping library, other than Selenium, that can handle JavaScript-rendered pages? I'm trying to scrape https://www.mailinator.com/v2/inbox.jsp?zone=public&query=test, but the inbox is loaded with JavaScript. I want to avoid Selenium because I don't want a browser window to open when I run the script.

Here is my non-working code:

import requests
from bs4 import BeautifulSoup as soup
INBOX = "https://www.mailinator.com/v2/inbox.jsp?zone=public&query={}"
def check_inbox(name):
    stuff = soup(requests.get(INBOX.format(name)).text,"html.parser")
    print(stuff.find("ul",{"class":"single_mail-body"}))
check_inbox("retep")

Do any such libraries exist?

A Google search for "python 3 javascript scraper" turned up nothing besides Selenium.

Peter S
  • Possible duplicate of [Web-scraping JavaScript page with Python](https://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python) – Hum4n01d Oct 23 '17 at 21:58
  • @Hum4n01d this is python3, not python. – Peter S Oct 23 '17 at 21:59
  • I don't see why that would make a difference. – Hum4n01d Oct 23 '17 at 22:00
  • different syntax, libraries aren't compatible – Peter S Oct 23 '17 at 22:00
  • Ok, but overall the solution is still going to be the same. You need a library that renders the page with JavaScript before you start scraping. – Hum4n01d Oct 23 '17 at 22:02
  • I would like a *python3.x* library that works. I can't use *python2.x*. – Peter S Oct 23 '17 at 22:03
  • One possible approach is to see if PhantomJS has Python 3 bindings. [This might help](https://stackoverflow.com/questions/13287490/is-there-a-way-to-use-phantomjs-in-python). Your question may benefit though from an explanation as to why you wish to avoid Selenium. – halfer Oct 23 '17 at 22:14
  • since it's websockets, you'll have no luck with phantomjs. – Loïc Oct 23 '17 at 22:23

1 Answer

You don't actually need a JavaScript engine: the script only runs client-side, so you can emulate the requests it makes.

If you inspect the webpage (developer tools > network), you'll see that it opens a websocket connection to:

wss://www.mailinator.com/ws/fetchinbox?zone=public&query=test

[Image: webpage inspection]

If you implement a websocket client in Python, you'll be able to fetch your emails cleanly (see this example: https://github.com/aaugustin/websockets/blob/master/example/client.py).

EDIT:

As mentioned by John, augustin's ws client repo is dead. Today I'd use this instead: https://websockets.readthedocs.io/en/stable/
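
Since the example link no longer resolves, here is a minimal sketch of such a client, assuming the third-party `websockets` package is installed. The names `inbox_url`, `fetch_first_message` and `main` are made up for illustration; only the endpoint URL comes from the network tab above. As the comments below show, the server may answer with a 500 unless extra headers, cookies or a subscription message are sent, so treat this as a starting point rather than a working scraper.

```python
import asyncio
from urllib.parse import urlencode

# Endpoint observed in the browser's network tab (WS filter).
WS_ENDPOINT = "wss://www.mailinator.com/ws/fetchinbox"


def inbox_url(name, zone="public"):
    """Build the websocket URL for one inbox (hypothetical helper)."""
    return WS_ENDPOINT + "?" + urlencode({"zone": zone, "query": name})


async def fetch_first_message(name):
    """Connect to the inbox websocket and return the first frame received."""
    # Third-party dependency (pip install websockets), imported here so the
    # URL helper above stays usable without it.
    import websockets

    async with websockets.connect(inbox_url(name)) as ws:
        return await ws.recv()


def main():
    # Needs network access; the site may reject the bare handshake.
    print(asyncio.run(fetch_first_message("test")))
```

Calling `main()` attempts the connection for the `test` inbox. No browser window is involved, which is what the question asked for.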

Loïc
  • hmm... it's not working for me - `websockets.exceptions.InvalidStatusCode: Status code not 101: 500` – Peter S Oct 23 '17 at 22:17
  • ```
    import websockets, asyncio
    from bs4 import BeautifulSoup as soup
    INBOX = "wss://www.mailinator.com/ws/fetchinbox?zone=public&query=test"
    async def hello():
        async with websockets.connect(INBOX) as ws:
            response = await ws.recv()
            print(response)
    asyncio.get_event_loop().run_until_complete(hello())
    ```
    – Peter S Oct 23 '17 at 22:18
  • that's an internal server error. But it's another topic. I'd suggest you make another question with what you are trying, and why it doesn't work. My guess is, you have to send some headers (maybe the cookies). Also you should look at the source code, and see how they do their websocket connection. Maybe you have to register to a channel. Also, try a bit more than 5 minutes before asking the community ;) – Loïc Oct 23 '17 at 22:18
  • select `WS` filter, that stands for "websockets" – Loïc Oct 23 '17 at 22:25
  • I noticed that one of the headers is changing every time - `Sec-WebSocket-Key`. How would I go about generating one of these? – Peter S Oct 23 '17 at 22:28
  • I'd suggest you look at the source code, and see how it's done in javascript so you can emulate it. But really that'll be my last comment on that topic, open a new question when you have something to show with what you try if needed, link it here if you'd like, i'll give it a look. But then you seem to be on the good path, look for that `Sec-Websocket-Key` – Loïc Oct 23 '17 at 22:30
  • @Loïc the github link is broken, do you have a code snippet you could add to your answer? – Coder Nov 28 '21 at 08:17
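
On the `Sec-WebSocket-Key` question raised in the comments: under RFC 6455 the key is a random 16-byte nonce, base64-encoded, which is why it changes on every connection. Websocket client libraries normally generate it themselves, so it should not need to be copied from the browser. The sketch below shows how the value is formed and how the matching `Sec-WebSocket-Accept` reply is derived. The function names are hypothetical, and nothing here is specific to mailinator.

```python
import base64
import hashlib
import os

# Fixed GUID that RFC 6455 defines for the opening handshake.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def make_key():
    """Return a fresh Sec-WebSocket-Key: 16 random bytes, base64-encoded."""
    return base64.b64encode(os.urandom(16)).decode("ascii")


def expected_accept(key):
    """Return the Sec-WebSocket-Accept value a server should send for key."""
    digest = hashlib.sha1((key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")
```

Because the key is only a handshake nonce, a changing value in the network tab is expected. It is unlikely to be the cause of the 500 response by itself.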