
I'm trying to write a scraper for the OLX website (www.olx.pl) using requests and BeautifulSoup. Most of the data is easy to get, but the phone number is hidden (you have to click it first). I used Chrome's inspector to watch the "Network" tab while clicking it manually. The click fires an AJAX request carrying this query string: "?pt=5d1480fbad0a1f2006e865bfdf7a6fb07f244b82e17ab0ea4c5eaddc43f9da391b098e1926642564ffb781655d55be270c6913f7526a08298f43b24c0169636b". This is the phoneToken, which can be found in the page source (it changes on each page load). I tried sending the same kind of request with the requests library, but the response was "000 000 000". I can get the phone number using Selenium, but it is very slow to load.

The question is: is there a way to get around these phone tokens, or how can I speed up Selenium so that scraping the phone number takes, say, 1-2 seconds?

Ad example: https://www.olx.pl/561666735

EDIT: The response now says that my IP address is blocked (only when using requests; the IP is not blocked when I load the page manually in a browser). Unfortunately I made some changes and can no longer reproduce the code that returned '000 000 000'. This is the relevant part of my current code:

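import requests

# Placeholder headers; the original snippet used a `headers` dict defined elsewhere.
headers = {"User-Agent": "Mozilla/5.0"}

def scrape_phone(ad_id):
    s = requests.Session()
    url = "https://www.olx.pl/{}".format(ad_id)
    response = s.get(url, headers=headers)
    page_text = response.text
    # Short ID: the 5 characters that follow the literal 'id':' in the page source
    index_of_short_id = page_text.index("'id':'")
    short_id = page_text[index_of_short_id:index_of_short_id + 11].split("'")[-1]
    # Phone token: the first single-quoted string after "phoneToken"
    index_of_token = page_text.index("phoneToken")
    phone_token = page_text[index_of_token + 10:index_of_token + 150].split("'")[1]
    url = "https://www.olx.pl/ajax/misc/contact/phone/{}".format(short_id)
    # The browser request carried the token as a query string (?pt=...),
    # so it is sent here via params on a GET rather than as POST form data.
    response = s.get(url, params={"pt": phone_token}, headers=headers)
    print(response.text)

scrape_phone(540006276)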
  • Are you sending cookies in the request? Or maybe the ajax response is encrypted, and some javascript must be run to decipher it? – Bober Nov 06 '19 at 20:18
  • Can you edit the question to add your current code and explain what isn't working, please? – QHarr Nov 06 '19 at 22:37
  • Yes, I guess what @Bober said is correct. Try headless Chrome with Selenium to speed things up a bit: https://stackoverflow.com/questions/53657215/running-selenium-with-headless-chrome-webdriver – Jithin P James Nov 07 '19 at 11:32
