
I have a lot of web scraping to do, so I switched to a headless browser, hoping that would make things faster, but it didn't improve the speed by much.

I looked at this Stack Overflow post, but I don't understand the answer someone wrote: _Is Selenium slow, or is my code wrong?_

Here is my slow code:

# Followed this tutorial: https://medium.com/@stevennatera/web-scraping-with-selenium-and-chrome-canary-on-macos-fc2eff723f9e
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import requests
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.binary_location = '/Applications/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary'
options.add_argument('window-size=800x841')
options.add_argument('headless')
driver = webdriver.Chrome(chrome_options=options)

# Open the search page and type the query into the search box
driver.get('https://poshmark.com/search?')
search_box = driver.find_element_by_xpath('//input[@id="user-search-box"]')

brand = "anthropology"
style = "headband"
search_box.send_keys(' '.join([brand, style]))
search_box.send_keys(Keys.ENTER)  # equivalent of hitting the Enter key

# Grab the results URL from the browser, then re-fetch it with requests
url = driver.current_url
print(url)
response = requests.get(url)
print(response)
print(response.text)

# Use Beautiful Soup to grab the listings
soup = BeautifulSoup(response.content, 'html.parser')

# 'a' as in links, or anchor tags; href holds the hyperlink
hyper_links = [link.get("href") for link in soup.find_all("a")]
# (For a better visual: for link in soup.find_all("a"): print(link.get("href")))

# Keep hrefs containing "listing"; the `listing and` guard skips anchors whose
# href is None, and a set removes the repeated links
clothing_listings = set(listing for listing in hyper_links if listing and "listing" in listing)
print(len(clothing_listings))
print(clothing_listings)

# For some reason a link called "unlike" shows up once per item, so I count
# those instead; this gives the correct number of clothing items for the search
clothing_listings = set(listing for listing in hyper_links if listing and "unlike" in listing)
print(len(clothing_listings))

driver.quit()

Why is it taking so long to scrape things?

Bob
  • Selenium is all about bloatware. If you want something fast, use Python and lxml, or even better: C or Go. The main goal of a headless browser is **not** execution speed but the ability to scrape JS-generated web sites, take screenshots, etc. – Gilles Quénot Mar 31 '18 at 15:46
  • Great!!! Seems you got _headless_ working now, but you haven't responded to my answer on [**trouble running chrome headless browser**](https://stackoverflow.com/questions/49581940/trouble-running-chrome-headless-browser/49582534#49582534) – undetected Selenium Mar 31 '18 at 16:01
  • @DebanjanB Sorry, that was because I posted an answer and someone took it down :/ – Bob Mar 31 '18 at 16:02
  • @GillesQuenot Do you understand the solution from the Stack Overflow link? – Bob Mar 31 '18 at 16:04
  • @Bob Of course. You got a warning message from the Review Team: _While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes._ You never responded to it. – undetected Selenium Mar 31 '18 at 16:06
  • @DebanjanB that is correct but I didn't want to check your answer off as correct because that wasn't how I solved my issue. Does that make sense? – Bob Mar 31 '18 at 16:13
  • @GillesQuenot this link https://stackoverflow.com/questions/17462884/is-selenium-slow-or-is-my-code-wrong – Bob Mar 31 '18 at 16:14
  • @Bob Thanks, that makes sense. – undetected Selenium Mar 31 '18 at 16:15
  • @Bob, do you understand my comment? The big picture behind it? – Gilles Quénot Mar 31 '18 at 16:25
  • @GillesQuenot Sort of. I'm not that code-savvy, so I don't really understand the solutions you suggested, but I understand your point about Selenium. – Bob Apr 01 '18 at 04:03

1 Answer


You're using requests to fetch the URL, so why not use it to accomplish the entire task? The part where you use selenium seems redundant: you merely open the link with it and then use requests to fetch the resulting URL. All you have to do is pass appropriate headers, which you can gather by viewing the network tab of the developer tools in Chrome or Firefox.

rh = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'referer': 'https://poshmark.com/search?',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}

Modify the URL to search for a specific term:

query = 'anthropology headband'
url = 'https://poshmark.com/search?query={}&type=listings&department=Women'.format(query)
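As an aside, if you'd rather not splice the search term into the URL by hand, urlencode from the standard library builds the same query string and encodes the space for you; a small sketch using the same parameters as the hardcoded URL above:

from urllib.parse import urlencode

# Same parameters as the hardcoded URL; urlencode turns the space into '+'
params = {'query': 'anthropology headband', 'type': 'listings', 'department': 'Women'}
url = 'https://poshmark.com/search?' + urlencode(params)
# -> https://poshmark.com/search?query=anthropology+headband&type=listings&department=Women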

And then, use BeautifulSoup. You can also narrow down the links you scrape by using any attribute that's specific to the ones you want. In your case, it's the class attribute with the value covershot-con.

import requests
from bs4 import BeautifulSoup

r = requests.get(url, headers=rh)
soup = BeautifulSoup(r.content, 'lxml')

links = soup.find_all('a', {'class': 'covershot-con'})

Here's the result:

for i in links:
    print(i['href'])

/listing/Anthro-Beaded-Headband-5a78fb899a9455e90aef438e
/listing/NWT-ANTHROPOLOGIE-Twisted-Vines-Crystal-Headband-5abbfb4a07003ad2dc58142f
/listing/Anthropologie-Nicole-Co-White-Floral-Headband-59dea5adeaf0302a5600bc41
/listing/NWT-ANTHROPOLOGIE-Namrata-Spring-Blossom-Headband-5ab5509d72769b52ba31829e
.
.
.
/listing/Anthropologie-By-Lilla-Spiky-Blue-Headband-59064f2ffbf6f90bfb01b854
/listing/Anthropologie-Beaded-Headband-5ab2cfe79d20f01a73ab0ddb
/listing/Anthropologie-Floral-Hawaiian-Headband-59d09eb941b4e0e1710871ec
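Note that these hrefs are relative. If you need absolute URLs (for example, to fetch each listing page with requests), the standard library's urljoin can resolve them against the site root; a small sketch, assuming the listings live under poshmark.com:

from urllib.parse import urljoin

# Prepend the domain to each relative listing path
full_links = [urljoin('https://poshmark.com', i['href']) for i in links]
print(full_links[0])
# e.g. https://poshmark.com/listing/Anthro-Beaded-Headband-5a78fb899a9455e90aef438e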

Edit (Tips):

  1. Use selenium as a last resort (when all other methods fail). As @Gilles Quenot says, selenium is not built for fast execution of web requests.

  2. Learn how to work with the requests library (using headers, passing data, etc.). Their documentation page is more than enough to get started. It'll suffice for most scraping tasks, and it's fast.

  3. Even for pages that require JS execution, you can get by with requests if you can figure out how to execute the JS part using a library like js2py (see the sketch below this list).
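To illustrate tip 3, here's a minimal js2py sketch; the buildQuery function is a made-up stand-in for whatever script a page might ship, not anything Poshmark-specific:

import js2py

# Hypothetical JS, standing in for a script extracted from a page
js_source = """
function buildQuery(brand, style) {
    return brand + ' ' + style;
}
"""

# Evaluate the JS in a context, then call the function from Python
context = js2py.EvalJs()
context.execute(js_source)
print(context.buildQuery('anthropology', 'headband'))  # anthropology headband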

  • Thank you very much! I have a few questions though. 1) Why can't I just query the request as queryParameters={'query':'+'.join([brand,"headband"]),'type':'listings','department':'Women'} response=requests.get(search,params=queryParameters) as opposed to using the rh? When I queried things this way I got a response, but not the HTML I am looking for. I also don't know how to find the appropriate headers in Chrome; where do I look for them? – Bob Apr 03 '18 at 14:54
  • @Bob I tried passing the query parameters as data rather than hardcode them in the URL. It didn't work and I have no idea why. Also, the `rh` isn't related to this. `rh` is merely a `dict` variable (short for request headers) to store the headers that I copied from Chrome's network tab. See this: https://www.mkyong.com/computer-tips/how-to-view-http-headers-in-google-chrome/ –  Apr 03 '18 at 15:08
  • Thank you! Do you know if the request headers are static? Also, are you saying that when you tried queryParameters={'query':'+'.join([brand,"headband"]),'type':'listings','department':'Women'} response=requests.get(search,params=queryParameters) nothing worked? Because I do get a response; it's just that for some reason when I try to get the listings, the count is much lower than it should be. – Bob Apr 03 '18 at 15:13
  • If you meant to ask if they were constant for a *specific* page, then yeah, but probably not for a very long time. If, on the other hand, you meant to ask if these were generic headers that can be passed with request for any page on any site, then no. You'll have to record the activities (also in the network tab - please google it), and then copy the headers that the browser sends. –  Apr 03 '18 at 15:17
  • I meant the first question, because it's a dense dictionary, so I was afraid it would change over time. – Bob Apr 03 '18 at 15:22
  • @Bob passing the parameters with `params` argument rather than hardcoding in the URL gives me the same result as you get: about 25 results, which is less than what my initial solution gave. When I said it didn't work, I passed it via `data` argument, which is probably why it didn't work for me in the beginning. –  Apr 03 '18 at 15:24