
I am trying to scrape content from this page; see the code below. What puzzles me is that if I run the code repeatedly, I keep getting a different list of job locations (and thus reviews), even though the displayed page in my browser is the same. E.g. the first run is correct, but running the script a second time with the same starting URL, the locations "University Village" and "Remote Telework" disappear from the list (and "San Salvador" and "Atlanta" appear, so the list stays the same length).

As far as I can see, there is no "hidden" text, i.e. all of these should be visible (and they are on the first run). What is going on? How can I make sure to grab all the content? (I need to repeat this for a few thousand pages, so I don't want to go through the scraped data manually.)

This question is related, but I don't think it is an IP issue here, since I can get the displayed content in the first iteration.

Edited to add: the code actually skips some reviews, even though those are, as far as I can see, marked up exactly like the ones the code does pick up.

Here is the code (simplified):

import re
import urllib.parse
import urllib.request

from bs4 import BeautifulSoup

list_url = ["http://www.indeed.com/cmp/Microsoft/reviews?fcountry=ALL"]

for url in list_url:
    base_url_parts = urllib.parse.urlparse(url)
    while True:
        raw_html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(raw_html, "lxml")

        # Each review sits in a container whose class contains "cmp-review-container"
        review_tag = {'class': re.compile("cmp-review-container")}
        reviews = soup.find_all(attrs=review_tag)

        job_locations = []

        for r in reviews:
            location = r.find(attrs={'class': "cmp-reviewer-job-location"})
            if location is not None:
                job_location = location.get_text().strip().encode('utf-8')
            else:
                job_location = "."
            job_locations.append(job_location)

        # Zip the data and write the observations to the CSV file (omitted here)

        # Follow the "Next" link in the pagination block, if there is one
        try:
            last_link = soup.find('div', id='company_reviews_pagination').find_all('a')[-1]
            if last_link.text.startswith('Next'):
                next_url_parts = urllib.parse.urlparse(last_link['href'])
                url = urllib.parse.urlunparse((base_url_parts.scheme, base_url_parts.netloc,
                                               next_url_parts.path, next_url_parts.params,
                                               next_url_parts.query, next_url_parts.fragment))
                print(url)
            else:
                break
        except (AttributeError, IndexError):
            break

csvfile.close()  # csvfile is opened in the CSV-writing code omitted above

PS. Sorry if this is not the right place to post this question; let me know of a more appropriate place in this case.


1 Answer

In my opinion, it's related to Ajax requests in your target URL; I can see some XHR-type requests when I visit it.

For an Ajax-heavy website, "what the user sees" and "what the crawler sees" are quite different. urllib or requests only fetches the HTML delivered on the initial page load, so any content filled in later by JavaScript will be missing.

If you want to crawl a website that relies on Ajax requests, I recommend using CasperJS, which is built on PhantomJS. It mimics how a person visits a website and will wait until all the data you need has loaded before doing further work. It can also be driven from Python, please check here :)

====== UPDATE ======

I've added another link, scraping-with-python-selenium-and-phantomjs, which covers using PhantomJS and BeautifulSoup together and may be useful in some cases.
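
To make that concrete, here is a minimal sketch of driving PhantomJS from Python via Selenium and handing the rendered HTML to BeautifulSoup. This is only an illustration, not the linked answer's exact code; it assumes Selenium, BeautifulSoup and a phantomjs binary are installed, and it reuses the class names from the question:

# Sketch: let PhantomJS execute the page's JavaScript, then parse the rendered
# HTML with BeautifulSoup (assumes `pip install selenium beautifulsoup4` and a
# phantomjs executable on the PATH).
import re

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

url = "http://www.indeed.com/cmp/Microsoft/reviews?fcountry=ALL"

driver = webdriver.PhantomJS()
driver.get(url)

# Wait until at least one review container exists in the rendered DOM, so that
# Ajax-loaded reviews are not missed.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "cmp-review-container")))

# From here on, the question's BeautifulSoup parsing can stay unchanged.
soup = BeautifulSoup(driver.page_source, "lxml")
reviews = soup.find_all(attrs={'class': re.compile("cmp-review-container")})
print(len(reviews))

driver.quit()

The key difference from plain urllib is that page_source is read only after the explicit wait, i.e. after the JavaScript has populated the review containers.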

  • Thanks. Is there any way to use CasperJS to open and read the url, yet keep the processing part beautifulsoup-based? Or would I get the same results if I were to use ghost.py (again keeping the main part of the code unchanged)? I'm a beginner and starting all over is, mmm, intimidating? :) – anne_t Jul 16 '16 at 12:45
  • Hi anne_t, in my opinion, the general way to use bs and casperjs together is to use a separate process (like Popen) to call the casperjs script from inside your .py, and then use bs to deal with the HTML downloaded by casperjs (a rough sketch of this pattern follows these comments). Besides, I have edited my answer to include a link that uses PhantomJS and bs together. I am not sure about the ghost.py part, I haven't dealt with it before :) – linpingta Jul 17 '16 at 10:28
  • Thanks! I'll definitely look it up! – anne_t Jul 18 '16 at 13:29
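
A rough sketch of that Popen pattern, for reference. The CasperJS script name (save_page.js) and its behaviour (loading the URL, waiting for the Ajax content, and printing the final HTML to stdout) are assumptions made purely for illustration:

# Sketch: call a CasperJS script in a subprocess and parse the HTML it prints.
# "save_page.js" is a hypothetical CasperJS script that renders the page and
# writes the resulting HTML to stdout.
import subprocess

from bs4 import BeautifulSoup

url = "http://www.indeed.com/cmp/Microsoft/reviews?fcountry=ALL"

proc = subprocess.Popen(["casperjs", "save_page.js", url],
                        stdout=subprocess.PIPE)
rendered_html, _ = proc.communicate()

# The BeautifulSoup processing from the question can then stay unchanged.
soup = BeautifulSoup(rendered_html, "lxml")
print(soup.title)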