I am trying to scrape the contents of this page; see the code below. Curiously, if I run the code repeatedly I get a different list of job locations (and thus of reviews) each time, even though the page displayed in my browser stays the same. For example, the first run is correct, but when I run the script a second time with the same starting URL, the locations "University Village" and "Remote Telework" disappear from the list (and "San Salvador" and "Atlanta" appear, so the list keeps the same length).
As far as I can see there is no "hidden" text, i.e. all of these locations should be visible (and they are in the first run). What is going on, and how can I make sure I grab all of the content? I need to repeat this for a few thousand pages, so I don't want to check the scraped data manually.
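To make the difference concrete, fetching the same URL twice in a row and comparing the extracted locations already shows it. Here is a minimal sketch of that check (it uses the same class names as the full code further down):

import re
import urllib.request
from bs4 import BeautifulSoup

url = "http://www.indeed.com/cmp/Microsoft/reviews?fcountry=ALL"

def get_locations(url):
    # Same selectors as in the full script below
    soup = BeautifulSoup(urllib.request.urlopen(url).read(), "lxml")
    reviews = soup.find_all(attrs={'class': re.compile("cmp-review-container")})
    locations = []
    for r in reviews:
        tag = r.find(attrs={'class': "cmp-reviewer-job-location"})
        locations.append(tag.get_text().strip() if tag is not None else ".")
    return locations

first = get_locations(url)
second = get_locations(url)
# Ideally this is empty; instead it shows locations such as
# "University Village" / "Remote Telework" vs. "San Salvador" / "Atlanta"
print(sorted(set(first) ^ set(second)))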
This question is related, but I don't think it is an IP issue here, since I do get the displayed content on the first run.
Edited to add: the code also skips some reviews entirely, even though those reviews are, as far as I can tell, marked up exactly like the ones the code does pick up.
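In case it is relevant, this is roughly how I check whether the skipped reviews sit in containers with a different class than the ones my regex matches (a simplified sketch):

import re
import urllib.request
from bs4 import BeautifulSoup

url = "http://www.indeed.com/cmp/Microsoft/reviews?fcountry=ALL"
soup = BeautifulSoup(urllib.request.urlopen(url).read(), "lxml")

# How many containers the regex actually matches on this page
matched = soup.find_all(attrs={'class': re.compile("cmp-review-container")})
print(len(matched), "containers matched")

# Every distinct class attribute on the page that mentions "review",
# to spot container variants the regex might be missing
review_classes = {" ".join(tag.get("class", []))
                  for tag in soup.find_all(True)
                  if any("review" in c for c in tag.get("class", []))}
for cls in sorted(review_classes):
    print(cls)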
Here is the code (simplified):
import re
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

list_url = ["http://www.indeed.com/cmp/Microsoft/reviews?fcountry=ALL"]

for url in list_url:
    base_url_parts = urllib.parse.urlparse(url)
    while True:
        raw_html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(raw_html, "lxml")

        # Collect every review container on the current page
        review_tag = {'class': re.compile("cmp-review-container")}
        reviews = soup.find_all(attrs=review_tag)

        job_locations = []
        for r in reviews:
            if r.find(attrs={'class': "cmp-reviewer-job-location"}) is not None:
                job_location = r.find(attrs={'class': "cmp-reviewer-job-location"}).get_text().strip().encode('utf-8')
            else:
                job_location = "."
            job_locations.append(job_location)

        # Zip the data and write the observations to the CSV file
        # (the CSV setup is omitted in this simplified version)

        # Follow the "Next" pagination link, keeping the original scheme and host
        try:
            last_link = soup.find('div', id='company_reviews_pagination').find_all('a')[-1]
            if last_link.text.startswith('Next'):
                next_url_parts = urllib.parse.urlparse(last_link['href'])
                url = urllib.parse.urlunparse((base_url_parts.scheme, base_url_parts.netloc,
                                               next_url_parts.path, next_url_parts.params,
                                               next_url_parts.query, next_url_parts.fragment))
                print(url)
            else:
                break
        except:
            break

csvfile.close()  # csvfile is opened earlier (omitted in this simplified version)
PS: Sorry if this is not the right place to post this question; if so, please point me to a more appropriate one.