My goal is, given a Quizlet link (or the ID of the set of flashcards I want to access), to retrieve the text of the flashcards. I could have done this with their API, but it seems to be non-existent now. I could also pretty easily web-scrape them, but I worry this would break their TOS and/or possibly result in an IP ban. Is there any other way to access the data, or is web-scraping the only way?
-
What have you tried so far? `requests.get(...).text` using Python's `requests` library can be a good start. – DaveIdito Sep 15 '20 at 22:23 -
@DaveIdito well, I would have no issues web-scraping it; that would be easy for me. However, a lot of websites will block your IP if you scrape without their consent, so I was wondering if anyone knew how Quizlet expected us to access the data. – mizuprogrammer Sep 15 '20 at 22:41
2 Answers
There's no single silver bullet for this problem (see this answer for how many potential ways there are to stop web-scraping attempts). But here are some potential solutions, in increasing order of difficulty.
1. Use proper HTTP user-agent
Here's a PIP package that can help you manage it.
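As a rough sketch (the user-agent strings and the `make_headers` helper below are illustrative, not from the answer; a package can supply a much larger pool), rotating the User-Agent header per request might look like:

```python
import random

# A small, hand-picked pool of common desktop User-Agent strings.
# In practice a package can maintain this list for you.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0",
]

def make_headers():
    """Build request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}

# Usage (not executed here):
#   requests.get(url, headers=make_headers())
```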
2. Add some randomness to when successive requests are sent
Instead of running, say, `requests.get(<url>)` in a tight while loop or even across multiple processes/threads, add a `time.sleep(<some random time>)` between requests.
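A minimal sketch of this, using a hypothetical `random_delay` helper:

```python
import random
import time

def random_delay(low=1.0, high=5.0):
    """Sleep for a random interval (in seconds) and return how long we slept."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay

# In a scraping loop (hypothetical `urls` iterable):
#   for url in urls:
#       random_delay()
#       page = requests.get(url).text
```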
3. Simulate a real browser
You can use a WebDriver, which will run and render the scraped page as if it were loaded in a real browser (Chromium, Firefox, etc.). You can even use it in headless mode; Python's Selenium would be one potential choice. This way, if the JavaScript itself is trying to thwart your web-scraping attempts (for instance, a React-rendered page or the Google Webstore), you won't have to worry about such things at all.
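A sketch of the Selenium approach (the `fetch_rendered` name and the options are my own choices, not from the answer; it assumes `selenium` and a matching chromedriver are installed):

```python
def fetch_rendered(url):
    """Fetch a page after its JavaScript has run, using headless Chrome.

    Assumes `pip install selenium` and a chromedriver on the PATH.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    opts = Options()
    opts.add_argument("--headless")  # no visible browser window
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get(url)
        return driver.page_source  # HTML after JS rendering
    finally:
        driver.quit()
```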
4. Get tons of IPs
You can buy proxy IP addresses. This would be the most fool-proof method and would be pretty difficult (or at least painful) for a public web service to block.
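A sketch of how such a pool plugs into `requests` (the proxy endpoints and `proxy_config` helper are hypothetical placeholders):

```python
import random

# Hypothetical pool of purchased proxy endpoints.
PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
]

def proxy_config():
    """Pick one proxy and route both HTTP and HTTPS through it,
    in the mapping format that requests expects."""
    proxy = random.choice(PROXIES)
    return {"http": proxy, "https": proxy}

# Usage (not executed here):
#   requests.get(url, proxies=proxy_config(), timeout=10)
```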
Or, combine two or more of the above. From personal experience, I've never found a single web service that could fully stop web-scraping attempts. BUT, in my use cases, I'd be very careful about the legal and ethical concerns.

Use headers if you are using Python's `requests`. They will often get you through the blocking.
Example :
import requests

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'max-age=0',
    'cookie': 'yourcookie',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 12239.92.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.136 Safari/537.36',
}

response = requests.get('https://quizlet.com/173246204/mgmt-final-exam-flash-cards/', headers=headers)
text = response.text
print(text)
