So this time in my scraping escapades I've encountered a new foe: a website that deters scrapers by rendering the price data everyone would like to scrape as SVG images. A simple question: what is the "preferred" tool or method for scraping such a site continuously? I thought of taking full-page screenshots with Selenium (with stealth, since the site also has Cloudflare scraper detection) and OCR'ing them with Tesseract, but the screenshot step alone takes about 7 seconds per page, and I have 180 pages to scrape. That isn't completely unworkable, but it is below expectations, so to speak.

My question is, what general methods, techniques or tools should I be looking at to tackle this task? Is there a way to OCR the SVGs directly on the site, without downloading them or taking screenshots? Or what else should I be looking at?

For reference, an example of what I'm trying to scrape is https://www.goatbots.com/set/kaldheim (the "buy" and "sell" columns).

Entman

1 Answer

You could try taking screenshots of only the price elements instead of a full-page screenshot. Check this post for partial screenshots.
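A minimal sketch of the element-only screenshot idea, assuming Selenium 4 and an already-created `driver`. The CSS selector you pass in is hypothetical (I have not inspected the site's markup), and `screenshot_name` and `save_element_screenshots` are names made up for this example.

```python
from pathlib import Path


def screenshot_name(index: int, prefix: str = "price") -> str:
    """Build a zero-padded file name so the crops sort in page order."""
    return f"{prefix}_{index:04d}.png"


def save_element_screenshots(driver, css_selector: str, out_dir: str) -> list:
    """Save one PNG per matching element instead of one full-page screenshot."""
    # Imported lazily so screenshot_name() works without Selenium installed.
    from selenium.webdriver.common.by import By

    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)

    saved = []
    elements = driver.find_elements(By.CSS_SELECTOR, css_selector)
    for index, element in enumerate(elements):
        path = target / screenshot_name(index)
        # screenshot_as_png returns the PNG bytes of just this element.
        path.write_bytes(element.screenshot_as_png)
        saved.append(path)
    return saved
```

You would call it as something like `save_element_screenshots(driver, "td.price svg", "crops")`, where the selector is a placeholder you need to replace with whatever actually matches the price cells.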

As for OCR, Tesseract is the best free option.
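A rough sketch of OCR'ing one small price image, assuming the `pytesseract` and `Pillow` packages plus a local Tesseract install. The config string is my assumption about what suits a short numeric crop (single-line page segmentation, digits-only whitelist; the whitelist may be ignored by some Tesseract 4.0 builds), and `parse_price` and `ocr_price` are hypothetical helper names.

```python
import re
from typing import Optional

# --psm 7 treats the image as a single text line; the whitelist limits
# the recognised characters to digits and the decimal point.
TESSERACT_CONFIG = "--psm 7 -c tessedit_char_whitelist=0123456789."


def parse_price(text: str) -> Optional[float]:
    """Pull the first decimal number out of raw OCR text, or None if absent."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


def ocr_price(png_path: str) -> Optional[float]:
    """Run Tesseract on one cropped price image and parse the result."""
    # Imported lazily so parse_price() works without the OCR stack installed.
    import pytesseract
    from PIL import Image

    with Image.open(png_path) as image:
        raw = pytesseract.image_to_string(image, config=TESSERACT_CONFIG)
    return parse_price(raw)
```

Small crops sometimes OCR better after upscaling, so if the results are noisy it may be worth resizing the image before passing it to Tesseract.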

For Cloudflare, use undetected-chromedriver for Python, which is very successful at bypassing Cloudflare.
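A minimal sketch with the `undetected-chromedriver` package, reusing one browser session for all pages so the startup cost is paid only once. The URL pattern is taken from the example link in the question; the set names, the delay, and the `handle_page` callback are assumptions for illustration, and whether this gets past Cloudflare on this particular site is not guaranteed.

```python
import time


def build_set_urls(set_names, base="https://www.goatbots.com/set/"):
    """Turn set names into page URLs, following the question's example link."""
    return [base + name for name in set_names]


def fetch_pages(urls, handle_page, delay_seconds=1.0):
    """Visit each URL in one browser session and hand the driver to a callback."""
    # Imported lazily so build_set_urls() works without the package installed.
    import undetected_chromedriver as uc

    driver = uc.Chrome()
    try:
        for url in urls:
            driver.get(url)
            handle_page(driver)        # e.g. screenshot the price elements
            time.sleep(delay_seconds)  # small pause between page loads
    finally:
        driver.quit()
```

For example, `fetch_pages(build_set_urls(["kaldheim"]), lambda d: print(d.title))` would open the page from the question and print its title.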