DO NOT USE THIS TAG. It is under active cleanup: https://meta.stackoverflow.com/q/305314 Use [web-scraping] if your question is about scraping information from web resources (there is also [screen-scraping]), or [pdf-scraping] if it is about scraping information from PDF files. Use [data-extraction] if you need to extract data from other resources.
Questions tagged [scrape]
1204 questions
51 votes, 5 answers
Reading data from PDF files into R
Is that even possible!?!
I have a bunch of legacy reports that I need to import into a database. However, they're all in pdf format. Are there any R packages that can read pdf? Or should I leave that to a command line tool?
The reports were made…

Justin (42,475 reputation; badges: 9, 93, 111)

51 votes, 3 answers
Extract / Identify Tables from PDF python
Are there any open source libraries that support table identification & extraction?
By this I mean:
Identify a table structure exists
Classify the table from its contents
Extract data from the table in a useful output format e.g. JSON / CSV…

Alexander McFarlane (10,643 reputation; badges: 9, 59, 100)

46 votes, 3 answers
Parse Web Site HTML with JAVA
I want to parse a simple web site and scrape information from that web site.
I used to parse XML files with DocumentBuilderFactory. I tried to do the same thing for the HTML file, but it always gets into an infinite loop.
URL url = new…

CanCeylan (2,890 reputation; badges: 8, 41, 51)

30 votes, 1 answer
curl 302 redirect not working (command line)
In the browser, navigating to this URL initiates a 302 (moved temporarily) request which in turn downloads a file.
http://www.targetsite.com/target.php/?event=download&task_id=123
When I view what is actually happening via Chrome network tools I…

user2029890 (2,493 reputation; badges: 6, 34, 65)

18 votes, 2 answers
Wget Mirror HTML only
I have a small website that I am trying to mirror to my local machine, keeping only the HTML files: no images, attached files, PDFs, etc.
I have never mirrored a website before and think it would be a good idea to ask the question before doing anything…

B.Mr.W. (18,910 reputation; badges: 35, 114, 178)

16 votes, 3 answers
Scrapy, only follow internal URLS but extract all links found
I want to get all external links from a given website using Scrapy. Using the following code the spider crawls external links as well:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from…

sboss (957 reputation; badges: 1, 7, 21)

16 votes, 1 answer
Scrape / eavesdrop AJAX data using JavaScript?
Is it possible to use JavaScript to scrape all the changes to a webpage that is being updated live with AJAX? The site I wish to scrape updates data using AJAX every second and I want to grab all the changes. This is an auction website and several…

user1885715 (191 reputation; badges: 2, 8)

15 votes, 1 answer
CheerioJS, looping through with same class name
I'm trying to loop through each
- and get the value of each
- . The thing is, it only takes the first
- and skips the rest.
HTML
- tip11
- tip12 …

Sobiaholic (2,927 reputation; badges: 9, 37, 54)

13 votes, 2 answers
How to scrape JSON from puppeteer?
I login to a site and it gives a browser cookie.
I go to a URL and it is a json response.
How do I scrape the page after entering await page.goto('blahblahblah.json');?

Amy Coin (141 reputation; badges: 1, 1, 4)

13 votes, 2 answers
Find next siblings until a certain one using beautifulsoup
The webpage is something like this:
section1
article
article
article
section2
article
article
article
How can I find each section with articles within them? That is, after finding h2,…

user1550725 (177 reputation; badges: 1, 1, 6)

12 votes, 2 answers
simple script to check if a webpage has been updated
There is some information that I am waiting for on a website. I do not wish to check it every hour. I want a script that will do this for me and notify me if this website has been updated with the keyword that I am looking for.

Programmer (421 reputation; badges: 1, 7, 21)

12 votes, 1 answer
rvest - scrape 2 classes in 1 tag
I am new to rvest. How do I extract those elements with 2 class names or only 1 class name in tag?
This is my code and issue:
doc <- paste("",
"",
" text1 ",
"

addicted (2,901 reputation; badges: 3, 28, 49)

12 votes, 3 answers
Facebook Object Debugger - Could not resolve the hostname into a valid IP address
There is a problem with how Facebook scrapes my page for meta data.
When I use the Facebook object debugger I get the following error:
I am quite sure this has something to do with how my DNS records are defined. It seems the scraper can't even…

Yaron Levi (12,535 reputation; badges: 16, 69, 118)

11 votes, 2 answers
Get IP addresses from udp and http torrent tracker response
I am trying to get the peer-list: list of IP addresses from a torrent tracker
Similar to the question here: how to get the peer list from torrent tracker response
I wrote code which decodes the torrent files using the python bencode Bit-torrent…

Saher Ahwal (9,015 reputation; badges: 32, 84, 152)

10 votes, 2 answers
Scrapy Body Text Only
I am trying to scrape only the text from the body using Python Scrapy, but haven't had any luck yet.
Wishing some scholars might be able to help me here with scraping all the text from the <body> tag.

mmrs151 (3,924 reputation; badges: 2, 34, 38)