Questions tagged [scrape]

DO NOT USE THIS TAG. It is under active cleanup: https://meta.stackoverflow.com/q/305314. Use [web-scraping] if your question is about scraping information from web resources (there is also [screen-scraping]), or use [pdf-scraping] if your question is about scraping information from PDF files. Use [data-extraction] if you need to extract data from other resources.

1204 questions

5 answers

Reading data from PDF files into R

Is that even possible!?! I have a bunch of legacy reports that I need to import into a database. However, they're all in pdf format. Are there any R packages that can read pdf? Or should I leave that to a command line tool? The reports were made…

asked Feb 07 '12 at 23:46

Justin

42,475
9
93
111

3 answers

Extract / Identify Tables from PDF python

Are there any open source libraries that support table identification & extraction? By this I mean: Identify a table structure exists Classify the table from its contents Extract data from the table in a useful output format e.g. JSON / CSV…

python pdf scrape pdf-parsing pdf-scraping

asked Feb 16 '15 at 00:04

Alexander McFarlane

10,643
9
59
100

3 answers

Parse Web Site HTML with JAVA

I want to parse a simple web site and scrape information from it. I used to parse XML files with DocumentBuilderFactory; I tried to do the same thing for the HTML file, but it always gets into an infinite loop. URL url = new…

java html scrape

asked Jan 30 '12 at 22:11

CanCeylan

2,890
8
41
51

1 answer

curl 302 redirect not working (command line)

In the browser, navigating to this URL initiates a 302 (moved temporarily) request which in turn downloads a file. http://www.targetsite.com/target.php/?event=download&task_id=123 When I view what is actually happening via Chrome network tools I…

bash curl scrape

asked Jan 03 '14 at 13:53

user2029890

2,493
6
34
65

2 answers

Wget Mirror HTML only

I have a small website that I try to mirror to my local machine with only the html file, no images, image attach files... pdf, ..etc. I have never mirrored a website before and think it would be a good idea to ask the question before doing anything…

wget scrape mirror

asked Aug 29 '13 at 16:34

B.Mr.W.

18,910
35
114
178

3 answers

Scrapy, only follow internal URLS but extract all links found

I want to get all external links from a given website using Scrapy. Using the following code the spider crawls external links as well: from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors import LinkExtractor from…

python web-crawler scrape scrapy

asked Jan 15 '15 at 13:22

sboss
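
The distinction this question draws (follow only internal links, but record every link found) can be illustrated without Scrapy. The following is a minimal standard-library sketch, not the asker's spider and not Scrapy's API; the names `LinkCollector` and `split_links` and the example hosts are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def split_links(html, base_url):
    """Return (internal, external) lists of absolute http(s) URLs.

    A link counts as internal when its host equals the host of base_url
    (an assumption; subdomains are treated as external here).
    """
    parser = LinkCollector()
    parser.feed(html)
    base_host = urlparse(base_url).netloc
    internal, external = [], []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if parsed.netloc == base_host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external
```

A crawler built on this idea would queue only the `internal` list for further requests while saving both lists as output.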

1 answer

Scrape / eavesdrop AJAX data using JavaScript?

Is it possible to use JavaScript to scrape all the changes to a webpage that is being updated live with AJAX? The site I wish to scrape updates data using AJAX every second and I want to grab all the changes. This is an auction website and several…

javascript ajax google-chrome-extension hook scrape

asked Dec 07 '12 at 14:27

user1885715

1 answer

CheerioJS, looping through … with same class name

I'm trying to loop through each … The thing is, it only takes the first…

2 answers

How to scrape JSON from puppeteer?

I login to a site and it gives a browser cookie. I go to a URL and it is a json response. How do I scrape the page after entering await page.goto('blahblahblah.json'); ?

node.js scrape puppeteer

asked Jan 29 '18 at 22:54

Amy Coin

2 answers

Find next siblings until a certain one using beautifulsoup

The webpage is something like this:

section1

article

section2

article

How can I find each section with articles within them? That is, after finding h2,…

python find beautifulsoup scrape siblings

asked Jul 25 '12 at 10:05

user1550725

2 answers

simple script to check if a webpage has been updated

There is some information that I am waiting for on a website. I do not wish to check it every hour. I want a script that will do this for me and notify me if this website has been updated with the keyword that I am looking for.

bash web scrape

asked Jan 31 '12 at 17:53

Programmer
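
One way to approach the check described in this question is a small script run on a schedule (for example from cron). The sketch below uses Python's standard library instead of bash. The function names and the injectable `fetcher` parameter are assumptions made for illustration, and the notification step is left to the caller.

```python
import urllib.request


def contains_keyword(text, keyword):
    """Case-insensitive check for keyword in the page text."""
    return keyword.lower() in text.lower()


def fetch(url, timeout=10):
    """Download a page and decode it as UTF-8, replacing undecodable bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def check_once(url, keyword, fetcher=fetch):
    """Return True if the keyword currently appears on the page.

    fetcher can be swapped out (e.g. in tests) to avoid network access.
    """
    return contains_keyword(fetcher(url), keyword)
```

A scheduled job could call `check_once(url, keyword)` and send a notification only when it returns `True`.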

1 answer

rvest - scrape 2 classes in 1 tag

I am new to rvest. How do I extract those elements with 2 class names or only 1 class name in tag? This is my code and issue: doc <- paste("", "", " text1 ", "

html r web-scraping scrape rvest

asked Aug 02 '17 at 03:30

addicted

2,901
3
28
49

3 answers

Facebook Object Debugger - Could not resolve the hostname into a valid IP address

There is a problem with how Facebook scrapes my page for meta data. When I use the Facebook object debugger I get the following error: I am quite sure this has something to do with how my DNS records are defined. It seems the scraper can't even…

facebook facebook-graph-api dns web-scraping scrape

asked Nov 18 '14 at 17:42

Yaron Levi

12,535
16
69
118

2 answers

Get IP addresses from udp and http torrent tracker response

I am trying to get the peer-list: list of IP addresses from a torrent tracker Similar to the question here: how to get the peer list from torrent tracker response I wrote code which decodes the torrent files using the python bencode Bit-torrent…

python scrape bittorrent tracker

asked Nov 13 '13 at 19:29

Saher Ahwal

9,015
32
84
152
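
For the peer-list question above, trackers commonly return IPv4 peers in the "compact" form: 6 bytes per peer, 4 for the address and 2 for the port, both in network byte order. The sketch below decodes such a byte string. It assumes the bencoded HTTP response has already been decoded and its `peers` value extracted, or, for a UDP announce response, that the 20-byte header has already been stripped. The function name `decode_compact_peers` is hypothetical.

```python
import socket
import struct


def decode_compact_peers(blob):
    """Decode a compact IPv4 peer list into (ip, port) tuples.

    blob must be a bytes object whose length is a multiple of 6.
    """
    if len(blob) % 6 != 0:
        raise ValueError("compact peer list length must be a multiple of 6")
    peers = []
    for offset in range(0, len(blob), 6):
        # "!4sH": network byte order, 4 raw address bytes, unsigned 16-bit port
        ip_bytes, port = struct.unpack_from("!4sH", blob, offset)
        peers.append((socket.inet_ntoa(ip_bytes), port))
    return peers
```

If the tracker returns the non-compact form instead (a list of dictionaries), this function does not apply and the `ip` and `port` keys can be read directly.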

2 answers

Scrapy Body Text Only

I am trying to scrape the text only from body using python Scrapy, but haven't had any luck yet. Wishing some scholars might be able to help me here scraping all the text from the <body> tag.

python scrapy scrape scraper

asked Mar 22 '11 at 10:59

mmrs151

3,924
2
34
38
