Questions tagged [scrape]

DO NOT USE THIS TAG. It is under active cleanup (see https://meta.stackoverflow.com/q/305314 ). Use [web-scraping] if your question is about scraping information from web resources (there is also [screen-scraping]), or use [pdf-scraping] if your question is about scraping information from PDF files. Use [data-extraction] if you need to extract data from other resources.

1204 questions
51
votes
5 answers

Reading data from PDF files into R

Is that even possible!?! I have a bunch of legacy reports that I need to import into a database. However, they're all in pdf format. Are there any R packages that can read pdf? Or should I leave that to a command line tool? The reports were made…
Justin
  • 42,475
  • 9
  • 93
  • 111
51
votes
3 answers

Extract / Identify Tables from PDF python

Are there any open source libraries that support table identification & extraction? By this I mean: identify that a table structure exists; classify the table from its contents; extract data from the table in a useful output format, e.g. JSON / CSV…
Alexander McFarlane
  • 10,643
  • 9
  • 59
  • 100
46
votes
3 answers

Parse Web Site HTML with JAVA

I want to parse a simple web site and scrape information from that web site. I used to parse XML files with DocumentBuilderFactory. I tried to do the same thing for the HTML file, but it always gets into an infinite loop. URL url = new…
CanCeylan
  • 2,890
  • 8
  • 41
  • 51
30
votes
1 answer

curl 302 redirect not working (command line)

In the browser, navigating to this URL initiates a 302 (moved temporarily) request which in turn downloads a file. http://www.targetsite.com/target.php/?event=download&task_id=123 When I view what is actually happening via Chrome network tools I…
user2029890
  • 2,493
  • 6
  • 34
  • 65
18
votes
2 answers

Wget Mirror HTML only

I have a small website that I am trying to mirror to my local machine with only the HTML files: no images, attached files, PDFs, etc. I have never mirrored a website before and think it would be a good idea to ask the question before doing anything…
B.Mr.W.
  • 18,910
  • 35
  • 114
  • 178
16
votes
3 answers

Scrapy, only follow internal URLS but extract all links found

I want to get all external links from a given website using Scrapy. Using the following code the spider crawls external links as well: from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors import LinkExtractor from…
sboss
  • 957
  • 1
  • 7
  • 21
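
One way to picture the problem in this question, independent of Scrapy: collect every link on a page, then split the links by whether their host matches the site being crawled. The following is a minimal standard-library sketch of that classification step only. It is not Scrapy code, and the names `LinkCollector` and `split_links` as well as the example URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag (hypothetical helper, not part of Scrapy)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def split_links(html, base_url):
    """Return (internal, external) lists of absolute URLs found in html.

    Assumption: a link is internal when its host equals the host of base_url,
    so subdomains count as external in this sketch.
    """
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc.lower()
    internal, external = [], []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if parsed.netloc.lower() == base_host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external


sample = (
    '<a href="/about">About</a> '
    '<a href="https://other.example/x">X</a> '
    '<a href="mailto:a@b.c">Mail</a>'
)
internal, external = split_links(sample, "https://site.example/index.html")
print(internal)
print(external)
```

A crawler built on this idea would queue only the `internal` list for further requests while still recording everything in `external`.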
16
votes
1 answer

Scrape / eavesdrop AJAX data using JavaScript?

Is it possible to use JavaScript to scrape all the changes to a webpage that is being updated live with AJAX? The site I wish to scrape updates data using AJAX every second and I want to grab all the changes. This is an auction website and several…
user1885715
  • 191
  • 2
  • 8
15
votes
1 answer

CheerioJS, looping through

I'm trying to loop through each <ul> and get the value of each <li>. The thing is, it only takes the first…
13
votes
2 answers

How to scrape JSON from puppeteer?

I log in to a site and it gives me a browser cookie. I go to a URL and it returns a JSON response. How do I scrape the page after entering await page.goto('blahblahblah.json'); ?
Amy Coin
  • 141
  • 1
  • 1
  • 4
13
votes
2 answers

Find next siblings until a certain one using beautifulsoup

The webpage is something like this: an h2 heading "section1" followed by three "article" elements, then an h2 heading "section2" followed by three more "article" elements. How can I find each section with articles within them? That is, after finding h2,…
user1550725
  • 177
  • 1
  • 1
  • 6
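
The grouping this question asks for (everything after one heading, up to the next heading) can be sketched without BeautifulSoup. The example below swaps in Python's standard `html.parser` and assumes that sections are `<h2>` elements and that each article is a `<p>` element; the `<p>` tag is an assumption, since the excerpt does not show the article markup. `SectionGrouper` and `group_by_heading` are hypothetical names.

```python
from html.parser import HTMLParser


class SectionGrouper(HTMLParser):
    """Groups <p> texts under the most recent <h2> (illustrative sketch)."""

    def __init__(self):
        super().__init__()
        self.sections = []        # list of (heading, [article texts])
        self._current_tag = None  # "h2" or "p" while inside one of them
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "p"):
            self._current_tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current_tag:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current_tag:
            return  # closing tag of some nested element, keep collecting
        text = "".join(self._buffer).strip()
        if tag == "h2":
            self.sections.append((text, []))    # a new section starts here
        elif self.sections:                     # ignore <p> before the first <h2>
            self.sections[-1][1].append(text)
        self._current_tag = None


def group_by_heading(html):
    parser = SectionGrouper()
    parser.feed(html)
    return parser.sections


page = (
    "<h2>section1</h2><p>article</p><p>article</p><p>article</p>"
    "<h2>section2</h2><p>article</p><p>article</p><p>article</p>"
)
sections = group_by_heading(page)
for heading, articles in sections:
    print(heading, articles)
```

The same "stop at the next heading" rule is what a sibling-walking loop in any HTML library would need to implement.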
12
votes
2 answers

simple script to check if a webpage has been updated

There is some information that I am waiting for on a website. I do not wish to check it every hour. I want a script that will do this for me and notify me if this website has been updated with the keyword that I am looking for.
Programmer
  • 421
  • 1
  • 7
  • 21
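
A rough sketch of the check described in this question: fingerprint the page text, compare it with the previous fingerprint, and look for the keyword. Fetching the page (for example with `urllib.request.urlopen`), scheduling, and notification are deliberately left out; the function names `fingerprint` and `check_page` and the sample strings are hypothetical.

```python
import hashlib


def fingerprint(page_text):
    """Return a stable hash of the page text, used to detect any change."""
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()


def check_page(page_text, keyword, previous_fingerprint=None):
    """Return (changed, keyword_found, current_fingerprint).

    changed is False on the first run, when there is nothing to compare against.
    The keyword match is case-insensitive.
    """
    current = fingerprint(page_text)
    changed = previous_fingerprint is not None and current != previous_fingerprint
    found = keyword.lower() in page_text.lower()
    return changed, found, current


# First run: remember the fingerprint; later runs: compare against it.
first = check_page("Results pending", "published")
second = check_page("Results Published today", "published", first[2])
print(first[:2])
print(second[:2])
```

Storing the fingerprint between runs (in a file, for instance) lets a scheduled job report only when the page changed and the keyword is present.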
12
votes
1 answer

rvest - scrape 2 classes in 1 tag

I am new to rvest. How do I extract those elements with 2 class names or only 1 class name in tag? This is my code and issue: doc <- paste("", "", " text1 ", "
addicted
  • 2,901
  • 3
  • 28
  • 49
12
votes
3 answers

Facebook Object Debugger - Could not resolve the hostname into a valid IP address

There is a problem with how Facebook scrapes my page for meta data. When I use the Facebook object debugger I get the following error: I am quite sure this has something to do with how my DNS records are defined. It seems the scraper can't even…
Yaron Levi
  • 12,535
  • 16
  • 69
  • 118
11
votes
2 answers

Get IP addresses from udp and http torrent tracker response

I am trying to get the peer list (a list of IP addresses) from a torrent tracker, similar to the question "how to get the peer list from torrent tracker response". I wrote code which decodes the torrent files using the python bencode Bit-torrent…
Saher Ahwal
  • 9,015
  • 32
  • 84
  • 152
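
For context, trackers commonly return peers in the "compact" form: a byte string with 6 bytes per peer, namely a 4-byte IPv4 address followed by a 2-byte port, both in network byte order. The sketch below decodes such a byte string, assuming the tracker used that format; `decode_compact_peers` is a hypothetical name and the sample bytes are made up for illustration.

```python
import ipaddress
import struct


def decode_compact_peers(blob):
    """Decode a compact peer list into (ip, port) tuples.

    Assumes IPv4 compact format: 4 address bytes + 2 port bytes per peer,
    big-endian (network byte order).
    """
    if len(blob) % 6 != 0:
        raise ValueError("compact peer list length must be a multiple of 6")
    peers = []
    for offset in range(0, len(blob), 6):
        raw_ip, port = struct.unpack_from("!4sH", blob, offset)
        peers.append((str(ipaddress.IPv4Address(raw_ip)), port))
    return peers


# Two made-up peers: 192.168.0.1:6881 and 10.0.0.2:51413
example = bytes([192, 168, 0, 1, 0x1A, 0xE1, 10, 0, 0, 2, 0xC8, 0xD5])
peers = decode_compact_peers(example)
print(peers)
```

In an HTTP tracker response the byte string would come from the `peers` key of the bencoded dictionary; in a UDP announce response the same 6-byte records follow the fixed-size header.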
10
votes
2 answers

Scrapy Body Text Only

I am trying to scrape only the text from the body using Python Scrapy, but haven't had any luck yet. I hope someone can help me scrape all the text from the <body> tag.
mmrs151
  • 3,924
  • 2
  • 34
  • 38
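
To illustrate what "body text only" means in this question, here is a small standard-library sketch that keeps text found inside `<body>` and drops the contents of `<script>` and `<style>`. It is not Scrapy code; `BodyTextExtractor` and `body_text` are hypothetical names, and joining the pieces with single spaces is an arbitrary choice.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collects visible text inside <body>, skipping script and style content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._in_body = False
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self._in_body = True
        elif tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False
        elif tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_body and not self._skip_depth:
            text = data.strip()
            if text:
                self.chunks.append(text)


def body_text(html):
    parser = BodyTextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)


page = (
    "<html><head><title>T</title><style>p{}</style></head>"
    "<body><h1>Hello</h1><script>var x = 1;</script>"
    "<p>World <b>again</b></p></body></html>"
)
text = body_text(page)
print(text)
```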