Questions tagged [webharvest]

Web-Harvest is an open-source web data extraction tool written in Java.

It offers a way to collect desired Web pages and extract useful data from them. To do that, it leverages well-established techniques and technologies for text/XML manipulation, such as XSLT, XQuery, and regular expressions. Web-Harvest mainly focuses on HTML/XML-based websites, which still make up the vast majority of Web content. It can also be easily supplemented with custom Java libraries to augment its extraction capabilities.
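
As a rough illustration of the kind of extraction described above, here is a minimal sketch that uses only the JDK's built-in XPath and regex support. It is not Web-Harvest's own API or configuration format. The class name `ExtractionSketch`, the helper names `selectText` and `firstInt`, and the sample markup are all assumptions made up for this example, and the input is assumed to be well-formed XML/XHTML already.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; plain JDK code, not part of Web-Harvest.
public class ExtractionSketch {

    // Evaluate an XPath expression against a well-formed XML/XHTML string
    // and return the trimmed text content of every matching node.
    static List<String> selectText(String xml, String expression)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent().trim());
        }
        return result;
    }

    // Use a regular expression to pull the first (optionally negative) integer out of text.
    static int firstInt(String text) {
        Matcher m = Pattern.compile("-?\\d+").matcher(text);
        if (!m.find()) {
            throw new IllegalArgumentException("no number in: " + text);
        }
        return Integer.parseInt(m.group());
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample page; a real page would first need to be fetched and cleaned.
        String page = "<html><body>"
                + "<div class=\"item\">Price: 42 EUR</div>"
                + "<div class=\"item\">Price: 17 EUR</div>"
                + "</body></html>";
        for (String itemText : selectText(page, "//div[@class='item']")) {
            System.out.println(firstInt(itemText));
        }
    }
}
```

Running `main` prints 42 and 17 on separate lines. Real-world HTML is often not well-formed, so a cleaning step would normally come before the XPath evaluation.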

71 questions
7
votes
2 answers

web scraping java beginner

I am new to Java and would like to become really good at web scraping and parsing data. Are there any sites related to web scraping that would help me understand how APIs like htmlcleaner, web-harvest, and htmlparser work? I'm still not proficient…
scorpy
  • 81
  • 1
  • 1
  • 4
2
votes
1 answer

how to fix groovy.lang.MissingMethodException: No signature of method: java.util.ArrayList.get() is applicable for argument types: () values: []

I am trying to use this method in Groovy: groupedDocs = reader.selectGroupedDocs(last_update_date.toString()).get(); And this is the part of my Java code for the "selectGroupedDocs" method: private List> selectGroupedDocs(String…
2
votes
1 answer

Trying to grab information in Child Link using WebHarvest

I would like to grab the information from each child link, but the program shows an error. Below is my full config file. The error is Caused by: org.xml.sax.SAXParseException; lineNumber: 6; columnNumber: 724; Element type "t.length" must be followed by…
Jazz
  • 21
  • 3
2
votes
3 answers

Web scraping in PHP - working with some URLs but fails with others

I am doing web scraping with curl for a LinkedIn profile page. If we try to extract data from this URL (http://in.linkedin.com/in/ratneshdwivedi), which is public, it works. When I am logged in to LinkedIn and trying to harvest data from this…
ratnesh dwivedi
  • 352
  • 2
  • 8
2
votes
2 answers

web harvest - scraping an url

I am using web harvest. However, I want to scrape data from the URL: http://derstandard.at/anzeiger/immoweb/Suchergebnis.aspx?Regionen=9&Bezirke=&Arten=&AngebotTyp=&timestamp=1363305908912 My code is:
user2051347
  • 1,609
  • 4
  • 23
  • 34
2
votes
2 answers

webharvest not retrieving data

I have webharvest running without errors, but when I open the XML file it does not have the right data, it just prints it out. Here is my code:
stacktraceyo
  • 1,235
  • 4
  • 16
  • 22
2
votes
2 answers

Reading dynamic web page content in java

I need help reading the contents of a webpage. Currently I am using the following method to read the contents: BufferedReader in = new BufferedReader(new InputStreamReader(page.openStream())); String inputLine; while ((inputLine = in.readLine()) !=…
Rajeshwar
  • 11,179
  • 26
  • 86
  • 158
1
vote
1 answer

Setting http timeout to jakarta HttpClient

I'm using the code below in a WebHarvest configuration file to define a timeout for the http element (WebHarvest uses Jakarta HttpClient). But when I set it to 20000, it takes about 40-50 seconds until the timeout is reached! And when I set…
Ariyan
  • 14,760
  • 31
  • 112
  • 175
1
vote
0 answers

Web harvest failing to convert malformed html to xml

I am using xquery processor in web harvest (from java) to parse an html page that contains an invalid tag inside a
element, like
. The exception is: SXXP0003: Error reported by XML parser: Element type "div" must be followed by…
1
vote
2 answers

What are some good java libraries to search and scrape data out of a web page.

What are some good open source java libraries to search and scrape data out of a web page and stick it into a database. For example, suppose I had a page such as: Address: 123 My Street …
JStark
  • 2,788
  • 2
  • 29
  • 37
1
vote
0 answers

Web scraping with rvest - login not working - flightradar24.com

I'm trying to harvest data from www.flightradar24.com using rvest. I've got a subscription, so I want to log in and get access to more data. This is the code I'm using to log in (I'm using my email and password instead of "email" and…
Eloy
  • 19
  • 3
1
vote
0 answers

http webharvest tag not working with http-param parameters

I am trying the below code in webharvest https://www.athome.com/on/demandware.store/Sites-athome-Site/default/Stores-FindByZip?
1
vote
0 answers

Invoke webharvest function from JavaScript function

I have created a webharvest function. I am able to invoke the function using webharvest code. My challenge is that I need to invoke that webharvest function from a JavaScript function. Is it possible? For example, consider this: Webharvest method …
Sunil Prabakar
  • 442
  • 1
  • 5
  • 19
1
vote
1 answer

Getting data from Wep Pages using Jsoup Java

It's my first question on this site and I hope to stay longer :=) I have read a lot of articles and examined many kinds of examples about extracting specific data from websites using Jsoup. I have already managed to get some values, but I couldn't succeed my…
samio
  • 11
  • 3
1
vote
1 answer

XPath for text after a div?

How could I extract the number "-105" with XPath 1.0/2.0?
-115
Fernando
  • 23
  • 4