
I am trying to build an application that, when provided a .txt file filled with ISBN numbers, will visit the isbn.nu page for each ISBN by simply appending the ISBN to the URL: www.isbn.nu/your-isbn-number.

After pulling up the page, I want to scan it for information about the book and store that in an Excel file.

I was thinking about opening a stream to the URL in Java, but I am not really sure how to extract the information from the HTML page. Storing the information will be done using the JExcel Java package.

My best guess would be using JavaScript to extract the information, but I don't know how to call JavaScript from my Java program.

Is my idea plausible? If not, what do you suggest I do?

My goal: retrieve information from an HTML page and store it in an Excel file for each ISBN in a text file. There can be any number of ISBNs in a text file.

This isn't homework btw, I am simply doing this for an organization that donates books to Sudan. Currently they have 5 people cataloging these books manually and I am one of them.

    Heh, this has to be the first time I've seen a question tagged with both [java] and [javascript] and it wasn't a beginner's mistake. Nice. :) – sarnold Feb 03 '12 at 00:20

5 Answers


Jsoup is a useful tool for parsing a web page and getting data from it. You can do it in Java and it's pretty easy.

You can parse the text file, build each URL as a string, fetch the page with Jsoup, then use Jsoup's selectors to parse the information out of the page's HTML tags. Then you can store it however you want. You really don't need JavaScript at all if you're more comfortable with Java.

Example for reading a page and parsing it with Jsoup:

// requires the Jsoup library (org.jsoup) on the classpath
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a"); // CSS selector for elements on the page
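Building on that, here is a minimal sketch of the whole loop the question describes, assuming one ISBN per line in the text file. Only the URL-building part is shown runnable; the Jsoup fetch and the `h1` selector are left as commented placeholders, since the right selector depends on isbn.nu's actual markup:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class IsbnScraper {

    // Turn the lines of the input file into isbn.nu URLs,
    // skipping blank lines and trimming stray whitespace.
    static List<String> buildUrls(List<String> lines) {
        List<String> urls = new ArrayList<>();
        for (String line : lines) {
            String isbn = line.trim();
            if (!isbn.isEmpty()) {
                urls.add("http://www.isbn.nu/" + isbn);
            }
        }
        return urls;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        for (String url : buildUrls(lines)) {
            // Document doc = Jsoup.connect(url).get();         // fetch with Jsoup
            // String title = doc.select("h1").first().text();  // "h1" is a guess
            // ...write a row out with JExcel here...
            System.out.println(url);
        }
    }
}
```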
AHungerArtist
  • Thanks so much, this seems like my best option right now. Or at least the easiest. I will try and study the source code to see how they implement this stuff. – user1022223 Feb 03 '12 at 00:23
  • @user1022223 If you just want to learn about it by looking at the source code, that's cool, but it is free to use. Don't go implementing it yourself :) It should be easy to add into any Java project, like any other jar. – AHungerArtist Feb 03 '12 at 00:37
  • Btw, if this does end up working for you, feel free to accept the answer :) – AHungerArtist Feb 03 '12 at 04:12

Use a div into which you load your link (an example of how to do that: http://api.jquery.com/load/).

After the load is complete, you can check the names of the divs or spans used in the webpage and get their content with val (http://api.jquery.com/val/) or text (http://api.jquery.com/text/).

Mike

Here is text from the main page of www.isbn.nu:

Please note that isbn.nu is designed for manual searching by individuals. It is not intended as an information resource for automated retrieval, nor as a research tool for companies. isbn.nu reserves the right to deny access based on excessive requests.

Why not just use the free Google Books API, which returns book details in XML format? There are many classes available in Java to parse XML feeds, and that would make your life much easier.

See http://code.google.com/apis/books/ for more info.
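As a sketch of that lookup, here is how a query URL might be built per ISBN. Note the endpoint shown is from the newer "volumes" version of the Books API, which returns JSON; the GData feed linked above returned XML, so treat the URL shape as an assumption:

```java
public class BooksLookup {
    // Build a Books API query URL for one ISBN.
    // The endpoint shape is an assumption based on the v1 "volumes" API.
    static String volumeQueryUrl(String isbn) {
        return "https://www.googleapis.com/books/v1/volumes?q=isbn:" + isbn.trim();
    }

    public static void main(String[] args) {
        System.out.println(volumeQueryUrl("9780140449136"));
        // The response can then be fetched with java.net.HttpURLConnection
        // and parsed with any XML/JSON library.
    }
}
```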

Sean Blaney

Here are the steps needed:

  1. Create a cURL request (you can issue multiple cURL requests)
  2. Get the body data
  3. Parse the data
  4. Write the Excel file

You can read HTML information using this guide.
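In Java, step 3 (parse the data) can be sketched without cURL at all. The helper below pulls a page's <title> out of raw HTML with a regular expression; a real HTML parser is more robust for arbitrary pages, but this shows the idea:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleParser {
    // Extract the text between <title> and </title>, or null if absent.
    static String titleOf(String html) {
        Matcher m = Pattern
                .compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL)
                .matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Sample Book</title></head></html>";
        System.out.println(titleOf(html)); // prints "Sample Book"
    }
}
```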

Fedya Skitsko
  • Fedya, please bear with me on this one, as I am a complete novice when it comes to web programming. Most of my work is in Java and C. That being said, is there a way to continuously make cURL requests for different ISBNs? For instance, if Java had a library to read HTML files (which it might, I need to check on that), I would simply do something like while(!end of file containing isbns) { open stream to html page, get info, store info } – user1022223 Feb 03 '12 at 00:02
  • I am not sure I can help you, because I am a PHP developer. But I know that cURL is a cross-platform library, and you can find out how to make multiple requests. – Fedya Skitsko Feb 03 '12 at 00:36

A simple solution might be to use a Google Docs spreadsheet function like ImportXML(url, xpath-query).

More information and examples here:

Kai Carver