How can I download .xml files and parse a web page (e.g. HTML) using Java?

Question

I have an assignment. First, my program is given an argument containing websites where RSS feeds can be found (for instance: CnnRssFeeds). Then I have to visit these sites and download the RSS feeds (I think these files are usually .xml files, right?).

After that, I have to save the .xml files in a folder on my disk, and finally I have to process them using the Rome library for Java. I will extract information such as title, author, description, link, etc.

Could you help me? I am having trouble visiting each site and downloading (saving) the RSS feeds (as I said above, they are usually .xml files).

@AlexR: When I visit a website where I can find RSS feeds, such as the one I posted above, I want to download the RSS feeds (they are usually .xml files). How can I download these RSS feeds? – limas, Dec 18 '11 at 11:44
@limas: A tiny bit of research (e.g. a simple search of SO) would have answered these questions. – Stephen C, Dec 18 '11 at 11:56
Possible duplicate of [How to download and save a file from internet using Java](http://stackoverflow.com/questions/921262/how-to-download-and-save-a-file-from-internet-using-java) – Stephen C, Dec 18 '11 at 11:57
@Stephen C: I checked the link. Thank you for your time. I appreciate it. – limas, Dec 18 '11 at 12:27

score 3 · Accepted Answer · edited May 23 '17 at 12:15

For downloading files, you can use the first answer to this question (I have tried it, and it works :)).
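
As a rough illustration of the download step, here is a minimal sketch that uses only the JDK. It is not the code from the linked answer; the class name `FeedDownloader`, its method names, and the command-line arguments are hypothetical and chosen only for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper class, named only for this sketch.
class FeedDownloader {

    // Writes everything readable from 'in' to 'targetFile', creating missing
    // parent folders and overwriting an existing file. Returns the byte count.
    static long save(InputStream in, Path targetFile) throws IOException {
        Path parent = targetFile.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        return Files.copy(in, targetFile, StandardCopyOption.REPLACE_EXISTING);
    }

    // Opens the given URL and saves the response body to 'targetFile'.
    static long download(String feedUrl, Path targetFile) throws IOException {
        try (InputStream in = URI.create(feedUrl).toURL().openStream()) {
            return save(in, targetFile);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed usage: java FeedDownloader <feed-url> <target-file>
        long bytes = download(args[0], Paths.get(args[1]));
        System.out.println("Saved " + bytes + " bytes to " + args[1]);
    }
}
```

The sketch does no retry or timeout handling, and it saves whatever the server returns, whether that is XML or HTML.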

And for parsing XML you can use XPath. XPath is used to navigate through elements and attributes in an XML document. This XPath tutorial seems to be pretty good.
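
To make the XPath suggestion concrete, here is a small sketch using the `javax.xml.xpath` API that ships with the JDK. It assumes an RSS 2.0 document without namespaces, where items live under `/rss/channel/item`; the class name `RssTitles` and the sample feed are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, named only for this sketch.
class RssTitles {

    // Returns the text of every <title> element found under /rss/channel/item.
    static List<String> itemTitles(String rssXml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                    "/rss/channel/item/title",
                    new InputSource(new StringReader(rssXml)),
                    XPathConstants.NODESET);
            List<String> titles = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                titles.add(nodes.item(i).getTextContent());
            }
            return titles;
        } catch (XPathExpressionException e) {
            // Raised for malformed XML or a bad expression.
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample feed; a real one would be read from the saved file.
        String sample = "<rss version=\"2.0\"><channel><title>Sample feed</title>"
                + "<item><title>First story</title><link>http://example.com/1</link></item>"
                + "<item><title>Second story</title><link>http://example.com/2</link></item>"
                + "</channel></rss>";
        for (String title : itemTitles(sample)) {
            System.out.println(title);
        }
    }
}
```

Changing the expression to `/rss/channel/item/link` or `/rss/channel/item/description` would pull out the other fields in the same way. Atom feeds use a different structure and namespaces, so this expression would not match them.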

answered Dec 18 '11 at 11:43

narek.gevorgyan


@narek.gevorgyan: Thank you. I will try to solve my problem using your advice, or I will come back and post my trouble. – limas Dec 18 '11 at 11:55
@narek.gevorgyan: Thank you again. Your post helped me and covers exactly one part of the process I need to implement. – limas Dec 18 '11 at 12:26

score 1 · Answer 2 · answered Dec 18 '11 at 12:01

Why so many question marks? If you know how to visit a site, you should have no problem downloading the content of any resource. Your problem is parsing the HTML and extracting the URL of the RSS feed. The feed is referenced from the HTML page using a link tag:

<link rel="alternate" type="application/rss+xml" title="My Feed" href="/feeds/myfeed" />

So, you have to parse the HTML. There are several ways to do this. For example, you can use jsoup or any other parser you like. Once you are able to parse the HTML, you can extract the value of the href attribute (/feeds/myfeed in our example). Now just construct the full URL (resolve /feeds/myfeed against the URL of your page) and download the resource.
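
The extraction and URL-construction steps might look roughly like the sketch below. To stay self-contained it swaps in a simple regular expression instead of jsoup, so it only handles double-quoted attributes and would miss unusual markup; a real HTML parser is the more robust choice. The class name `FeedLinkFinder` and the example.com URLs are assumptions made for this illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; a regex stand-in for a real HTML parser such as jsoup.
class FeedLinkFinder {

    // Matches a whole <link ...> tag.
    private static final Pattern LINK_TAG =
            Pattern.compile("<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Matches name="value" pairs inside a tag (double-quoted values only).
    private static final Pattern ATTR =
            Pattern.compile("([a-zA-Z-]+)\\s*=\\s*\"([^\"]*)\"");

    // Returns absolute URLs of all <link rel="alternate" type="application/rss+xml"> tags.
    static List<String> feedUrls(String html, String pageUrl) {
        URI base = URI.create(pageUrl);
        List<String> feeds = new ArrayList<>();
        Matcher tag = LINK_TAG.matcher(html);
        while (tag.find()) {
            String rel = null;
            String type = null;
            String href = null;
            Matcher attr = ATTR.matcher(tag.group());
            while (attr.find()) {
                String name = attr.group(1).toLowerCase(Locale.ROOT);
                if (name.equals("rel")) rel = attr.group(2);
                else if (name.equals("type")) type = attr.group(2);
                else if (name.equals("href")) href = attr.group(2);
            }
            if ("alternate".equalsIgnoreCase(rel)
                    && "application/rss+xml".equalsIgnoreCase(type)
                    && href != null) {
                // resolve() handles both "/feeds/myfeed" and relative paths like "rss.xml".
                feeds.add(base.resolve(href).toString());
            }
        }
        return feeds;
    }

    public static void main(String[] args) {
        String html = "<html><head>"
                + "<link rel=\"stylesheet\" href=\"/style.css\">"
                + "<link rel=\"alternate\" type=\"application/rss+xml\""
                + " title=\"My Feed\" href=\"/feeds/myfeed\" />"
                + "</head><body></body></html>";
        System.out.println(feedUrls(html, "http://example.com/news/index.html"));
    }
}
```

Using `URI.resolve` rather than plain string concatenation avoids producing a wrong URL when the page URL has a path of its own or the href is already absolute. Filtering on the `type` attribute is what separates feed links from stylesheets and other link tags.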

AlexR


@AlexR: Thank you. This is one part of the answer I am looking for to solve my problem. It worked ;-) – limas Dec 18 '11 at 12:24
@limas, what's the second part you are looking for? – AlexR Dec 18 '11 at 12:31
The second part was downloading an .xml or .html page. That was answered in the post below. Could I ask you something about parsing a web page? I want to extract only the RSS hrefs from all the hrefs contained in a page. Is there any attribute that distinguishes RSS hrefs from other hrefs? I want to get RSS hrefs from different RSS feed websites. – limas Dec 18 '11 at 13:36
Looking at the structure of the RSS hrefs, I can see that the name of a feed shown on a website is placed just before the suffix /rss.xml. So, is this a way to identify RSS hrefs? – limas Dec 18 '11 at 13:46
Actually, I realised that there is no general strategy for finding the RSS links of every RSS feed website. It depends on the structure of the website. Finally, I would like to thank you for everything. – limas Dec 18 '11 at 15:26
