0

I have about 3200 URLs to small XML files that contain some data as strings. The XML files are displayed (not downloaded) when I go to the URLs. I need to extract some data from all of those XML files and save it in a single .txt file, XML file, or any other format. How can I automate this process?

*Note: This is what the files look like. I need to copy the 'location' and 'title' from each of them and put them in one single file. What approach can achieve this?

<?xml version="1.0"?>
<playlist xmlns="http://xspf.org/ns/0/" version="1">
  <tracklist>
    <location>http://radiotool.com/fransn.mp3</location>
    <title>France, Paris radio 104.5</title>
  </tracklist>
</playlist>

*edit: Fixed XML.

ankit rawat

2 Answers

2

It's easy enough with XQuery or XSLT, though the details will depend on how the URLs are held. If they're in a Java List, then (with Saxon at least) you can supply this list as a parameter to the following query:

declare variable $urls as xs:string* external;
<data>{
  for $u in $urls return doc($u)//*:tracklist
}</data>

The Java code would be something like:

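import net.sf.saxon.s9api.*;            // Saxon s9api classes
import java.util.ArrayList;
import java.util.List;

Processor proc = new Processor(false);  // false: no licensed features required
XQueryCompiler c = proc.newXQueryCompiler();
XQueryEvaluator q = c.compile(query).load();  // query: a String holding the XQuery above
List<XdmItem> urls = new ArrayList<>();
for (String url : inputUrls) {          // inputUrls: your collection of URL strings
  urls.add(new XdmAtomicValue(url));
}
// Bind the list to the external variable $urls declared in the query
q.setExternalVariable(new QName("urls"), new XdmValue(urls));
q.setDestination(...)
q.run();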
Michael Kay
  • The URLs can be generated in a for loop; they follow a numerical pattern like this: www.abcd.1.xml, www.abcd.2.xml, etc., up to www.abcd.3200.xml. – ankit rawat Mar 13 '13 at 01:52
  • That's very easy then because you can generate them algorithmically with the query. – Michael Kay Mar 14 '13 at 15:54
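
For readers who would rather stay in plain Java, here is a minimal sketch of the same idea using only the JDK's DOM parser instead of Saxon and XQuery. It is not the approach from the answer above, just an alternative under stated assumptions: the class name PlaylistExtractor, the output file stations.txt, and the URL prefix "http://www.abcd." with suffix ".xml" are all hypothetical placeholders based on the pattern mentioned in the comment.

```java
import java.io.InputStream;
import java.io.PrintWriter;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class PlaylistExtractor {

    // Builds prefix + 1 + suffix, prefix + 2 + suffix, ... up to count.
    // The prefix and suffix are assumptions; substitute the real pattern.
    public static List<String> buildUrls(String prefix, String suffix, int count) {
        List<String> urls = new ArrayList<>();
        for (int i = 1; i <= count; i++) {
            urls.add(prefix + i + suffix);
        }
        return urls;
    }

    // Parses one playlist and returns "location<TAB>title" for each <tracklist> element.
    public static List<String> extract(InputStream xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // the files declare the XSPF namespace
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(xml);

        // "*" matches any namespace, like *:tracklist in the XQuery above
        NodeList lists = doc.getElementsByTagNameNS("*", "tracklist");
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < lists.getLength(); i++) {
            Element tracklist = (Element) lists.item(i);
            rows.add(text(tracklist, "location") + "\t" + text(tracklist, "title"));
        }
        return rows;
    }

    // Returns the trimmed text of the first descendant with the given local name, or "".
    private static String text(Element parent, String localName) {
        NodeList nodes = parent.getElementsByTagNameNS("*", localName);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // stations.txt is a hypothetical output file name
        try (PrintWriter out = new PrintWriter("stations.txt", "UTF-8")) {
            for (String u : buildUrls("http://www.abcd.", ".xml", 3200)) {
                try (InputStream in = URI.create(u).toURL().openStream()) {
                    for (String row : extract(in)) {
                        out.println(row);
                    }
                } catch (Exception e) {
                    // Report the failing URL and continue with the rest
                    System.err.println("Skipping " + u + ": " + e.getMessage());
                }
            }
        }
    }
}
```

Each line of the output file holds one location and one title separated by a tab. A URL that fails to load or parse is reported on stderr and skipped, so one bad file does not stop the other 3199.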
0

Have a look at the JSoup library here: http://jsoup.org/

It has facilities for fetching and cleaning up the contents of a URL. It is intended for HTML, though, so I'm not sure how well it will work for XML, but it is worth a look.

Chris Cooper