0

I have the following link:

https://hero.epa.gov/hero/ws/swift.cfc?method=getProjectRIS&project_id=993&getallabstracts=true

I want to parse this XML to get only the text, like:

Provider: HERO - 2.xx
DBvendor=EPA
Text-encoding=UTF-8

How can I parse it?

user1631306
  • If you view the source code of that page, you'll see the [wddxPacket](https://en.wikipedia.org/wiki/WDDX) you mentioned. You *might* be able to [parse it as XML](https://stackoverflow.com/questions/3962866/what-is-the-easiest-way-to-extract-plain-text-from-an-xml-document)... though I haven't tried. – showdev May 23 '17 at 18:34
  • You can install ARC ([Advanced Rest Client](https://chrome.google.com/webstore/detail/advanced-rest-client/hgmloofddffdnphfgcellkdfbfbjeloo?utm_source=chrome-app-launcher-info-dialog)) from the Chrome Web Store to get more control over the headers sent and to inspect the request and response headers and content. – cyberbrain May 23 '17 at 18:55

3 Answers

2

Well, it's not a text file, it's an HTML file. If you open the link in a browser and select "View source", you will see the text enclosed in <char> tags.

When it's opened in a browser, these tags and the other HTML content are interpreted and the output is rendered on the page (that's why it looks like plain text). If you want to implement similar behavior in Java, you should look into PhantomJS and/or Jsoup examples.

Darshan Mehta
0

It looks like a text file, but it is an XML file and the browser just displays its text content. To verify, right-click the page and view the page source.
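Since the response is XML, one option is to parse it with the XML parser that ships with the JDK and collect the text yourself. The following is only a minimal sketch: it assumes the response is a WDDX packet whose text sits in `<string>` elements, with control characters such as newlines encoded as `<char code='0a'/>` children (that is how the WDDX format represents them, but check it against the actual page source). The class name `WddxText`, the method `extractText`, and the inline sample packet are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class WddxText {

    /**
     * Returns the text of every <string> element in the packet.
     * <char code="xx"/> children are decoded from their hexadecimal code.
     */
    public static String extractText(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            StringBuilder out = new StringBuilder();
            NodeList strings = doc.getElementsByTagName("string");
            for (int i = 0; i < strings.getLength(); i++) {
                NodeList children = strings.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    short type = child.getNodeType();
                    if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                        // Ordinary character data: copy it through unchanged.
                        out.append(child.getNodeValue());
                    } else if (type == Node.ELEMENT_NODE
                            && "char".equals(child.getNodeName())) {
                        // Assumed WDDX encoding: code is a hex value, e.g. 0a = newline.
                        String code = ((Element) child).getAttribute("code");
                        out.append((char) Integer.parseInt(code, 16));
                    }
                }
            }
            return out.toString();
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers see a single exception type.
            throw new IllegalArgumentException("Could not parse WDDX packet", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample shaped like the assumed response, not the real payload.
        String sample = "<wddxPacket version='1.0'><header/><data><string>"
                + "Provider: HERO - 2.xx<char code='0a'/>"
                + "DBvendor=EPA<char code='0a'/>"
                + "Text-encoding=UTF-8"
                + "</string></data></wddxPacket>";
        System.out.println(extractText(sample));
    }
}
```

To run this against the live link, the response body would first have to be downloaded into a string (or the URL handed to `DocumentBuilder.parse(String uri)`); that step is left out here because the exact shape of the real response has not been verified.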

wero
0

You can use a library like Jsoup to parse the file and get its contents.

https://jsoup.org/cookbook/introduction/parsing-a-document

Metalhead