
I want to extract the string between <a: href> and </a: href> from the following:

<a: href> https://0.0.0.1/abcd/openthis.pdf </a: href>

using StringTokenizer, split or Scanner.
I'm trying to use StringTokenizer with <a: href> and </a: href> as delimiters, but it's not working. I tried escaping <, > and :, but that doesn't seem to be the problem. My guess is that it won't accept a word or a phrase as a delimiter.

Gilles 'SO- stop being evil'
user967850

1 Answer


You can give regex a try.

Try this regex: >\s+(.*?)\s+<

Keep in mind that the regex solution will only work once you have already extracted this string:

< a: href > https://0.0.0.1/abcd/openthis.pdf < /a: href>
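As a minimal sketch of how that regex could be applied in Java: the class name HrefExtractor and the helper extract are hypothetical names chosen for illustration, and the input is the sample string from the question. The pattern requires at least one whitespace character on each side of the captured text, so it would not match if the URL sat directly against the tags.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefExtractor {
    // The regex from the answer: lazily capture the text between '>' and '<',
    // leaving out the whitespace that surrounds it.
    private static final Pattern BETWEEN_TAGS = Pattern.compile(">\\s+(.*?)\\s+<");

    // Hypothetical helper (not from the original post): returns the first
    // captured text, or an empty Optional when the pattern does not match.
    public static Optional<String> extract(String input) {
        Matcher m = BETWEEN_TAGS.matcher(input);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String input = "<a: href> https://0.0.0.1/abcd/openthis.pdf </a: href>";
        System.out.println(extract(input).orElse("no match"));
    }
}
```

This also relates to the guess in the question: StringTokenizer treats every character of its delimiter string as a separate delimiter, so a whole phrase such as <a: href> cannot act as a single delimiter there, whereas a regex can match it as a unit.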

In general, use an HTML parser to extract text from HTML.

Here is a reason why you should not parse HTML with regex.

I would give HtmlCleaner a try.

HtmlCleaner is a Java library used to safely parse and transform any HTML found on the web into well-formed XML. It is designed to be small, fast, flexible and independent. HtmlCleaner may be used in Java code, as a command-line tool or as an Ant task. The result of parsing is a lightweight document object model which can easily be transformed to standards like DOM or JDOM, or serialized to XML output in various ways (compact, pretty-printed and so on).

You can use XPath with HtmlCleaner to get the contents within XML/HTML tags. Here is a nice example: XPath Example

Community
RanRag