First of all, thanks for the help. I've been stuck on this issue for a week. I googled and searched here, but found no answers for Java, only for Python and other languages that I don't know.

I'm using Java to develop an application that searches for a pair of strings and gets the text between them. For example:

<A name=1></a>Some text with break lines<A name=300></a>

The main issue is that I need to get the text between these two markers, grab it, and add it to a StringBuffer.

I did this:

Pattern regex   = Pattern.compile("<A name=1><\\/a>((.|\\s)+?)<A name=300><\\/a>");
Matcher matcher = regex.matcher(htmlFileReading);

if (matcher.find()) {
    System.out.println("Found");
    System.out.println(matcher.groupCount());
}

It works, but when I try it on a larger input (not even a very large one), it throws a StackOverflowError.

How can I get the text between these two marks? Thanks a lot, and sorry for my bad English.

digoferra
  • Doesn't this work? And btw, `(.|\\s)+?` is the same as `.+?`. – Keppil Jul 23 '12 at 13:53
  • Post the case where it doesn't work please. – Garrett Hall Jul 23 '12 at 13:57
  • It works, but it throws Exception in thread "main" java.lang.StackOverflowError. htmlFileReading is an HTML file containing these markers with text and line breaks between them. I need to get the text in the middle, but it gives me the error. Thanks. – digoferra Jul 23 '12 at 14:07
  • This expression won't cause a StackOverflowError, you probably have some kind of endless recursion in your search method. Can you post it? – Keppil Jul 23 '12 at 14:12
  • It's in the main post. I think changing the if to a while might fix it... – digoferra Jul 23 '12 at 14:14
  • Hi. The overwhelming recommendation here is that you don't parse HTML with regular expressions. See here for more '*helpful*' information: http://stackoverflow.com/a/1732454/626796 – Tharwen Jul 23 '12 at 14:14
  • @Tharwen that always makes me laugh. Thanks. But seriously OP. [jsoup HTML parser](http://jsoup.org/) will make short work of your HTML file in 5 minutes. Regular expressions are just horrid and unmaintainable when you try to do data extraction from HTML. – Strelok Jul 23 '12 at 14:27
  • Thanks, but I'm reading it as a text file that happens to contain HTML. Shouldn't it work just like normal text? – digoferra Jul 23 '12 at 14:32

2 Answers

I'm not certain this is right, but try something like this to get 'light' recursion:

// .* before and after if needed; DOTALL lets '.' match line breaks as well
Pattern regex = Pattern.compile(".*<A name=1></a>(.*?)<A name=300></a>.*", Pattern.DOTALL);
System.out.println(regex.matcher(myStringToSearchInside).replaceAll("$1"));

Edited to include newlines.
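
For illustration, here is a minimal self-contained sketch of the same idea using find() and a capture group instead of replaceAll. A repeated alternation group such as (.|\s)+? is a commonly reported cause of StackOverflowError in Java on long inputs, so the sketch relies on Pattern.DOTALL and a plain lazy .*? instead. The class name BetweenMarkers, the helper extract, and the sample input are hypothetical, and StringBuilder stands in for the StringBuffer mentioned in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BetweenMarkers {
    // DOTALL makes '.' match line terminators too, so no (.|\s) alternation is needed.
    private static final Pattern BETWEEN = Pattern.compile(
            "<A name=1></a>(.*?)<A name=300></a>", Pattern.DOTALL);

    /** Returns the text between the two markers, or null if they are not both present. */
    static String extract(String input) {
        Matcher matcher = BETWEEN.matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample input; in the question this would be htmlFileReading.
        String html = "header\n<A name=1></a>Some text\nwith break lines<A name=300></a>\nfooter";

        StringBuilder collected = new StringBuilder();
        String middle = extract(html);
        if (middle != null) {
            collected.append(middle);
        }
        System.out.println(collected);
    }
}
```

If the same pair of markers can occur several times, replacing the single find() with a while (matcher.find()) loop and appending each group(1) should collect every occurrence.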

cl-r

If your objective is to extract the text from the XML, XSLT is the recommended tool.

Joe M