Regex to get text in the middle of two marks

Question

First of all, thanks for the help. I've been stuck on this issue for a week. I googled it and searched here, but found no Java answers, only ones for Python and other languages that I don't know.

I'm using Java to develop an application that searches for a pair of strings and gets the text between them. For example:

<A name=1></a>Some text with break lines<A name=300></a>

The main issue is that I need to get the text between these two markers, then grab this text and add it to a StringBuffer.

I did this:

Pattern regex   = Pattern.compile("<A name=1><\\/a>((.|\\s)+?)<A name=300><\\/a>");
Matcher matcher = regex.matcher(htmlFileReading);

if (matcher.find()) {
    System.out.println("Found");
    System.out.println(matcher.groupCount());
}

It works, but when I try it on a larger input, even one that is not very big, it throws a StackOverflowError.

How can I get the text between these two marks? Thanks a lot, and sorry for my bad English.

Doesn't this work? And btw, `(.|\\s)+?` is the same as `.+?`. – Keppil, Jul 23 '12 at 13:53
It works, but it gives me Exception in thread "main" java.lang.StackOverflowError. htmlFileReading holds an HTML file containing these marks, with text and line breaks between them. I need to get the text in the middle, but it gives me the error. Thanks. – digoferra, Jul 23 '12 at 14:07
This expression won't cause a StackOverflowError, you probably have some kind of endless recursion in your search method. Can you post it? – Keppil, Jul 23 '12 at 14:12
It's in the main post. I think changing the if to a while might fix it... – digoferra, Jul 23 '12 at 14:14
Hi. The overwhelming recommendation here is that you don't parse HTML with regular expressions. See here for more '*helpful*' information: http://stackoverflow.com/a/1732454/626796 – Tharwen, Jul 23 '12 at 14:14
@Tharwen that always makes me laugh. Thanks. But seriously OP. [jsoup HTML parser](http://jsoup.org/) will make short work of your HTML file in 5 minutes. Regular expressions are just horrid and unmaintainable when you try to do data extraction from HTML. – Strelok, Jul 23 '12 at 14:27
Thanks, but I'm reading it as a text file; the text file just has HTML in it. Shouldn't it work just like normal text?! – digoferra, Jul 23 '12 at 14:32
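
Since the file is being treated as plain text, one regex-free option is to locate the two markers with String.indexOf and take the substring between them. The following is only a minimal sketch, assuming both markers appear literally as in the question; the class name BetweenMarks and the method between are hypothetical names chosen for illustration.

```java
public class BetweenMarks {
    // Returns the text between the first occurrence of 'start' and the next
    // occurrence of 'end' after it, or null if either marker is missing.
    // (Hypothetical helper, not part of the original post.)
    static String between(String text, String start, String end) {
        int from = text.indexOf(start);
        if (from < 0) {
            return null;
        }
        from += start.length();
        int to = text.indexOf(end, from);
        if (to < 0) {
            return null;
        }
        return text.substring(from, to);
    }

    public static void main(String[] args) {
        // Sample input modeled on the example in the question.
        String html = "<A name=1></a>Some text\nwith break lines<A name=300></a>";
        StringBuffer buffer = new StringBuffer();
        String middle = between(html, "<A name=1></a>", "<A name=300></a>");
        if (middle != null) {
            buffer.append(middle);
        }
        System.out.println(buffer);
    }
}
```

Because this does a simple linear scan with no backtracking, the size of the input should not affect stack depth.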

cl-r · Accepted Answer · 2012-07-23T14:47:36.823

1

I'm not certain this is right, but try something like this to get 'light' recursion:

// .* before and after if needed
Pattern regex   = Pattern.compile(".*<A name=1><\\/a>(.*?)<A name=300><\\/a>.*");
System.out.println(regex.matcher(myStringToSearchInside).replaceAll("$1"));

Edited to include newlines

edited Jul 23 '12 at 14:47

answered Jul 23 '12 at 14:23

cl-r


I need to get the content at the (.*) to extract it and work with it. Thanks – digoferra Jul 23 '12 at 14:33
@RodrigoFerrari As it is, it extracts the data between your tags; the first and last .* can be dropped if you only need the central (.*). – cl-r Jul 23 '12 at 14:39
It did not capture the line breaks :( – digoferra Jul 23 '12 at 14:41
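
The missing line breaks are most likely because, in Java, `.` does not match line terminators unless Pattern.DOTALL (or the inline flag `(?s)`) is set. Below is a minimal sketch of that variant, reading the captured text with group(1) and appending it to a StringBuffer as the question asks. The class name DotallExtract, the method extract, and the sample input are assumptions made for illustration. Replacing the `(.|\s)+?` alternation with a plain `.*?` should also help with the StackOverflowError, since Java's regex engine is known to recurse on repeated alternation groups, but this has not been tested against the original file.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotallExtract {
    // DOTALL makes '.' match line terminators too, so the (.|\s)
    // alternation from the question is not needed.
    private static final Pattern BETWEEN = Pattern.compile(
            "<A name=1></a>(.*?)<A name=300></a>", Pattern.DOTALL);

    // Returns the text between the two markers, or null if they are not found.
    // (Hypothetical helper, not part of the original answer.)
    static String extract(String text) {
        Matcher m = BETWEEN.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Sample input modeled on the example in the question.
        String html = "header<A name=1></a>Some text\nwith break lines<A name=300></a>footer";
        StringBuffer buffer = new StringBuffer();
        String middle = extract(html);
        if (middle != null) {
            buffer.append(middle);
        }
        System.out.println(buffer);
    }
}
```

Using find() with group(1) avoids the need for the leading and trailing `.*` that the replaceAll approach relies on.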

score 0 · Answer 2 · answered Jul 23 '12 at 14:14

0

If your objective is to extract the text from XML, using XSLT is recommended.

answered Jul 23 '12 at 14:14

Joe M


It's a text file with HTML inside. – digoferra Jul 23 '12 at 14:33
