I currently have a program that can find all the regex matches in a string. For a different part of the program, however, I want both the parts that match the regex and the parts that don't.

So if I had <h1> hello world </h1>, I would want to be able to split it up into [<h1>, hello world, </h1>].

Does anyone have any ideas on how they would go about this?

Here is my code that scans the string for the parts that match the regex:

ArrayList<String> foundTags = new ArrayList<String>();
Pattern p = Pattern.compile("<(.*?)>");
Matcher m = p.matcher(HTMLLine);
while(m.find()){
    foundTags.add(m.group(0));
}
  • See this post: http://stackoverflow.com/a/1732454/1154145 – nattyddubbs Mar 26 '13 at 01:59
  • @nattyddubbs Yeah, normally I would agree with you, except that I already have a function that can successfully tell whether it's HTML or just text. I am looking for a way to split the string every time I find a match for the regex above, where "<h1> hello world </h1>" = [h1, hello world, /h1]. The logic for telling whether it's HTML is already written and tested in another part of the code. – Tall Paul Mar 26 '13 at 02:13
  • Valid html: `<>`. I'm just saying text processing on HTML isn't that reliable. Continue at your own risk... – nattyddubbs Mar 26 '13 at 02:18
  • @leonbloy I thought of that but I was not sure how to write that with the regex as the splitter – Tall Paul Mar 26 '13 at 02:18
  • @nattyddubbs I would agree with you, but the problem I am trying to solve gives me HTML as plain text, and I need to figure out how to parse through it correctly. Once I find either text or an HTML tag, I turn it into an object that will make it easier to work with. – Tall Paul Mar 26 '13 at 02:20

2 Answers

For example:

String text = "testing<hi>bye</hi><b>bla bla!";
Pattern p = Pattern.compile("<(.*?)>");
Matcher m = p.matcher(text);
int last_match = 0; // index just past the previous match
List<String> splitted = new ArrayList<>();
while (m.find()) {
    // text between the previous match and this one (may be empty)
    splitted.add(text.substring(last_match, m.start()));
    // the matched tag itself
    splitted.add(m.group());
    last_match = m.end();
}
// whatever follows the last match
splitted.add(text.substring(last_match));
System.out.println(splitted.toString());

prints [testing, <hi>, bye, </hi>, , <b>, bla bla!]

Is that what you want? You can easily fix the code to omit empty elements if you don't want them:

while (m.find()) {
    if(last_match != m.start())
        splitted.add(text.substring(last_match,m.start()));
    splitted.add(m.group());
    last_match = m.end();
}
if(last_match != text.length())
    splitted.add(text.substring(last_match));

Bear in mind, as pointed out in the comments: using regex to parse arbitrary HTML/XML is in general a bad idea.
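For reference, the loop above can be wrapped in a small reusable method. This is only a sketch under the same assumptions as the snippets above: the class name TagSplitter and the method name split are made up for illustration, and the pattern is the one from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names TagSplitter and split are not from the original post.
class TagSplitter {
    // Same pattern as in the question: "<", the shortest possible run of characters, then ">".
    private static final Pattern TAG = Pattern.compile("<(.*?)>");

    // Returns the tags and the text between them, in document order, skipping empty pieces.
    static List<String> split(String text) {
        List<String> parts = new ArrayList<>();
        Matcher m = TAG.matcher(text);
        int lastMatchEnd = 0; // index just past the previous match
        while (m.find()) {
            if (m.start() > lastMatchEnd) {
                // non-matching text between the previous tag and this one
                parts.add(text.substring(lastMatchEnd, m.start()));
            }
            parts.add(m.group()); // the tag itself
            lastMatchEnd = m.end();
        }
        if (lastMatchEnd < text.length()) {
            // trailing text after the last tag
            parts.add(text.substring(lastMatchEnd));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("<h1> hello world </h1>"));
        System.out.println(split("testing<hi>bye</hi><b>bla bla!"));
    }
}
```

For the string in the question this returns three elements. The text element keeps its surrounding spaces, so call trim() on it if you don't want them.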

  • Yeah, I know it's not the best way to parse HTML, but it's the required way, and if I am given an invalid tag I have a method to check each tag. Thanks! – Tall Paul Mar 26 '13 at 02:33
You can use the regex grouping ability to retrieve the different parts of the match. For example:

ArrayList<String> list = new ArrayList<String>();
Pattern p = Pattern.compile("(<.*?>)(.*)(<.*?>)");
Matcher m = p.matcher("<h1> Hello World </h1>");
while(m.find()){
    list.add(m.group(1));
    list.add(m.group(2));
    list.add(m.group(3));
}

This would give you the list you wanted: ["<h1>", " Hello World ", "</h1>"]. Note that group number 0 is the full matched expression.
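One thing to watch for with this pattern: the middle group `(.*)` is greedy, so it fits a single pair of tags but swallows any tags in between when the line contains more than one pair. A small sketch to see the difference (the class name GreedyGroupDemo and the method name middleGroup are made up for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class GreedyGroupDemo {
    // The three-group pattern from the answer above.
    private static final Pattern THREE_PARTS = Pattern.compile("(<.*?>)(.*)(<.*?>)");

    // Returns the text captured by the middle group, or null when the pattern does not match.
    static String middleGroup(String input) {
        Matcher m = THREE_PARTS.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // One pair of tags: the middle group is just the text.
        System.out.println(middleGroup("<h1> Hello World </h1>"));
        // Two pairs of tags: the middle group also contains the inner tags.
        System.out.println(middleGroup("<h1>a</h1><b>c</b>"));
    }
}
```

With the second input the middle group comes back as `a</h1><b>c`, so this approach is best kept for lines you know contain exactly one opening and one closing tag.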

  • Is there a way to find the number of groups? – Tall Paul Mar 26 '13 at 02:55
  • I believe the number of groups depends on how many groups you make. In the regex, parens separate groups, so in the expression above, `(<.*?>)(.*)(<.*?>)`, there are 3 sets of parens so 3 groups (4 if you count the whole expression). – Jimmy Lee Mar 26 '13 at 03:01
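To get that number programmatically, `Matcher.groupCount()` returns the number of capturing groups in the pattern, not counting group 0. A minimal sketch (the class name GroupCountDemo and the method name countGroups are made up for illustration):

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class GroupCountDemo {
    // Counts the capturing groups in a regex. groupCount() depends only on the
    // pattern, so matching against an empty string is enough.
    static int countGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(countGroups("(<.*?>)(.*)(<.*?>)")); // three sets of parens
        System.out.println(countGroups("<(.*?)>"));            // one set of parens
    }
}
```

Non-capturing groups such as `(?:...)` are not included in the count.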