How to use Java regex to extract ONLY tags which have some text in them

Question

I want to count all tags that have some text in them. I don't want to count tags that contain only spaces and no text. For example, the program below also counts the first tag, which contains only spaces and no value. So the output is: Total occurrences are -3 [ , orange gsdggg , p wfwfw ear]. That is wrong; it should report only 2 occurrences: [orange gsdggg , p wfwfw ear].

Please help me to figure this out.

My program is -

    public static void main(String[] args) {
        String source1;
        source1 = "<tag>              </tag>          <b>hello</b>      <tag>       orange  gsdggg  </tag>  <tag>p wfwfw   ear</tag>";

        System.out.println(Arrays.toString(getTagValues(source1).toArray()));
    }


    private static List<String> getTagValues(String str) {

        if (str.indexOf("&amp;") != -1) {
            str = str.replaceAll("&amp;", "&"); // replace &amp; by &
            // System.out.println("removed &amp formatted--" + source);
        }

        final Pattern TAG_REGEX = Pattern.compile("<tag>(.+?)</tag>");
        final List<String> tagValues = new ArrayList<String>();
        final Matcher matcher = TAG_REGEX.matcher(str);
        int count = 0;
        while (matcher.find()) {
            tagValues.add(matcher.group(1));
            count++;
        }
        System.out.println("Total occurance is -" + count);
        return tagValues;
    }

[Why are you using regex to parse HTML?](http://stackoverflow.com/q/1732348/1393766) — Pshemo, Apr 01 '15 at 18:41

Pshemo · Accepted Answer · 2015-04-01T19:00:31.173

You can simply check that the captured text is not empty after trimming it, like this:

while (matcher.find()) {
    if (!matcher.group(1).trim().isEmpty()){
        tagValues.add(matcher.group(1));
        count++;
    }
}
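
For context, here is a minimal, self-contained sketch of how that check could fit into the question's method. The class name `TagTextExtractor` is made up for illustration, and the `&amp;` replacement step from the question is left out to keep the example short; the pattern and the sample input are taken from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this sketch.
final class TagTextExtractor {

    // Same pattern as in the question: lazily captures the content of each <tag>...</tag> pair.
    private static final Pattern TAG_REGEX = Pattern.compile("<tag>(.+?)</tag>");

    static List<String> getTagValues(String str) {
        List<String> tagValues = new ArrayList<>();
        Matcher matcher = TAG_REGEX.matcher(str);
        while (matcher.find()) {
            String value = matcher.group(1);
            // Keep the match only if something is left after trimming whitespace.
            if (!value.trim().isEmpty()) {
                tagValues.add(value);
            }
        }
        return tagValues;
    }

    public static void main(String[] args) {
        String source = "<tag>              </tag>          <b>hello</b>      "
                + "<tag>       orange  gsdggg  </tag>  <tag>p wfwfw   ear</tag>";
        List<String> values = getTagValues(source);
        // The whitespace-only tag is skipped, so two values remain.
        System.out.println("Total occurrences: " + values.size());
        System.out.println(values);
    }
}
```

The list size replaces the separate `count` variable, since both would always hold the same number here. The stored values keep their surrounding spaces; store `value.trim()` instead if those are not wanted.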

In any case, you should avoid using regex to parse XML or HTML and use a parser instead. One of the easiest to work with (at least IMO) is jsoup, with which your entire code can look like this:

String source = "<tag>              </tag>          <b>hello</b>      <tag>       orange  gsdggg  </tag>  <tag>p wfwfw   ear</tag>";

Document doc = Jsoup.parse(source);
Elements elements = doc.select("tag:matches(\\S)"); // finds <tag> elements whose text contains at least one non-whitespace character

System.out.println("Total occurance is -" + elements.size());
for (Element el: elements){
    System.out.println(el.text());
}
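
If adding jsoup as a dependency is not possible, a similar idea can be sketched with the XML parser that ships with the JDK. Note that this is a different technique from the jsoup code above and rests on a strong assumption: the input must be well-formed XML, which real-world HTML often is not. The class name `NonBlankTagFinder`, the method name, and the wrapping `<root>` element are all made up for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical class name, used only for this sketch.
final class NonBlankTagFinder {

    static List<String> nonBlankTagTexts(String fragment) {
        // XML allows only one top-level element, so wrap the fragment in an assumed <root>.
        String xml = "<root>" + fragment + "</root>";
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            List<String> texts = new ArrayList<>();
            NodeList tags = doc.getElementsByTagName("tag");
            for (int i = 0; i < tags.getLength(); i++) {
                String text = tags.item(i).getTextContent().trim();
                // Skip <tag> elements that contain only whitespace.
                if (!text.isEmpty()) {
                    texts.add(text);
                }
            }
            return texts;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Thrown, for example, when the fragment is not well-formed XML.
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        String source = "<tag>              </tag>          <b>hello</b>      "
                + "<tag>       orange  gsdggg  </tag>  <tag>p wfwfw   ear</tag>";
        List<String> texts = nonBlankTagTexts(source);
        System.out.println("Total occurrences: " + texts.size());
        for (String text : texts) {
            System.out.println(text);
        }
    }
}
```

Unlike jsoup, this parser rejects unclosed tags and undeclared entities, so it is only a reasonable choice when the markup is known to be strict XML.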

edited Apr 01 '15 at 19:00

answered Apr 01 '15 at 18:53


Hi @Pshemo. I confirmed part 1 which works fine in my code. Actually I am using webdriver to get outerHtml value and then performing test on that value. In the above case I was trying to take care of the scenario when any tag is empty. Do you suggest I should use Jsoup? I have a deadline and have no idea about Jsoup. – For Testing Apr 01 '15 at 19:39
I am not sure what you are trying to achieve. If your page contains any JavaScript code (which, for instance, generates some HTML), then a simple parser like jsoup will not help you, because it doesn't have a JS engine (it is not a browser emulator like web-driver). Anyway, the general idea is not to use regex for a parser's job. – Pshemo Apr 01 '15 at 19:45
Thanks a lot for replying. In short I want to check some values on the page for example pixels (which will be in script tag) or title tag value and verify its existence on the page. – For Testing Apr 01 '15 at 20:43
I am sorry, but I am still not sure what you mean by "*I want to check some values on the page for example pixels*" and how they are related to `script` or `title` tags. How would you like to verify their existence? Also, what do you mean by "existence"? (I am not an HTML developer, so I may not be the best person to help you with this problem, since it is starting to differ from what your original question was about.) – Pshemo Apr 01 '15 at 20:56
For example, I want to verify the following value in the page source --> and to verify this I was just trying to make sure that it is present in the source and that it has a value. Thanks for your help; maybe it's not related to the question I asked here. I will write a new question. – For Testing Apr 02 '15 at 13:01
