3

I am trying to create a method that finds and returns the first tag in a given HTML string, and returns null if no such tag is found. (A tag would be something like <b>.)

I looked through the String class methods but couldn't find one that suits this purpose. My plan is to scan the string for a "<" and, once it is found, scan for the following ">", but I am unsure how to do so. I'm also wondering whether I should put a while/for loop in there. Help is appreciated, thank you.

public class HTMLProcessor {

    public static void main(String[] args) {
        System.out.println(findFirstTag("<b>The man jumped.</b>"));
    }

    // Returns the first tag (from '<' through the next '>'), or null if none is found.
    public static String findFirstTag(String text) {
        int firstIndex = text.indexOf("<");
        if (firstIndex < 0) {
            return null;
        }
        // Search for '>' starting at firstIndex so the result is an index into text itself.
        int secondIndex = text.indexOf(">", firstIndex);
        if (secondIndex < 0) {
            return null; // the '<' was never closed
        }
        return text.substring(firstIndex, secondIndex + 1);
    }
}
Freedom

3 Answers

2

Use regular expressions.

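// requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
// CASE_INSENSITIVE lets [A-Z] match lowercase tag names such as <b> as well.
Pattern p = Pattern.compile("<([A-Z][A-Z0-9]*)\\b[^>]*>(.*?)</\\1>", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(yourText);

Calling m.find() will then match things like <b>this is bold</b>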

Pshemo
zero_dev
  • Wrong language? This looks like C#, he is doing it in Java. – Josh M Jan 06 '14 at 18:16
  • Sorry for the confusion with my previous comment. I assumed the OP's question was more complex. Anyway, I think it would be good to reduce your regex so it only matches the opening tag. This way you will also find elements with no body like ``. – Pshemo Jan 06 '14 at 18:38
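
As a rough illustration of the suggestion in the comment above, the pattern can be cut down to its opening-tag part. This is only a sketch, not the answer author's code: the class name FirstTagRegex is made up for the example, and the CASE_INSENSITIVE flag is an assumption added so that lowercase tags such as <b> are matched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstTagRegex {
    // Opening-tag part of the pattern from the answer; the flag is an added assumption.
    private static final Pattern OPENING_TAG =
            Pattern.compile("<([A-Z][A-Z0-9]*)\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the first opening tag in text, or null if there is none.
    static String findFirstTag(String text) {
        Matcher m = OPENING_TAG.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(findFirstTag("<b>The man jumped.</b>")); // <b>
        System.out.println(findFirstTag("no tags here"));           // null
    }
}
```

Because the tag name must start with a letter, closing tags like </b> are skipped, while elements without a body (for example <br>) are still found.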
2

You can try the indexOf() and lastIndexOf() methods of the String class.

If you are doing this many times and in many places, though, you really want an HTML parser. Just pick one; Jsoup is one of the best.

And prefer not to lean on regex when dealing with HTML strings.

Community
Suresh Atta
  • actually there is almost nothing available that really can guarantee to parse failure-free html-tokens. Try to [validate](http://validator.w3.org/) current HTML pages - plenty of pages (almost any news page I currently validate for my master thesis) fail the test as they don't contain valid HTML (some even include errors on purpose to prevent correct parsing or at least try to make it as hard as possible) – Roman Vottner Jan 06 '14 at 18:22
  • You can't back up the claim that regex can't parse html. The OP is only looking for the first tag, regex will do fine. He doesn't need `Jsoup` for just one tag. For instance, `/<.*?>/` would even work (on a limited dataset). – kddeisz Jan 06 '14 at 18:25
  • Yeah, I may have been a little harsh on regex, agreed. No one is safe. I suggested Jsoup because he might be dealing with HTML parsing in the long run. Edited my post a little. Thanks for the feedback. – Suresh Atta Jan 06 '14 at 18:27
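
For readers wondering why the `/<.*?>/` pattern mentioned in the comments has the `?` in it, here is a small sketch comparing the reluctant and greedy forms in Java. The class and helper names (QuantifierDemo, firstMatch) are made up for illustration, and Java takes the pattern without the surrounding slashes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns the first match of regex in text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<b>The man jumped.</b>";
        // Reluctant: stops at the first '>' it can reach.
        System.out.println(firstMatch("<.*?>", html)); // <b>
        // Greedy: runs on to the last '>' on the line.
        System.out.println(firstMatch("<.*>", html));  // <b>The man jumped.</b>
    }
}
```

As the comment says, this only holds up on a limited dataset: a '>' inside an attribute value, for example, would end the match early.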
1

Take a look at Java regular expressions here. If you need an introduction to regex, look here. This is probably the quickest way to accomplish what you're looking for.

kddeisz