HTML tag finder

Question

I am trying to write a method that finds and returns the first tag in a given HTML string, or returns null if no tag is found. (A tag would be something like <b>.)

I looked through the String class methods but couldn't find one that suits this purpose. My plan is to scan the text for a "<" and, once it is found, scan for the next ">", but I am unsure how to do that. I'm also wondering whether I need a while/for loop in there. Help is appreciated, thank you.

public class HTMLProcessor {

    public static void main(String[] args) {
        System.out.println(findFirstTag("<b>The man jumped.</b>"));
    }

    // Returns the first tag (e.g. "<b>") in text, or null if there is none.
    public static String findFirstTag(String text) {
        int firstIndex = text.indexOf("<");
        if (firstIndex >= 0) {
            String newText = text.substring(firstIndex);
            // secondIndex is relative to newText, not to text,
            // so the tag has to be cut out of newText.
            int secondIndex = newText.indexOf(">");
            if (secondIndex >= 0) {
                return newText.substring(0, secondIndex + 1);
            }
        }
        return null;
    }

}

Look at the `String` class, I'm sure you could find the `indexOf(String)` of each `<` with its corresponding `>`. — Josh M, Jan 06 '14 at 18:11
`for (String s : "\n\n\n\n\n".split("\\s")) if (s.matches("<.?*>")) // you found your/a tag` — Roman Vottner, Jan 06 '14 at 18:16
Could you check to see if my revised code is right? Thanks for helping people — Freedom, Jan 06 '14 at 18:24
@LoyalKnight My bad, I assumed you also want to find content of tag which shouldn't be done with regex. I removed my comment. — Pshemo, Jan 06 '14 at 18:34

score 2 · Answer 1 · edited Jan 06 '14 at 18:24


Use regular expressions.

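// CASE_INSENSITIVE is needed so that [A-Z] also matches lower-case tag names.
Pattern p = Pattern.compile("<([A-Z][A-Z0-9]*)\\b[^>]*>(.*?)</\\1>", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(yourText);

This will match things like <b>this is bold</b>.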
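To show how a `Matcher` built from a pattern like this is actually used, here is a minimal sketch. The class name `ElementRegexDemo` and the method `firstElementName` are made up for illustration, and the pattern is compiled with `Pattern.CASE_INSENSITIVE` on the assumption that lower-case tag names should match too. Note that it only finds elements that have both an opening and a matching closing tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ElementRegexDemo {

    // The answer's pattern, compiled case-insensitively so that the
    // [A-Z] character classes also accept lower-case names such as "b".
    static final Pattern ELEMENT = Pattern.compile(
            "<([A-Z][A-Z0-9]*)\\b[^>]*>(.*?)</\\1>", Pattern.CASE_INSENSITIVE);

    // Returns the tag name of the first element that has an opening tag
    // and a matching closing tag, or null if no such element exists.
    static String firstElementName(String text) {
        Matcher m = ELEMENT.matcher(text);
        // find() scans forward for the first match; group(1) is the tag name.
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstElementName("<b>The man jumped.</b>")); // b
        System.out.println(firstElementName("no tags here"));           // null
    }
}
```

Calling `m.group(2)` after a successful `find()` would give the text between the two tags instead of the tag name.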

edited Jan 06 '14 at 18:24 by Pshemo
answered Jan 06 '14 at 18:14 by zero_dev


Wrong language? This looks like C#, he is doing it in Java. – Josh M Jan 06 '14 at 18:16
Sorry for the confusion with my previous comment. I assumed the OP's question was more complex. Anyway, I think it would be good to reduce your regex so that it only matches the opening tag. That way you will also find elements with no body, like ``. – Pshemo Jan 06 '14 at 18:38
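A regex reduced to the opening tag only might look like the sketch below. The pattern, the class name `OpeningTagFinder` and the method name are assumptions made for illustration, not something taken from the answer. It treats a tag as `<`, a name starting with a letter, and everything up to the next `>`, so it will be fooled by a `>` inside an attribute value.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OpeningTagFinder {

    // Assumed pattern: '<', a tag name, then any characters up to the next '>'.
    // Closing tags such as "</b>" are skipped because '/' is not a letter.
    private static final Pattern OPENING_TAG =
            Pattern.compile("<[A-Za-z][A-Za-z0-9]*\\b[^>]*>");

    // Returns the first opening tag in text, or null if there is none.
    static String findFirstTag(String text) {
        Matcher m = OPENING_TAG.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(findFirstTag("<b>The man jumped.</b>")); // <b>
        System.out.println(findFirstTag("Line one<br>line two"));   // <br>
        System.out.println(findFirstTag("a < b"));                  // null
    }
}
```

Because the closing tag is no longer required, elements without a body are found as well.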

score 2 · Accepted Answer · edited May 23 '17 at 12:28


You can try the indexOf() and lastIndexOf() methods of the String class.

If you are going to do this many times or in many places, you really want an HTML parser. Just pick one; Jsoup is one of the best HTML parsers.

And avoid leaning too heavily on regex when dealing with HTML strings.
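Going back to the indexOf() suggestion at the top of this answer, a minimal sketch could look like the following. It assumes a "tag" is simply everything from the first `<` to the next `>` after it, and the class name `IndexOfTagFinder` is made up for illustration. It uses the two-argument `indexOf(int ch, int fromIndex)` overload so that no intermediate substring is needed.

```java
final class IndexOfTagFinder {

    // Returns the text from the first '<' through the next '>' after it,
    // or null if either character is missing.
    static String findFirstTag(String text) {
        int open = text.indexOf('<');
        if (open < 0) {
            return null; // no '<' at all
        }
        // Start searching just after the '<' so an earlier '>' is ignored.
        int close = text.indexOf('>', open + 1);
        if (close < 0) {
            return null; // '<' was never closed
        }
        return text.substring(open, close + 1);
    }

    public static void main(String[] args) {
        System.out.println(findFirstTag("Hi <b>there</b>")); // <b>
        System.out.println(findFirstTag("no tags here"));    // null
    }
}
```

No loop is needed for the first tag alone; a loop only becomes necessary when collecting every tag in the string.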

edited May 23 '17 at 12:28 by Community
answered Jan 06 '14 at 18:15 by Suresh Atta

actually there is almost nothing available that really can guarantee to parse failure-free html-tokens. Try to [validate](http://validator.w3.org/) current HTML pages - plenty of pages (almost any news page I currently validate for my master thesis) fail the test as they don't contain valid HTML (some even include errors on purpose to prevent correct parsing or at least try to make it as hard as possible) – Roman Vottner Jan 06 '14 at 18:22
You can't back up the claim that regex can't parse html. The OP is only looking for the first tag, regex will do fine. He doesn't need `Jsoup` for just one tag. For instance, `/<.*?>/` would even work (on a limited dataset). – kddeisz Jan 06 '14 at 18:25
Yeah, I may have been a little harsh on regex, agreed. No one is safe. The reason I suggested Jsoup is that he might be dealing with HTML parsing in the long run. Edited my post a little. Thanks for the feedback. – Suresh Atta Jan 06 '14 at 18:27

score 1 · Answer 3 · answered Jan 06 '14 at 18:14


Take a look at java regular expressions here. If you need an introduction to regex look here. This is probably the quickest way to accomplish what you're looking for.

answered Jan 06 '14 at 18:14 by kddeisz
