
I'm having a hard time getting my regex to work, or at least I'm pretty sure that's where the problem is.

Here's an example of the source code. I'm trying to get all the normal text from the whole source, word by word, with no numbers or special symbols.

<a href="/public/">A university of fine tradition, dynamic study life and international possibilities.<span></span> </a>

Here's the part of the code.

String theRegex = "</>>(\\w+)</<> ";
String str2Check = "<a href=\"/public/\">A university of fine tradition, dynamic study life and international possibilities.<span></span> </a>";

Pattern p = Pattern.compile(theRegex, Pattern.MULTILINE);
Matcher m = p.matcher(str2Check);
if (m.find()) {
    System.out.println(m.group(1));
}

I've tried different regex combinations, but somehow I can't get them right (probably because I keep mixing them up).

Hopefully you can understand what I'm asking here. Thank you.

  • You might wanna have a closer look at this (famous) SO question: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – benzonico Feb 07 '16 at 11:53
  • I don't know what the purpose of your example is, but you might be interested in the Jsoup library. See an answer here: http://stackoverflow.com/questions/240546/remove-html-tags-from-a-string – Jan Bouchner Feb 07 '16 at 11:57
  • Regex is not the tool to use here, as mentioned already. If it's not for arbitrary HTML (you know the input), separate the tags from what's around them: `<[^>]+>|(\b[^><]+)` and grab matches from the first capture group. [See demo and explanation at regex101](https://regex101.com/r/xF3jJ3/2). – bobble bubble Feb 07 '16 at 12:38
  • It is not clear what you want. You have an input, but you have not described what the output should be. – matt Feb 07 '16 at 12:54
  • I'll try to clarify what I want as the end result. As someone already answered here, the output I want looks somewhat like: A university of fine tradition, dynamic study life and international possibilities. All on separate lines. But if I run the code on a website, I'm also left with: <![endif]--> – Iron Ingåt Feb 07 '16 at 13:11
  • @bobblebubble That worked surprisingly well, although I have no idea what that regex means. It's almost perfect, but I'm still left with: $('.footer-loginbutton').css('border-radius', '0px'); $('#bib-tabs .submitbutton').css('border-radius', '0px'); Can you think of anything to get rid of that stuff too? And could you explain that regex? – Iron Ingåt Feb 07 '16 at 13:25
  • That's why regex is not for parsing HTML. You probably have some `script` blocks in there; you'd need to provide the source. [You can try like this demo](https://regex101.com/r/dE1qW9/1). Your desired parts will now be in the **second** capture group. – bobble bubble Feb 07 '16 at 13:36
  • @bobblebubble I think you sent that in Python or something, it won't work. – Iron Ingåt Feb 07 '16 at 16:32
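
For readers who, like the asker, find the `<[^>]+>|(\b[^><]+)` pattern from the comments opaque: the first alternative consumes a whole tag and captures nothing, while the second captures a run of text containing no `<` or `>`. So after each match, group 1 is `null` when a tag was matched and holds text otherwise. Below is a minimal Java sketch of that idea; the class name `TagSplitDemo` and the helper `wordsOutsideTags` are made up for illustration, and the whitespace splitting is an assumption about what "word by word" means.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagSplitDemo {
    // Alternation from the comment: either a whole tag, or a run of non-tag text.
    // Only the second alternative sits inside a capture group.
    private static final Pattern TAG_OR_TEXT = Pattern.compile("<[^>]+>|(\\b[^><]+)");

    // Hypothetical helper: returns the whitespace-separated tokens found outside tags.
    public static List<String> wordsOutsideTags(String html) {
        List<String> words = new ArrayList<>();
        Matcher m = TAG_OR_TEXT.matcher(html);
        while (m.find()) {
            String text = m.group(1);
            if (text == null) {
                continue; // this match was a tag, so there is nothing to keep
            }
            for (String token : text.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    words.add(token);
                }
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/public/\">A university of fine tradition, dynamic study life "
                + "and international possibilities.<span></span> </a>";
        for (String word : wordsOutsideTags(html)) {
            System.out.println(word); // one token per line
        }
    }
}
```

Note that in a Java string literal every backslash of the regex has to be doubled (`\\b`), which is a common reason a pattern copied from regex101 seems not to work.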

1 Answer

I'm having a hard time getting my regex to work

If I understood you correctly, you are looking for a regex that removes the HTML tags (anything between `<` and `>`) and gives you the remaining string tokens.

Here is a quick code snippet:

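import java.util.StringTokenizer;

public class Main {
    public static void main(String[] args) {
        String str2Check = "<a href=\"public\">A university of fine tradition, dynamic study life and international possibilities.<span></span></a>";
        // Remove every tag whose contents consist only of the listed characters
        String newString = str2Check.replaceAll("\\<[a-zA-Z0-9.,; /=\"]+\\>", "");

        // Split the remaining text on whitespace and print one token per line
        StringTokenizer st = new StringTokenizer(newString);
        while (st.hasMoreTokens()) {
            System.out.println(st.nextToken());
        }
    }
}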

Output:

A
university
of
fine
tradition,
dynamic
study
life
and
international
possibilities.
user2004685
  • Almost. I tested your code on a website and I get a lot of this (example): – Iron Ingåt Feb 07 '16 at 13:03
  • That's because you also have a dot `.` inside `<>`, which I had not accounted for in the regex. I've updated the solution. You can add any other characters that you think can appear between `<>` to the `a-zA-Z0-9. /=\"` list. Try it now. – user2004685 Feb 07 '16 at 13:16
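
Rather than growing the character list one symbol at a time, one possible variation is to match any tag with `<[^>]+>` (which also covers leftovers such as `<![endif]-->`) and to drop `script` and `style` blocks and comments before that, since their contents are not page text. The sketch below is an illustration under those assumptions, not the answer's original code: the class `PlainWords` and the method `extractWords` are invented names, and keeping only runs of letters (`\p{L}+`) is one reading of "no numbers or special symbols". It is still regex-based, so it will not cope with arbitrary HTML the way a real parser would.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlainWords {
    // Runs of Unicode letters only, so digits and punctuation are dropped.
    private static final Pattern LETTERS = Pattern.compile("\\p{L}+");

    // Hypothetical helper: strips markup first, then collects letter-only words.
    public static List<String> extractWords(String html) {
        String text = html
                // (?is) = case-insensitive, and '.' also matches line breaks
                .replaceAll("(?is)<script\\b.*?</script>", " ")
                .replaceAll("(?is)<style\\b.*?</style>", " ")
                // HTML comments, including conditional comments
                .replaceAll("(?s)<!--.*?-->", " ")
                // any remaining tag, whatever characters it contains
                .replaceAll("<[^>]+>", " ");

        List<String> words = new ArrayList<>();
        Matcher m = LETTERS.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String html = "<div><script>$('.footer-loginbutton').css('border-radius', '0px');</script>"
                + "<a href=\"/public/\">A university of fine tradition.</a></div>";
        for (String word : extractWords(html)) {
            System.out.println(word); // one word per line
        }
    }
}
```

One side effect of the letters-only rule is that words containing apostrophes or hyphens are split into pieces, and entity names such as `&amp;` would leave `amp` behind; those cases would need extra handling.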