How to get HTML tag data in Java with a regular expression

Question

I want to get the <form> tag data from HTML code in Java. I have extracted the HTML code into a string, but I am not able to get the data from the tags. Can anyone tell me how to do it with regular expressions? I don't want to use a parser because it's a one-time job.

The example is as below

<html>
<head>
   <title>new Start</title>
</head>

<body onLoad="document.forms[0].submit();">
<form action="http://www.google.com"   method="post">
    <input type=hidden name="NUMBER" value="123456">
    <input type=hidden name="mode" value="display">
    </form>
</body>
</html>

I need the value of the action attribute, and the name and value of each input.

[Don't parse HTML with regex!](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — Biffen, Mar 30 '16 at 09:54
Use an HTML parser. There are plenty of those for most languages, including Java. — Biffen, Mar 30 '16 at 09:59
Is there any inbuilt parser in java for HTML ? I don't want to use outside libraries. — User3091, Mar 30 '16 at 10:01
'I don't want to use parser because its a one time job' it would have been quicker to use a parser than to write this question — beresfordt, Mar 30 '16 at 10:18

score 1 · Answer 1 · answered Mar 30 '16 at 10:00

You should not really use RegEx to parse HTML; you should use an HTML parser. There are plenty around for Java. However, if you really want to use RegEx, here's how.

To get the action="..." data, use the following RegEx:

action="(.*?)"

The data will be stored inside Capture Group #1

Live Demo on Regex101

How it works:

action=        # Select the action= attribute
"(.*?)"        # Capture the data inside the quotes
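In Java, this pattern could be applied roughly as follows. This is a minimal sketch: the class name ActionExtractor and the helper extractAction are made up for illustration, and it assumes the attribute value is wrapped in double quotes, as in the question's HTML.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ActionExtractor {
    // Same pattern as above; the backslashes escape the quotes inside the Java string literal.
    private static final Pattern ACTION = Pattern.compile("action=\"(.*?)\"");

    /** Returns the first action="..." value, or null if there is none. (Hypothetical helper.) */
    static String extractAction(String html) {
        Matcher matcher = ACTION.matcher(html);
        // find() scans for the next match; group(1) is only valid after it returns true.
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<form action=\"http://www.google.com\"   method=\"post\">";
        System.out.println(extractAction(html)); // prints http://www.google.com
    }
}
```

Note that find() is used rather than matches(), because matches() only succeeds when the whole input string fits the pattern.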

To get the input name and value, use the following RegEx:

input.*?name="(.*?)"\s*value="(.*?)"

The name will be stored in Capture Group #1, and the value in Capture Group #2

Live Demo on Regex101

How it works:

input        # Select the opening input tag name
.*?          # Optional Data
name=        # Select the name= attribute
"(.*?)"      # Capture the data inside the quotes
\s*          # Optional Whitespace
value=       # Select the value= attribute
"(.*?)"      # Capture the data inside the quotes
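To collect every name/value pair in Java, the pattern can be applied in a loop. The sketch below uses hypothetical names (InputExtractor, extractInputs) and assumes what the pattern itself assumes: name comes directly before value, both are double-quoted, and each input tag sits on a single line.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InputExtractor {
    // Same pattern as above; \\s is the Java-literal form of the regex \s.
    private static final Pattern INPUT =
            Pattern.compile("input.*?name=\"(.*?)\"\\s*value=\"(.*?)\"");

    /** Returns name -> value for each matching input tag, in document order. (Hypothetical helper.) */
    static Map<String, String> extractInputs(String html) {
        Map<String, String> inputs = new LinkedHashMap<>();
        Matcher matcher = INPUT.matcher(html);
        // Each call to find() advances to the next input tag that fits the pattern.
        while (matcher.find()) {
            inputs.put(matcher.group(1), matcher.group(2));
        }
        return inputs;
    }

    public static void main(String[] args) {
        String html = "<input type=hidden name=\"NUMBER\" value=\"123456\">\n"
                    + "<input type=hidden name=\"mode\" value=\"display\">";
        System.out.println(extractInputs(html)); // prints {NUMBER=123456, mode=display}
    }
}
```

Inputs whose attributes are in a different order, or that use single quotes, would be skipped by this pattern, which is one reason the comments recommend a parser.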

answered Mar 30 '16 at 10:00

Kaspar Lee


Hi. As I am using Java, I used: – User3091 Mar 30 '16 at 10:05
Pattern pattern = Pattern.compile("action=\"(.*?)\""); Matcher matcher = pattern.matcher(htmlString); – User3091 Mar 30 '16 at 10:05
But it is still not matching. – User3091 Mar 30 '16 at 10:05
It is working now. Sorry, I was using the debugger: it showed a match the first time, and when I evaluated the same call again to confirm, it tried to match again and returned false. – User3091 Mar 30 '16 at 10:11
@RasikaKulkarni If this answered your question, could you please mark it as accepted (press the tick underneath the voting buttons)? Thanks! – Kaspar Lee Mar 30 '16 at 10:13
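The behaviour described in the comments above (a match that succeeds once and then fails) is consistent with Matcher being stateful: each find() continues from where the previous one stopped. The sketch below illustrates this under the assumption that the debugger was evaluating find() a second time on the same Matcher; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherStateDemo {
    private static final Pattern ACTION = Pattern.compile("action=\"(.*?)\"");

    /** Calls find() twice on the same Matcher and returns both results. */
    static boolean[] findTwice(String html) {
        Matcher matcher = ACTION.matcher(html);
        boolean first = matcher.find();  // locates the first match
        boolean second = matcher.find(); // resumes searching after the first match
        return new boolean[] { first, second };
    }

    /** Same as findTwice, but rewinds the Matcher between the two calls. */
    static boolean[] findResetFind(String html) {
        Matcher matcher = ACTION.matcher(html);
        boolean first = matcher.find();
        matcher.reset();                 // start again from the beginning of the input
        boolean second = matcher.find();
        return new boolean[] { first, second };
    }

    public static void main(String[] args) {
        String html = "<form action=\"http://www.google.com\" method=\"post\">";
        System.out.println(Arrays.toString(findTwice(html)));     // prints [true, false]
        System.out.println(Arrays.toString(findResetFind(html))); // prints [true, true]
    }
}
```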

score 0 · Answer 2 · answered Mar 30 '16 at 10:22

You can use Jsoup (http://jsoup.org/). I do this in Scala, but it's the same in Java (it was originally written for Java). For example:

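import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document connection = Jsoup.connect(url)
    .followRedirects(false) // otherwise you'll get into a redirect loop
    .timeout(3000) // time out after 3000 ms rather than hanging
    .userAgent("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36") // just copied from Google
    .referrer("http://www.google.com")
    .get(); // returns an org.jsoup.nodes.Document, not a String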

This only fetches the HTML page; you can then parse it easily with the variables below. I also normalised the URL first with (if (url.startsWith("http://") || url.startsWith("https://")) url else "http://" + url), which is Scala syntax, but you don't have to if you know all URLs are valid.

Then make another variable:

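// needs java.util.List and java.util.stream.Collectors
List<String> facebookLinks = connection
    .getElementsByAttributeValueContaining("href", "facebook.com")
    .stream()
    .map(x -> x.attr("href")) // read the href attribute of each matched element
    .collect(Collectors.toList());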

This is just an example; you can search for any other URL in the HTML page. Note that for getElementsByAttributeValueContaining the second parameter is a plain substring, not a regex: it finds every element whose attribute value contains that text (case-insensitively). Mapping over the result then pulls whichever attribute you ask for from each matched element; here I asked for href, but you can ask for any other attribute.

or you can also use

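// needs java.util.List and java.util.stream.Collectors
List<String> feedLinks = connection
    .getElementsByAttributeValueMatching("type", "rss|atom")
    .stream()
    .map(x -> x.attr("href")) // read the href attribute of each matched element
    .collect(Collectors.toList());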

Use this one if you're looking for a pattern match: here the second parameter is a regex, and it finds every element whose attribute value matches it. As before, mapping over the result pulls whichever attribute you ask for from each matched element; here I asked for href, but you can ask for any other attribute.

Hope this helps...
