
I want to get <form> tag data from HTML code in Java. I have extracted the HTML into a string, but I am not able to get the data out of the tags. Can anyone tell me how to do it with regular expressions? I don't want to use a parser because it's a one-time job.

The example is below:

<html>
<head>
   <title>new Start</title>
</head>

<body onLoad="document.forms[0].submit();">
<form action="http://www.google.com"   method="post">
    <input type=hidden name="NUMBER" value="123456">
    <input type=hidden name="mode" value="display">
    </form>
</body>
</html>

I need the value of the form's action attribute and the name and value of each input.

Mr Lister
User3091

2 Answers


You should not really use RegEx to parse HTML; you should get an HTML parser, and there are plenty around for Java. However, if you really want to use RegEx, here's how.


To get the action="..." data, use the following RegEx:

action="(.*?)"

The data will be stored in Capture Group #1.

Live Demo on Regex101
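
In Java, applying this pattern could look roughly like the sketch below. The class name FormActionExtractor and the method extractAction are made up for illustration, and it assumes the attribute is written with double quotes, as in the question's HTML.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not from the original answer.
class FormActionExtractor {
    // Same pattern as above; group 1 is the text between the quotes.
    private static final Pattern ACTION = Pattern.compile("action=\"(.*?)\"");

    /** Returns the first action="..." value found in the HTML, if there is one. */
    static Optional<String> extractAction(String html) {
        Matcher m = ACTION.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<form action=\"http://www.google.com\"   method=\"post\">";
        // Prints: http://www.google.com
        System.out.println(extractAction(html).orElse("(no action found)"));
    }
}
```

Only the first match is returned, which is enough for a page with a single form.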

How it works:

action=        # Match the literal text action=
"(.*?)"        # Capture everything between the quotes, lazily (group 1)

To get the input name and value, use the following RegEx:

input.*?name="(.*?)"\s*value="(.*?)"

The name will be stored in Capture Group #1, and the value in Capture Group #2.

Live Demo on Regex101
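
A minimal Java sketch for collecting every name/value pair with this pattern might look like this. FormInputExtractor and extractInputs are hypothetical names, and it assumes each input keeps name and value on one line, in that order, with double quotes (true for the question's HTML, not for HTML in general).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not from the original answer.
class FormInputExtractor {
    // Same pattern as above; group 1 is the name, group 2 is the value.
    private static final Pattern INPUT =
            Pattern.compile("input.*?name=\"(.*?)\"\\s*value=\"(.*?)\"");

    /** Returns name -> value for every input the pattern matches, in document order. */
    static Map<String, String> extractInputs(String html) {
        Map<String, String> inputs = new LinkedHashMap<>();
        Matcher m = INPUT.matcher(html);
        while (m.find()) { // find() moves on to the next match each time it is called
            inputs.put(m.group(1), m.group(2));
        }
        return inputs;
    }

    public static void main(String[] args) {
        String html =
                "<input type=hidden name=\"NUMBER\" value=\"123456\">\n"
              + "<input type=hidden name=\"mode\" value=\"display\">";
        // Prints "NUMBER = 123456" and then "mode = display"
        extractInputs(html).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Calling find() in a loop is what picks up both inputs; a single find() would stop after the first one.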

How it works:

input        # Match the literal text input (the tag name)
.*?          # Lazily skip any characters (e.g. type=hidden) up to name=
name=        # Match the literal text name=
"(.*?)"      # Capture the name between the quotes (group 1)
\s*          # Optional whitespace
value=       # Match the literal text value=
"(.*?)"      # Capture the value between the quotes (group 2)
Kaspar Lee

You can use Jsoup (http://jsoup.org/). I use it from Scala, but it is a Java library and works the same way in Java. For example:

// Needs org.jsoup.Jsoup and org.jsoup.nodes.Document; get() throws IOException
Document connection = Jsoup.connect(url)
        .followRedirects(false) // otherwise you'll get into a loop
        .timeout(3000) // milliseconds; also avoids a loop
        .userAgent("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36") // just copied from Google
        .referrer("http://www.google.com")
        .get();

This just fetches the HTML page; you can then query it easily with the next variables. I also normalize the URL first, prepending "http://" when it is missing (if (url.startsWith("http://") || url.startsWith("https://")) url else "http://" + url), but you don't have to if you know all your URLs are valid.
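
In plain Java, that URL check could be written roughly as follows. UrlNormalizer and withScheme are hypothetical names used only for this sketch.

```java
// Hypothetical helper class; the names are illustrative, not from the original answer.
class UrlNormalizer {
    /** Prepends "http://" unless the URL already starts with http:// or https://. */
    static String withScheme(String url) {
        if (url.startsWith("http://") || url.startsWith("https://")) {
            return url;
        }
        return "http://" + url;
    }

    public static void main(String[] args) {
        System.out.println(withScheme("example.com"));         // prints http://example.com
        System.out.println(withScheme("https://example.com")); // prints https://example.com
    }
}
```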

Then make another variable:

// Needs java.util.List
List<String> urls = connection
        .getElementsByAttributeValueContaining("href", "facebook.com")
        .eachAttr("href"); // the href of every matched element

Here "facebook.com" is only an example; you can search for any other text you expect in the page. The second parameter is a plain substring, not a regex: the call finds every element whose href attribute contains it. The result then holds the requested attribute of each matched element. Here I asked for href, but you can ask for any other attribute.

Alternatively, you can use:

// Needs java.util.List
List<String> urls = connection
        .getElementsByAttributeValueMatching("type", "rss|atom")
        .eachAttr("href"); // the href of every matched element

This one is for when you need a pattern match: here the second parameter is a regex, and the call finds every element whose type attribute matches it. As far as I know, jsoup looks for the pattern anywhere in the value, so anchor it with ^ and $ if you need a whole-value match. As before, the result holds the requested attribute of each matched element. Here I asked for href, but you can ask for any other attribute.

Hope this helps...

Neta Oren