0

I'm trying to write a correct regex for extracting a value from HTML, but I'm having some problems.

Here is a piece of the HTML:

<div class="inner">
<div class="title">Processing 3-D Secure Transaction</div>
<form autocomplete="off" name="PAResForm" id="PAResForm" action="https://www.alfaportal.ru/" method="POST">
<input name="MD" type="hidden" value="4326381105C3B67B2823E71FD235FFD2"><input value="eJzVWFmvo0iy/iulnkerm9UYt1xdQtJ2pkQdOVw5AW2qGv+is66Q
qrz9LBZ3mCe7mJzYARdloC1dJ/Lk+nQ7KBxxdgtIEgy/Tp/I93MZ5NtZzfdTnPdj5vfz7tex6I/n
4P8DRkGf4Q==" name="PaRes" type="hidden"> 

I'm trying to find the string

<input name="MD" type="hidden" value="4326381105C3B67B2823E71FD235FFD2">

and extract its value.

The problem is that the value and name attributes can appear in either order. For example:

<input value="4326381105C3B67B2823E71FD235FFD2" type="hidden" name="MD">

I wrote regex pattern:

<input.*name=\"MD\"|value=\"([^<>]*?)\"[^<>]*value=\"([^<>]*?)\"|name=\"MD\".*?>

It works in some online regex testers, but not in actual Java.

Please help me fix it.

I also wrote a simple command-line tool for testing it: http://pastebin.com/Pzynqrn8

RChugunov
  • Are you sure that you really have to use a regexp for your task? Maybe you can use some tool instead? – Pavlo K. Oct 21 '13 at 08:25
  • 1
    This answer might help you decide not to use regexp... http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – SWilk Oct 21 '13 at 08:31

7 Answers

2

I would try something like this:

<input\s*?(value=['"].*?['"]\s*)|(type=['"].*?["']\s*)|(name=['"].*?['"]\s*)\>
Opalosolo
2

There are a lot of tools for HTML parsing. I think you should not ignore them. It was discussed here.

Pavlo K.
  • 1
    It's perfectly fine to use regular expressions to parse a known subset of HTML. There really is no need for a full-blown parser here. – Marius Schulz Oct 21 '13 at 08:31
  • 1
    I thought so for some time, and learned the hard way that it is not a good idea. Do not parse HTML that is out of your control with regexps: minor changes can easily invalidate them, whereas DOM parsing would still cope. – SWilk Oct 21 '13 at 08:46
2

I do not know how to do that in Java, but I would strongly recommend using proper Document Object Model (DOM) tools.

In PHP I would do it like this:

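$xml = new DOMDocument();
// loadHTML() tolerates unclosed tags such as <input>; loadXML() would reject them
$xml->loadHTML($yourHTMLHere);
$xpath = new DOMXPath($xml);
// find the MD input inside the form, regardless of attribute order
$node = $xpath
    ->evaluate('//form[@name="PAResForm"]//input[@name="MD"]')
    ->item(0);
$yourValueIsHere = $node->getAttribute('value');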

Five statements, totally readable, and it does not care about attribute order. Java can surely do the same thing; just look for the corresponding classes.

And do not parse a non-regular language with regular expressions. HTML is not a regular language.
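For readers who want the Java counterpart, here is a minimal sketch using only the JDK's built-in DOM and XPath classes. It rests on one important assumption: the JDK parser accepts well-formed XML only, so the sample markup below closes its input tags, which the page in the question does not. The class name XPathMdValue, the method extractMd, and the shortened sample markup are made up for illustration; for real-world HTML you would need an HTML-aware parser instead.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original answer.
class XPathMdValue {

    // Returns the value attribute of the MD input, or "" if no such input exists.
    // Assumption: the markup is well-formed XML (every tag is closed).
    static String extractMd(String wellFormedMarkup) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(wellFormedMarkup)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Same idea as the PHP query, with the attribute selected directly.
        return xpath.evaluate(
                "//form[@name='PAResForm']//input[@name='MD']/@value", doc);
    }

    public static void main(String[] args) throws Exception {
        // Shortened, well-formed sample with value placed before name.
        String markup =
                "<form name=\"PAResForm\" method=\"POST\">"
              + "<input value=\"4326381105C3B67B2823E71FD235FFD2\" type=\"hidden\" name=\"MD\"/>"
              + "<input value=\"abc\" type=\"hidden\" name=\"PaRes\"/>"
              + "</form>";
        System.out.println(extractMd(markup));
    }
}
```

The XPath expression never mentions attribute order, which is the point of the DOM approach.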

SWilk
1

I'd use a lookahead in a pattern like this:

<input(?=[^>]+?name="MD")[^>]+?value="([A-Z0-9]+)"

You're basically saying that you're looking for an <input> element with a name of MD. That's the lookahead: (?=[^>]+?name="MD"), which doesn't consume any characters but makes sure the name attribute is present. You're then simply matching the contents of the value attribute in the first capturing group: ([A-Z0-9]+).

It might be helpful to write the pattern in free spacing mode:

<input               # opening input tag
(?=[^>]+?name="MD")  # lookahead looking for the presence of the name attribute
[^>]+?               # anything (whitespace, other attributes) up to ...
value="([A-Z0-9]+)"  # the value attribute and its value
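To try the lookahead pattern from Java, something along these lines should work. This is only a sketch: the class name LookaheadMdValue and the method extractMd are invented for the example, and the pattern is the single-line form shown above with the double quotes escaped for a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the lookahead pattern.
class LookaheadMdValue {

    // The lookahead checks for name="MD" inside the tag without consuming it,
    // so the value attribute may come before or after the name attribute.
    private static final Pattern MD_INPUT = Pattern.compile(
            "<input(?=[^>]+?name=\"MD\")[^>]+?value=\"([A-Z0-9]+)\"");

    // Returns the captured value, or null when no matching input tag is found.
    static String extractMd(String html) {
        Matcher m = MD_INPUT.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String nameFirst =
                "<input name=\"MD\" type=\"hidden\" value=\"4326381105C3B67B2823E71FD235FFD2\">";
        String valueFirst =
                "<input value=\"4326381105C3B67B2823E71FD235FFD2\" type=\"hidden\" name=\"MD\">";
        System.out.println(extractMd(nameFirst));
        System.out.println(extractMd(valueFirst));
    }
}
```

Note that find() is used rather than matches(), because the pattern describes only part of the tag and the tag is only part of the page.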

[Update] Note that it's almost always better to use proper HTML parsers to parse HTML — that's what they're good for. In this case, using regular expressions is fine in my opinion. Just keep in mind the next guy who'll have to maintain your code and make a responsible decision.

Marius Schulz
1

As always, always, always when it comes to handling HTML: use a parser. Regex is not up to the task, for technical reasons explained to death in a well-known post.

Java has jsoup, and it is embarrassingly easy to create a small, simple and maintainable piece of code that does exactly what you need.

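import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// str holds the HTML source of the page
Document doc = Jsoup.parse(str);
Element input = doc.select("input[name='MD']").first();

if (input != null) {
    String value = input.attr("value");
    // now do something with it
}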

Now compare this three-liner with all those hairy regex answers. Think about how unmaintainable and unsafe they are, how much explanation they require, and how you would have to rewrite them from scratch when the HTML changes. Count in the time you spent trying to find a solution yourself, and decide whether regex is worth it when it comes to HTML.

Tomalak
0

As long as your element has only these attributes, it's not hard:

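    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class MdValueDemo {
        public static void main(String[] args) {
            // Any mix of whitespace and the three known attributes; group 1 captures the value
            Pattern p = Pattern.compile("<input(?:\\s+|name=\"MD\"|type=\"hidden\"|value=\"([^\"]+)\")+");
            Matcher m = p.matcher("<input name=\"MD\" type=\"hidden\" value=\"4326381105C3B67B2823E71FD235FFD2\">");
            if (m.find()) {
                System.out.println(m.group(1));
            }
        }
    }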
RokL
-1

In the end I solved this by using two patterns. First I look for a string like <input ... name='MD' ... /> with the pattern ".*?(<input[^<>]*name=\\\"MD\\\"[^<>]*>).*?", and then I look for the value within the resulting string with the pattern ".*?value=\\\"(.*?)\\\"".

Thank you, everyone, for the help.
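The two-step approach described above might look roughly like this in Java. It is a sketch rather than the exact code used: the class name TwoStepMdValue and the method extractMd are hypothetical, and the leading and trailing .*? parts of the patterns are dropped here on the assumption that Matcher.find() is used instead of matches().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the two-pattern approach.
class TwoStepMdValue {

    // Step 1: isolate the whole <input ...> tag that carries name="MD".
    private static final Pattern MD_TAG =
            Pattern.compile("<input[^<>]*name=\"MD\"[^<>]*>");

    // Step 2: pull the value attribute out of that single tag.
    private static final Pattern VALUE_ATTR =
            Pattern.compile("value=\"(.*?)\"");

    // Returns the value of the MD input, or null if the tag or attribute is missing.
    static String extractMd(String html) {
        Matcher tag = MD_TAG.matcher(html);
        if (!tag.find()) {
            return null;
        }
        Matcher value = VALUE_ATTR.matcher(tag.group());
        return value.find() ? value.group(1) : null;
    }

    public static void main(String[] args) {
        String html =
                "<input name=\"MD\" type=\"hidden\" value=\"4326381105C3B67B2823E71FD235FFD2\">"
              + "<input value=\"eJzVWFmvo0iy\" name=\"PaRes\" type=\"hidden\">";
        System.out.println(extractMd(html));
    }
}
```

Because the second pattern only ever sees one tag, attribute order inside that tag no longer matters.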

RChugunov
  • 2
    And now imagine that the site adds some HTML5 tricks and the element gets a new attribute... ``... Now your regexp can extract the "true" instead of the value you want. Also note that the site might change the quotes from `"` double quotes to `'` single quotes. Your regexp does not expect that. – SWilk Oct 21 '13 at 10:25
  • 1
    If you really, really want a regexp, insert `\s` before the attribute names, capture the quote character used, and backreference it for the closing quote. Create a daily job that checks the site for compliance with your regexp and notifies you when the site has changed. Because it will at some point. – SWilk Oct 21 '13 at 10:28
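The hardening suggested in the last comment could be sketched in Java as follows. Everything here is an assumption made for illustration: the class name QuoteAwareMdValue, the method extractMd, and the data-value attribute in the sample, which merely stands in for whatever new attribute the site might add. The pattern requires whitespace before name and value, captures the opening quote, and backreferences it for the closing quote.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper showing \s before attribute names plus quote backreferences.
class QuoteAwareMdValue {

    // Group 1: quote used around MD (inside the lookahead).
    // Group 2: quote used around the value. Group 3: the value itself.
    private static final Pattern MD_INPUT = Pattern.compile(
            "<input(?=[^>]*\\sname=([\"'])MD\\1)[^>]*\\svalue=([\"'])(.*?)\\2");

    // Returns the value of the MD input, or null if there is no match.
    static String extractMd(String html) {
        Matcher m = MD_INPUT.matcher(html);
        return m.find() ? m.group(3) : null;
    }

    public static void main(String[] args) {
        // data-value is a made-up decoy attribute; it is skipped because
        // "value" must be preceded by whitespace, not by a hyphen.
        String html = "<input data-value=\"true\" "
                + "value='4326381105C3B67B2823E71FD235FFD2' type='hidden' name='MD'>";
        System.out.println(extractMd(html));
    }
}
```

This still breaks on unquoted attribute values or a `>` inside an attribute, which is why the comments keep pointing towards a real parser.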