POSIX shell compliant method to find matching line with HTML element by name, then extract HTML element value

Question

Input Data Source

<!DOCTYPE html>
<html lang='en'>

<head>
  <meta charset='utf-8'>
</head>

<body>
  <main>
    <div class='wrapper'>
        <div class='float1'>
          <form id="form1" action="/endpoint" method="post">
            <input name="input1" type="hidden" value="value1" />
            <fieldset>
              <input id="input2" name="input2" value="value2">
              <input id="input3" name="input3" value="value3">
            </fieldset>
          </form>
        </div>
      </div>
  </main>

  <footer>
  </footer>

</body>

</html>

Output required

value1
value2
value3

Logic required

Find input element with name equal to "input1"
For this element, extract value contents

Preferences

sed or awk would be the preferred answer; I am unsure of any other POSIX-compliant method that could parse HTML.
The command should be reusable, so that multiple shell variables can be set with the same command (using a different element name each time).

Gilles Quénot · Accepted Answer · 2019-12-17T15:42:14.310


Please never use sed or awk to parse HTML or XML; use a proper HTML parser instead.

xmllint --html --xpath \
'string(//form[@id="form1"]/input[@name="input1"]/@value)' file 2>/dev/null

Output

value1
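To meet the reusability preference from the question, the query can be wrapped in a small shell function. This is only a sketch and not part of the original answer: the function name `get_input_value` is hypothetical, it assumes an `xmllint` build that supports `--html` and `--xpath`, it matches the input anywhere in the document rather than only inside `form1`, and it assumes the element name contains no double quotes.

```shell
#!/bin/sh
# Hypothetical helper: print the value attribute of the <input> element
# whose name attribute equals $1, read from the HTML file $2.
# Assumption: $1 contains no double quotes, since it is pasted into the XPath.
get_input_value() {
    # --html tolerates non-XML markup; parser warnings on stderr are discarded.
    xmllint --html --xpath "string(//input[@name=\"$1\"]/@value)" "$2" 2>/dev/null
}
```

Each shell variable can then be set with the same command, for example `input1=$(get_input_value input1 file)` and `input2=$(get_input_value input2 file)`.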

Edit after OP comments, using broken HTML

xidel -s --xpath \
'//form[@id="form1"]//input[starts-with(@name, "input")]/@value' file

Output

value1
value2
value3
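If only one element is wanted per call, the same xidel query can be parameterised. The sketch below is an assumption layered on the command above: the function name `form_input_value` is hypothetical, the argument order mirrors the invocation shown in this answer, and the form id and element name are assumed to contain no double quotes.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the <input> named $2
# inside the <form> whose id is $1, read from the HTML file $3.
# Assumption: $1 and $2 contain no double quotes (they are pasted into the XPath).
form_input_value() {
    xidel -s --xpath "//form[@id=\"$1\"]//input[@name=\"$2\"]/@value" "$3"
}
```

A call such as `input3=$(form_input_value form1 input3 file)` would then store the single matching value in a shell variable.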

Don't parse XML/HTML with regex, use a proper XML/HTML parser and a powerful xpath query.

theory :

According to compiler theory, XML/HTML can't be parsed with regular expressions, because a regex is equivalent to a finite-state machine. XML/HTML is hierarchical (elements nest to arbitrary depth), so you need at least a pushdown automaton, for example a parser built from an LALR grammar with a tool like YACC.
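Apart from the theoretical limit, line-oriented patterns are fragile in everyday use. The following sketch is an illustration, not something to deploy: the function name `naive_value` is hypothetical, and its pattern assumes that `name` comes before `value` on a single line. It matches the markup from the question but silently returns nothing once the attributes are reordered, although both lines describe the same element.

```shell
#!/bin/sh
# Hypothetical naive extractor: assumes name="..." precedes value="..."
# on one line. Reads HTML on stdin; $1 is the element name.
naive_value() {
    sed -n 's/.*name="'"$1"'"[^>]*value="\([^"]*\)".*/\1/p'
}

# Attribute order as in the question: the pattern matches.
printf '%s\n' '<input name="input1" type="hidden" value="value1" />' | naive_value input1

# Same element with the attributes swapped: the pattern prints nothing.
printf '%s\n' '<input value="value1" name="input1" type="hidden" />' | naive_value input1
```

An XPath query such as `//input[@name="input1"]/@value` does not depend on attribute order or line breaks, which is why a parser is the more robust choice.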

realLife©®™ everyday tool in a shell :

You can use one of the following :

xmllint often installed by default with libxml2, xpath1 (check my wrapper to have newline-delimited output)

xmlstarlet can edit, select, transform... Not installed by default, xpath1

xpath installed via perl's module XML::XPath, xpath1

xidel xpath3

saxon-lint my own project, wrapper over @Michael Kay's Saxon-HE Java library, xpath3

Or you can use a high-level language with a proper library, for example:

python's lxml (from lxml import etree)

perl's XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath

ruby nokogiri, check this example

php DOMXpath, check this example

Check: Using regular expressions with HTML tags

edited Dec 17 '19 at 15:42

answered Dec 17 '19 at 00:02

Gilles Quénot


It lints the file, it doesn't show an output result. There are lots of "parser error" about "Opening and ending tag mismatch" and a few "xmlParseEntityRef: no name" errors also. Even trying to explicitly state the path in the command still results in linting errors `xmllint --xpath 'string(/html/body/main/div/div/div/div/div/form[@id="form1"]/input[@name="input1"]/@value)' file`. Although explicitly stating the path would be an unsatisfactory command, as the file could dynamically change – seafre Dec 17 '19 at 13:56
Adding the html option `xmllint --html --xpath` resolves all but 2 linting errors, which are `Tag main invalid` and `Tag footer invalid`. I still cannot make this resolve, so I will update the Data Source in the original question to reflect the scenario more accurately – seafre Dec 17 '19 at 15:11
POST edited accordingly, added `xidel` way for broken HTML – Gilles Quénot Dec 17 '19 at 15:42

xidel worked, although this adds an additional package required to the OS beyond initial installation in POSIX environment. Agree this is a more manageable solution than sed or awk. – seafre Dec 17 '19 at 23:06
