POSIX shell compliant method to find matching line with HTML element by name, then extract HTML element value

Input Data Source

<!DOCTYPE html>
<html lang='en'>

<head>
  <meta charset='utf-8'>
</head>

<body>
  <main>
    <div class='wrapper'>
        <div class='float1'>
          <form id="form1" action="/endpoint" method="post">
            <input name="input1" type="hidden" value="value1" />
            <fieldset>
              <input id="input2" name="input2" value="value2">
              <input id="input3" name="input3" value="value3">
            </fieldset>
          </form>
        </div>
      </div>
  </main>

  <footer>
  </footer>

</body>

</html>

Output required

value1
value2
value3

Logic required

  1. Find input element with name equal to "input1"
  2. For this element, extract value contents

Preferences

  • sed or awk would be the preferred answer. I am unsure of any other POSIX-compliant method that could parse HTML.
  • The command should be reusable, so multiple shell variables can use the same command (with a different element name).
seafre

1 Answer

Please, never ever use sed or awk to parse HTML or XML; use a proper parser.

xmllint --xpath \
'string(//form[@id="form1"]/input[@name="input1"]/@value)' file

Output

value1
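
The sample input is HTML rather than well-formed XML (some input elements are never closed), so the XML parser may reject it. A minimal sketch of the same query using xmllint's HTML parser follows. The temporary file name and the cut-down sample are assumptions for illustration, and stderr is discarded because the HTML parser may warn about tags such as main and footer:

```shell
#!/bin/sh
# Hypothetical cut-down copy of the question's input, written to a temp file.
sample="${TMPDIR:-/tmp}/sample.$$.html"
cat > "$sample" <<'EOF'
<html><body><main>
  <form id="form1" action="/endpoint" method="post">
    <input name="input1" type="hidden" value="value1" />
    <fieldset>
      <input id="input2" name="input2" value="value2">
    </fieldset>
  </form>
</main></body></html>
EOF

# --html selects libxml2's HTML parser; 2>/dev/null hides parser warnings.
# //input is used instead of /input so the query survives extra nesting.
value1=$(xmllint --html --xpath \
  'string(//form[@id="form1"]//input[@name="input1"]/@value)' \
  "$sample" 2>/dev/null)

printf '%s\n' "$value1"
rm -f "$sample"
```

Whether the warnings appear at all depends on the installed libxml2 version.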

Edit after OP comments, using broken HTML

xidel -s --xpath \
'//form[@id="form1"]//input[starts-with(@name, "input")]/@value' file

Output

value1
value2
value3
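
To cover the reusability preference from the question, the xidel call can be wrapped in a small shell function. This is only a sketch: the function name get_input_value, its argument order, and the sample file are assumptions, and the element name is interpolated into the XPath as-is, so it must not contain quote characters.

```shell
#!/bin/sh
# get_input_value NAME FILE
# Prints the value attribute of every <input> whose name attribute equals NAME.
# Hypothetical helper; NAME is pasted into the XPath unescaped.
get_input_value() {
  xidel -s --xpath "//input[@name='$1']/@value" "$2"
}

# Hypothetical cut-down copy of the question's input.
sample="${TMPDIR:-/tmp}/sample.$$.html"
cat > "$sample" <<'EOF'
<form id="form1" action="/endpoint" method="post">
  <input name="input1" type="hidden" value="value1" />
  <fieldset>
    <input id="input2" name="input2" value="value2">
    <input id="input3" name="input3" value="value3">
  </fieldset>
</form>
EOF

# One shell variable per element name, all using the same command.
value1=$(get_input_value input1 "$sample")
value2=$(get_input_value input2 "$sample")
value3=$(get_input_value input3 "$sample")

printf '%s\n' "$value1" "$value2" "$value3"
rm -f "$sample"
```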

Don't parse XML/HTML with regex, use a proper XML/HTML parser and a powerful query.

Theory:

According to compiler theory, XML/HTML can't be parsed with regular expressions, which are based on finite state machines. Because of the hierarchical structure of XML/HTML, you need a pushdown automaton and an LALR grammar, using a tool like YACC.
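
As a practical illustration (not a proof of the nesting argument above), the sketch below shows how a line-oriented sed pattern breaks on markup that a parser treats as identical. The function name naive_value and the test strings are made up for this example:

```shell
#!/bin/sh
# Naive extraction: assumes name="..." comes before value="..." inside the
# same tag and on the same line. Hypothetical helper for illustration only.
naive_value() {
  sed -n 's/.*name="'"$1"'"[^>]*value="\([^"]*\)".*/\1/p'
}

# Attribute order as in the question: the pattern matches.
a=$(printf '%s\n' '<input name="input1" type="hidden" value="value1" />' |
  naive_value input1)

# Same element with the attributes swapped: nothing is found.
b=$(printf '%s\n' '<input value="value1" name="input1" />' |
  naive_value input1)

# Same element split across two lines: sed works line by line, so no match.
c=$(printf '%s\n' '<input name="input1"' '       value="value1" />' |
  naive_value input1)

printf 'a=[%s] b=[%s] c=[%s]\n' "$a" "$b" "$c"
# prints: a=[value1] b=[] c=[]
```

All three inputs describe the same element, yet only the first yields a value, and the failures are silent.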

realLife©®™ everyday tools in a shell:

You can use one of the following:

xmllint: often installed by default with libxml2, xpath1 (check my wrapper to have newline-delimited output)

xmlstarlet: can edit, select, transform... Not installed by default, xpath1

xpath: installed via Perl's module XML::XPath, xpath1

xidel: xpath3

saxon-lint: my own project, a wrapper over @Michael Kay's Saxon-HE Java library, xpath3

Or you can use high-level languages and proper libraries. I am thinking of:

Python's lxml (from lxml import etree)

Perl's XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath

Ruby's Nokogiri, check this example

PHP's DOMXPath, check this example


Check: Using regular expressions with HTML tags

Gilles Quénot
  • It lints the file, it doesn't show an output result. There are lots of "parser error" about "Opening and ending tag mismatch" and a few "xmlParseEntityRef: no name" errors also. Even trying to explicitly state the path in the command still results in linting errors `xmllint --xpath 'string(/html/body/main/div/div/div/div/div/form[@id="form1"]/input[@name="input1"]/@value)' file`. Although explicitly stating the path would be an unsatisfactory command, as the file could dynamically change – seafre Dec 17 '19 at 13:56
  • Adding the html option `xmllint --html --xpath` resolves all but 2 linting errors, which are `Tag main invalid` and `Tag footer invalid`. I still cannot make this resolve; I will update the Data Source in the original question to be more accurate to the scenario – seafre Dec 17 '19 at 15:11
  • Post edited accordingly, added the `xidel` way for broken HTML – Gilles Quénot Dec 17 '19 at 15:42
  • xidel worked, although this adds an additional package required to the OS beyond initial installation in POSIX environment. Agree this is a more manageable solution than sed or awk. – seafre Dec 17 '19 at 23:06