Use SED to extract value of all input elements with a certain name

Question

How do I get the value attribute based on a search of some other attribute?

For example:

<body>
<input name="dummy" value="foo">
<input name="alpha" value="bar">
</body>

How do I get the value of the input element with the name "dummy"?

You can get it with this command. sed -n 's/.*input name="dummy" value="\([^"]*\)".*/\1/p' But for this job, a html/xml parser is the right tool — ramana_k, Oct 13 '15 at 17:01

score 4 · Answer 1 · edited May 23 '17 at 12:32

Since you're looking for a solution using bash and sed, I'm assuming you're looking for a Linux command line option.

Use `hxselect` html parsing tool to extract element; use `sed` to extract value from element

I did a Google search for "linux bash parse html tool" and came across this: https://unix.stackexchange.com/questions/6389/how-to-parse-hundred-html-source-code-files-in-shell

The accepted answer suggests using the hxselect tool from the html-xml-utils package, which extracts elements based on a CSS selector. After installing it (download, unzip, ./configure, make, make install), you can run this command with the appropriate CSS selector:

hxselect "input[name='dummy']" < example.html

(Given that example.html contains your example html from the question.) This will return:

<input name="dummy" value="foo"/>

Almost there. We need to extract the value from that line:

hxselect "input[name='dummy']" < example.html | sed -n -e "s/^.*value=['\"]\(.*\)['\"].*/\1/p"

Which returns "foo".
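
For comparison, the second stage of that pipeline (pulling the value out of the single element that hxselect printed) can be sketched in Python. This is only an illustration and not part of the original answer: the helper name `extract_value` is made up, and the pattern assumes the value does not contain the same kind of quote that delimits it.

```python
import re

# Hypothetical helper: mirrors the sed substitution above, but matches the
# closing quote against the opening one instead of using a greedy ".*".
VALUE_RE = re.compile(r"""value=(['"])(.*?)\1""")

def extract_value(element):
    """Return the value attribute of one serialized element, or None."""
    match = VALUE_RE.search(element)
    return match.group(2) if match else None

if __name__ == "__main__":
    print(extract_value('<input name="dummy" value="foo"/>'))
```

Because the closing quote is tied to the opening one, the match stops at the end of the value even when other attributes follow it on the same line.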

why you would / would not want to use this approach

Using regex to parse out the attributes is complicated, and often the wrong way to go.
The hxselect tool (in my other answer) is a pain to install.
BUT, this approach accepts malformed HTML, which is what is argued for in this answer to the question linked above. By the way, that question has a very thorough discussion of the regex+HTML debate.

I came back after the fact and I really don't think this is a good answer - it's awkward and too complicated, and doesn't follow @Ramana's advice because it's still parsing the element attributes with SED. I did some more research and answered again, with a different approach — alexanderbird, Oct 21 '15 at 16:55

score 2 · Answer 2 · edited May 23 '17 at 12:09

Since you're asking for SED, I'll assume you want a command line option. However, a tool built for HTML parsing may be more effective. The problem with my first answer is that I don't know of a way in CSS to select the value of an attribute (does anyone else?). With XML, however, you can select attributes much as you select elements. Here is a command line option that uses an XML parsing tool.

Treat it as XML; use XPATH

1. Install xmlstarlet with your package manager.
2. Run xmlstarlet sel -t -v "//input[@name='dummy']/@value" example.html (where example.html contains your HTML).
3. If your HTML isn't valid XML, follow the warnings from xmlstarlet to make the necessary changes (in this case, <input> must be changed to <input/>).
4. Run the command again. Returns: foo
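
The same XPath idea can be sketched with Python's standard library, assuming (as with xmlstarlet) that the input is already well-formed XML. `xml.etree.ElementTree` supports only a subset of XPath, so this sketch selects the elements and then reads the attribute from each. The function name `values_by_name` is illustrative, and the name passed in is assumed not to contain a single quote.

```python
import xml.etree.ElementTree as ET

def values_by_name(xml_text, name):
    """Return the value attribute of every <input> with the given name."""
    root = ET.fromstring(xml_text)
    # ElementTree has no "/@value" step, so select the elements first ...
    matches = root.findall(f".//input[@name='{name}']")
    # ... then read the attribute from each (None if it is absent).
    return [element.get("value") for element in matches]

if __name__ == "__main__":
    doc = """<body>
    <input name="dummy" value="foo"/>
    <input name="alpha" value="bar"/>
    </body>"""
    print(values_by_name(doc, "dummy"))
```

As with xmlstarlet, an unclosed <input> makes the parse fail (here with `ET.ParseError`), so the well-formedness requirement still applies.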

why you might/might not use this approach

It is much simpler and more robust than hand-rolling a regex HTML parser, but
it requires well-formed HTML.

This comes with well too many warnings for my html. – Petar Vasilev, Aug 22 '22 at 13:18

score 1 · Answer 3 · answered Oct 13 '15 at 17:14

Parsing HTML with sed is generally a bad idea, since sed works in a line-based manner and HTML does not usually consider newlines syntactically important. It's not good if your HTML-handling tools break when the HTML is reformatted.
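
To make the line-based problem concrete, here is a small Python sketch that applies a per-line pattern (modeled on the sed command from the comment under the question) to the same element written two ways. The helper name `line_based_values` and the sample strings are illustrative only.

```python
import re

# Same idea as the sed command: the whole tag must sit on one line.
LINE_RE = re.compile(r'input name="dummy" value="([^"]*)"')

def line_based_values(html):
    """Apply the pattern to each line separately, as sed would."""
    found = []
    for line in html.splitlines():
        match = LINE_RE.search(line)
        if match:
            found.append(match.group(1))
    return found

one_line = '<input name="dummy" value="foo">'
reformatted = '<input\n  name="dummy"\n  value="foo">'

if __name__ == "__main__":
    print(line_based_values(one_line))      # the tag is on one line
    print(line_based_values(reformatted))   # same tag, split across lines
```

The second call finds nothing, even though both strings describe the same element.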

Instead, consider using Python, which has an HTML push parser in its standard library. For example:

#!/usr/bin/env python3

# In Python 3 the parser lives in html.parser (it was the HTMLParser
# module in Python 2).
from html.parser import HTMLParser
from sys import argv

# Our parser. It inherits the standard HTMLParser that does most of
# the work.
class MyParser(HTMLParser):
    # We just hook into the handling of start tags to extract the
    # attribute
    def handle_starttag(self, tag, attrs):
        # Build a dictionary from the attribute list for easier
        # handling
        attrs_dict = dict(attrs)

        # Then, if the tag matches our criteria
        if tag == 'input' and attrs_dict.get('name') == 'dummy':
            # Print the value attribute (or an empty string if it
            # is missing or has no value)
            print(attrs_dict.get('value') or "")

# After we defined the parser, all that's left is to use it. So,
# build one:
p = MyParser()

# And feed a file to it (here: the first command line argument).
# feed() expects text, so the file is opened in text mode; UTF-8 is
# assumed here.
with open(argv[1], encoding='utf-8') as f:
    p.feed(f.read())

Save this code as, say, foo.py, then run

python3 foo.py foo.html

where foo.html is your HTML file.
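
If the values are needed inside another program rather than on standard output, a variant of the same parser can collect them in a list. This is a sketch for Python 3 (where the module is `html.parser`). The class name `InputValueCollector` and the helper `values_for` are made up for illustration.

```python
from html.parser import HTMLParser

class InputValueCollector(HTMLParser):
    """Collect the value of every <input> whose name matches."""

    def __init__(self, name):
        super().__init__()
        self.name = name
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs_dict = dict(attrs)
        if tag == "input" and attrs_dict.get("name") == self.name:
            # A missing or valueless attribute is recorded as "".
            self.values.append(attrs_dict.get("value") or "")

def values_for(html, name):
    """Parse the given HTML text and return the collected values."""
    parser = InputValueCollector(name)
    parser.feed(html)
    parser.close()
    return parser.values

if __name__ == "__main__":
    sample = ('<body>\n<input name="dummy" value="foo">\n'
              '<input name="alpha" value="bar">\n</body>')
    print(values_for(sample, "dummy"))
```

Unlike a per-line pattern, this does not depend on how the tag is laid out, and it does not require the <input> elements to be closed.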
