How do I get the value attribute based on a search of some other attribute?
For example:
<body>
<input name="dummy" value="foo">
<input name="alpha" value="bar">
</body>
How do I get the value of the input element with the name "dummy"?
Since you're looking for a solution using bash and sed, I'm assuming you're looking for a Linux command line option.
Use the hxselect HTML parsing tool to extract the element, then use sed to extract the value from the element.
I did a Google search for "linux bash parse html tool" and came across this: https://unix.stackexchange.com/questions/6389/how-to-parse-hundred-html-source-code-files-in-shell
The accepted answer suggests using the hxselect tool from the html-xml-utils package, which extracts elements based on a CSS selector.
So after installing it (download, unzip, ./configure, make, make install), you can run this command using the given CSS selector:
hxselect "input[name='dummy']" < example.html
(Given that example.html contains your example html from the question.) This will return:
<input name="dummy" value="foo"/>
Almost there. We need to extract the value from that line:
hxselect "input[name='dummy']" < example.html | sed -n -e "s/^.*value=['\"]\(.*\)['\"].*/\1/p"
Which returns "foo".
Since you're asking for sed, I'll assume you want a command line option. However, a tool built for HTML parsing may be more effective. The problem with my first answer is that I don't know of a way in CSS to select the value of an attribute (does anyone else?). With XML, however, you can select attributes just as you can select elements. Here is a command line option that uses an XML parsing tool.
Install xmlstarlet with your package manager, then run:
xmlstarlet sel -t -v "//input[@name='dummy']/@value" example.html
(where example.html contains your HTML; note that <input> must be changed to <input/> so that the file is well-formed XML). This returns:
foo
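For comparison, the same attribute lookup can be sketched with Python's standard xml.etree.ElementTree module, which supports a limited XPath subset including [@attrib='value'] predicates. This is only an illustration, not part of the xmlstarlet answer: the helper name value_by_name and the inline sample document are assumptions made for the example, and the input is assumed to be well-formed XML (self-closing <input/> tags), just as with xmlstarlet.

```python
import xml.etree.ElementTree as ET

# Sample document from the question, with <input> written as
# self-closing <input/> so that it is well-formed XML.
DOC = """<body>
<input name="dummy" value="foo"/>
<input name="alpha" value="bar"/>
</body>"""


def value_by_name(xml_text, name):
    """Return the value attribute of the first <input> with the given
    name attribute, or None if no such element exists.

    Hypothetical helper for illustration; `name` is assumed not to
    contain a single quote.
    """
    root = ET.fromstring(xml_text)
    # ElementTree's XPath subset allows attribute predicates on elements,
    # but not selecting the attribute itself, so read it with .get().
    node = root.find(".//input[@name='{}']".format(name))
    return None if node is None else node.get("value")


print(value_by_name(DOC, "dummy"))  # prints: foo
```

Unlike the xmlstarlet expression, ElementTree cannot return the attribute node directly, so the sketch selects the element and then reads its value attribute.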
Parsing HTML with sed is generally a bad idea, since sed works in a line-based manner and HTML does not usually consider newlines syntactically important. It's not good if your HTML-handling tools break when the HTML is reformatted.
Instead, consider using Python, which has an HTML push parser in its standard library. For example:
#!/usr/bin/env python3
from html.parser import HTMLParser
from sys import argv

# Our parser. It inherits from the standard HTMLParser, which does most of
# the work.
class MyParser(HTMLParser):
    # We just hook into the handling of start tags to extract the
    # attribute
    def handle_starttag(self, tag, attrs):
        # Build a dictionary from the attribute list for easier
        # handling
        attrs_dict = dict(attrs)
        # Then, if the tag matches our criteria
        if tag == 'input' \
                and 'name' in attrs_dict \
                and attrs_dict['name'] == 'dummy':
            # Print the value attribute (or an empty string if it
            # doesn't exist)
            print(attrs_dict['value'] if 'value' in attrs_dict else "")

# After we defined the parser, all that's left is to use it. So,
# build one:
p = MyParser()
# And feed a file to it (here: the first command line argument).
# The file is opened in text mode because feed() expects a string.
with open(argv[1], 'r') as f:
    p.feed(f.read())
Save this code as, say, foo.py, then run
python foo.py foo.html
where foo.html is your HTML file.
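To illustrate the point about reformatting made at the start of this answer, here is a small sketch that compares a line-based regular expression with the parser approach on a copy of the example where one tag has been wrapped across two lines. The names REFORMATTED, value_line_based, ValueFinder and value_parsed are made up for this example, and the regular expression is only a stand-in for what a line-based tool such as sed would do.

```python
import re
from html.parser import HTMLParser

# The example from the question, with the first <input> tag wrapped
# across two lines. This is still the same HTML document.
REFORMATTED = """<body>
<input name="dummy"
       value="foo">
<input name="alpha" value="bar">
</body>"""


def value_line_based(html, name):
    """Line-by-line regex search, similar in spirit to a sed one-liner."""
    pattern = re.compile(r'name="%s".*value="([^"]*)"' % re.escape(name))
    for line in html.splitlines():
        match = pattern.search(line)
        if match:
            return match.group(1)
    return None


class ValueFinder(HTMLParser):
    """Remembers the value attribute of the first matching <input>."""

    def __init__(self, name):
        super().__init__()
        self.name = name
        self.found = None

    def handle_starttag(self, tag, attrs):
        attrs_dict = dict(attrs)
        if (tag == "input" and self.found is None
                and attrs_dict.get("name") == self.name):
            self.found = attrs_dict.get("value", "")


def value_parsed(html, name):
    """Parser-based lookup; line breaks inside a tag do not matter."""
    finder = ValueFinder(name)
    finder.feed(html)
    return finder.found


print(value_line_based(REFORMATTED, "dummy"))  # prints: None
print(value_parsed(REFORMATTED, "dummy"))      # prints: foo
```

The line-based search misses the wrapped tag because name and value no longer share a line, while the parser sees a single start tag with both attributes.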