Manipulate, process HTML from command line

Question

I am looking for a way to process HTML from the command line (probably using XPath).

For example, I want to remove the <h2> inside the .container element, or add a new <div> inside it.

Input:

<div class="bg-detail2" id="geometry">
    <div class="container">
        <h2>Title</h2>
        <div class="line"></div>
        <div class="fix"></div>
        <div class="col50">
            Content
        </div>
        <div class="col50">
            Another Content
        </div>
    </div>
</div>

Output:

<div class="bg-detail2" id="geometry">
    <div class="container">
      <div class="newdiv">
        <div class="line"></div>
        <div class="fix"></div>
        <div class="col50">
            Content
        </div>
        <div class="col50">
            Another Content
        </div>
      </div>
    </div>
</div>

My first idea was to use sed, but that is not a bulletproof method. I know xmllint, but it can only read HTML files.

Is there any other command-line tool available?

https://www.technomancy.org/xml/add-a-subnode-command-line-xmlstarlet/ — LMC, Feb 20 '19 at 16:02

wizzwizz4 · Answer 1 · 2019-02-23T15:46:31.850

I couldn't find a program to do what you wanted. So I made one. And now it works!

#!/usr/bin/env python3

from html.parser import HTMLParser

class HTMLPass(HTMLParser):
    def __init__(self, *a, convert_charrefs=False, **k):
        super().__init__(*a, convert_charrefs=convert_charrefs, **k)

    def handle_starttag(self, tag, attrs):
        print(end=self.get_starttag_text())

    @staticmethod
    def handle_endtag(tag):
        print(end="</" + tag + ">")

    handle_startendtag = handle_starttag

    @staticmethod
    def handle_data(data):
        print(end=data)

    @staticmethod
    def handle_entityref(name):
        print(end="&"+name+";")

    @staticmethod
    def handle_charref(name):
        print(end="&#"+name+";")

    @staticmethod
    def handle_comment(data):
        print(end="<!--"+data+"-->")

    @staticmethod
    def handle_decl(decl):
        print(end="<!"+decl+">")

    @staticmethod
    def handle_pi(data):
        print(end="<?"+data+">")

    unknown_decl = handle_decl

class HTMLPassMod(HTMLPass):
    def __init__(self, *a, argv=None, **k):
        super().__init__(*a, **k)
        self.stack = []
        self.args = []
        if argv is None:
            import sys
            argv = sys.argv[1:]
        for arg in argv:
            # Horrible string parsing
            # Should turn "/a#link-1.external/d" into
            # [d, ['a', ('id', 'link-1'), ('class', 'external')]]
            sel, act = arg[1:].split(arg[0])
            self.args.append([act])
            for selector in sel.split(">"):
                self.args[-1].append([])
                selector = selector.strip()
                if "." not in selector and "#" not in selector:
                    self.args[-1][-1].append(selector)
                    continue
                if "." not in selector:
                    self.args[-1][-1][:] = selector.split("#")
                    self.args[-1][-1][1:] = zip(["id"]*(len(self.args[-1][-1])-1), self.args[-1][-1][1:])
                    continue
                if "#" not in selector:
                    self.args[-1][-1][:] = selector.split(".")
                    self.args[-1][-1][1:] = zip(["class"]*(len(self.args[-1][-1])-1), self.args[-1][-1][1:])
                    continue
                if selector.index(".") < selector.index("#"):
                    tag, selector = selector.split(".", maxsplit=1)
                    selector = "." + selector
                else:
                    tag, selector = selector.split("#", maxsplit=1)
                    selector = "#" + selector
                self.args[-1][-1].append(tag)
                while selector:
                    if "#" not in selector:
                        self.args[-1][-1].extend(zip(["class"]*len(selector), selector.split(".")))
                        break
                    if "." not in selector:
                        self.args[-1][-1].extend(zip(["id"]*len(selector), selector.split("#")))
                        break
                    if selector[0] == ".":
                        if "." not in selector[1:] or selector.index("#") < selector.index("."):
                            axa, selector = selector[1:].split("#", maxsplit=1)
                        else:
                            axa, selector = selector[1:].split(".", maxsplit=1)
                        self.args[-1][-1].append(("class", axa))
                    else:
                        if "#" not in selector[1:] or selector.index(".") < selector.index("#"):
                            axa, selector = selector[1:].split(".", maxsplit=1)
                        else:
                            axa, selector = selector[1:].split("#", maxsplit=1)
                        self.args[-1][-1].append(("id", axa))

    def handle_starttag(self, tag, attrs):
        if self.stack and self.stack[-1][2] is not None and self.stack[-1][2][0] == 'k':
            # kill means kill
            self.stack.append((tag, attrs, None))
            return
        self.stack.append((tag, attrs, None))
        for arg in self.args:
            for frame, a in zip(self.stack[::-1], arg[:0:-1]):
                a_tag = a[0].replace("*", "").strip()
                if a_tag and frame[0] != a_tag:
                    break
                for attr, val in frame[1]:
                    if attr == "class":
                        frame_classes = val.split()
                        break
                else:
                    frame_classes = []
                for attr, val in a[1:]:
                    if attr == "class":
                        if val not in frame_classes:
                            break
                    else:
                        for a, v in frame[1]:
                            if a == attr and v == val:
                                break
                        else:
                            break
                else:
                    continue
                break
            else:
                self.stack[-1] = (tag, attrs, arg[0])
                if arg[0][0] in "drk":  # delete / replace / kill
                    if arg[0][0] == "r":
                        print(end=arg[0][1:])
                    return
                if arg[0][0] == "i":  # insert (inside / after)
                    super().handle_starttag(tag, attrs)
                    print(end=arg[0][2:].split(arg[0][1])[0])
                break
        else:
            super().handle_starttag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)
        self.stack.pop()

    def handle_endtag(self, tag):
        if self.stack[-1][0] != tag:
            # TODO: Implement proper HTML-isn't-XML behaviour
            pass
        frame = self.stack.pop()
        if frame[2] is None:
            return super().handle_endtag(tag)
        if frame[2][0] in "drk":  # delete / replace / kill
            return
        if frame[2][0] == "i":
            super().handle_endtag(tag)
            print(end=frame[2][2:].split(frame[2][1])[1])

    def handle_data(self, data):
        if self.stack and self.stack[-1][2] is not None and self.stack[-1][2][0] == 'k':
            return
        super().handle_data(data)

    def handle_entityref(self, name):
        if self.stack and self.stack[-1][2] is not None and self.stack[-1][2][0] == 'k':
            return
        super().handle_entityref(name)

    def handle_charref(self, name):
        if self.stack and self.stack[-1][2] is not None and self.stack[-1][2][0] == 'k':
            return
        super().handle_charref(name)

    def handle_comment(self, data):
        if self.stack and self.stack[-1][2] is not None and self.stack[-1][2][0] == 'k':
            return
        super().handle_comment(data)

    def handle_decl(self, decl):
        if self.stack and self.stack[-1][2] is not None and self.stack[-1][2][0] == 'k':
            return
        super().handle_decl(decl)

    def handle_pi(self, data):
        if self.stack and self.stack[-1][2] is not None and self.stack[-1][2][0] == 'k':
            return
        super().handle_pi(data)

    def unknown_decl(self, data):
        if self.stack and self.stack[-1][2] is not None and self.stack[-1][2][0] == 'k':
            return
        super().unknown_decl(data)

def run(pass_through=HTMLPassMod):
    x = pass_through()
    while True:
        try:
            i = input()
        except EOFError:
            break
        x.feed(i + '\n')
    x.close()

if __name__ == "__main__":
    run()

This code is terrible, but it does work, including in many edge cases.

Example usage:

wizzwizz4@wizzwizz4Laptop:~$ cat example_input.html
<div class="bg-detail2" id="geometry">
    <div class="container">
        <h2>Title</h2>
        <div class="line"></div>
        <div class="fix"></div>
        <div class="col50">
            Content
        </div>
        <div class="col50">
            Another Content
        </div>
    </div>
</div>
wizzwizz4@wizzwizz4Laptop:~$ <example_input.html ./rubbish_program.py '~div.newdiv~r<h2>Title</h2>'
<div class="bg-detail2" id="geometry">
    <div class="container">
        <h2>Title</h2>
        <div class="line"></div>
        <div class="fix"></div>
        <div class="col50">
            Content
        </div>
        <div class="col50">
            Another Content
        </div>
    </div>
</div>
wizzwizz4@wizzwizz4Laptop:~$ cat example_input_2.html
<div class="bg-detail2" id="geometry">
    <div class="container">
        <h2>Title</h2>
        <div class="line"></div>
        <div class="fix"></div>
        <div class="col50">
            Content
        </div>
        <div class="col50">
            Another Content
        </div>
    </div>
</div>
wizzwizz4@wizzwizz4Laptop:~$ <example_input_2.html ./rubbish_program.py 'Jdiv.containerJi~<div class="newdiv">~</div>' '\.container > h2\k'
<div class="bg-detail2" id="geometry">
    <div class="container"><div class="newdiv">

        <div class="line"></div>
        <div class="fix"></div>
        <div class="col50">
            Content
        </div>
        <div class="col50">
            Another Content
        </div>
    </div></div>
</div>

Syntax

./rubbish_program.py [argument...]

where argument is of the form:

<separator><selector><separator><instruction>

where:

separator is a single character that must not appear in selector or instruction.
selector is a series of compound selectors such as tag.class1.class2#id.class3, separated by >. In each one, tag is optional, at most one #id is allowed, and any number of .class parts may appear. Example: div#geometry > .container > h2.
instruction is an instruction of the form:
```
<command><parameters>
```
where command is one of the following:
- d – removes the element without removing its children. Takes no parameters.
- r – replaces the start tag with parameters, and removes the end tag, without removing the element's children.
- i – has two separate behaviours, depending on whether the tag is self-closing.
  - If it is not self-closing, prefixes the contents with the first parameter and suffixes the contents with the second parameter.
  - If it is self-closing, inserts the first parameter immediately after the tag, and ignores subsequent parameters.
  parameters is of the form:
```
<separator2><first parameter><separator2><second parameter>[<separator2>discarded]
```
  separator2 must not occur in either parameter, and must be different from separator. It can have different values in separate invocations.
- k – removes the element and its children. Takes no parameters.
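
The argument format described above can be illustrated with a short, standalone Python sketch. This is not the parsing code from the program itself; the names `Instruction` and `parse_argument` are hypothetical and exist only for this example, which assumes the separators are used exactly as the syntax description requires.

```python
from typing import NamedTuple


class Instruction(NamedTuple):
    selector: str     # e.g. "div.container" or ".container > h2"
    command: str      # one of "d", "r", "i", "k"
    parameters: list  # list of strings; empty for "d" and "k"


def parse_argument(arg: str) -> Instruction:
    """Split one argument of the form <separator><selector><separator><instruction>."""
    separator, rest = arg[0], arg[1:]
    selector, instruction = rest.split(separator, maxsplit=1)
    command, raw = instruction[0], instruction[1:]
    if command == "i":
        # <separator2><first parameter><separator2><second parameter>[<separator2>discarded]
        separator2 = raw[0]
        parameters = raw[1:].split(separator2)[:2]
    elif command == "r":
        parameters = [raw]  # text that replaces the start tag
    else:
        parameters = []     # "d" and "k" take no parameters
    return Instruction(selector.strip(), command, parameters)


print(parse_argument('Jdiv.containerJi~<div class="newdiv">~</div>'))
print(parse_argument('\\.container > h2\\k'))
```

The two calls mirror the arguments used in the second usage example: the first yields the `i` command with two parameters, the second the `k` command with none.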

Ok, this is missing in your post! Please, add a shebang or some explanations. — F. Hauri - Give Up GitHub, Feb 23 '19 at 10:59

phil294 · Answer 2 · 2019-03-26T19:53:47.870

If avoidable, do not parse HTML with regular expressions.

Instead, try an HTML parser with node, Python etc.

If you have Docker installed, you can try this simple script:

docker run --rm -i phil294/jquery-jsdom '$("#geometry h2").remove(); $("#geometry").append("<div class=\"newdiv\"/>"); $("#geometry").prop("outerHTML")' <<< '
<div class="bg-detail2" id="geometry">
    <h2>Title</h2>
</div>

'

This demonstrates a simple remove and append, putting the power of jQuery in your hands. It uses jsdom with eval(). I hosted it here
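
For comparison, here is a minimal Python sketch of the same idea using only the standard library. It assumes the input is well-formed enough to be parsed as XML, which holds for the snippet in the question but not for arbitrary HTML; the function name `wrap_container` is made up for this example.

```python
import xml.etree.ElementTree as ET

SNIPPET = """<div class="bg-detail2" id="geometry">
    <div class="container">
        <h2>Title</h2>
        <div class="line"></div>
        <div class="fix"></div>
        <div class="col50">
            Content
        </div>
        <div class="col50">
            Another Content
        </div>
    </div>
</div>"""


def wrap_container(markup: str) -> str:
    """Remove <h2> from every div.container and wrap what is left in div.newdiv."""
    root = ET.fromstring(markup)  # raises ParseError if the markup is not well-formed XML
    # Collect matches first so the tree is not modified while iterating over it.
    containers = [el for el in root.iter("div")
                  if "container" in el.get("class", "").split()]
    for container in containers:
        for heading in container.findall("h2"):
            container.remove(heading)
        children = list(container)
        for child in children:
            container.remove(child)
        wrapper = ET.SubElement(container, "div", {"class": "newdiv"})
        wrapper.extend(children)
    # method="html" writes empty elements as <div></div> instead of <div />.
    return ET.tostring(root, encoding="unicode", method="html")


print(wrap_container(SNIPPET))
```

Whitespace in the result differs slightly from hand-written indentation, and real-world HTML with unclosed tags or named entities such as `&nbsp;` would need a proper HTML parser instead.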

score 0 · Answer 3 · answered Feb 15 '19 at 00:42

 sed 's/<div class="container">/&\n      <div class="newdiv">/g' file_input.css

This will work with sed, but as you say, it may not be bulletproof. It may also disturb your indentation, but if the markup is consistent throughout, you could use it.
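
To illustrate where a literal text match can fall short, the following Python sketch compares a regular expression equivalent to the sed pattern with a check based on the standard `html.parser` module. The variant strings and the names `ContainerFinder` and `parser_finds_container` are made up for this example.

```python
import re
from html.parser import HTMLParser

# The same literal text the sed command looks for.
PATTERN = re.compile(r'<div class="container">')


class ContainerFinder(HTMLParser):
    """Records whether any <div> start tag carries the class "container"."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "container" in classes:
            self.found = True


def parser_finds_container(markup: str) -> bool:
    finder = ContainerFinder()
    finder.feed(markup)
    finder.close()
    return finder.found


VARIANTS = [
    '<div class="container">',
    "<div class='container'>",
    '<div id="x" class="container wide">',
    '<DIV CLASS="container">',
]

for markup in VARIANTS:
    # Columns: literal match, parser match, markup
    print(bool(PATTERN.search(markup)), parser_finds_container(markup), markup)
```

Only the first variant matches the literal pattern, while the parser-based check recognises all four as a div with the class container.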

Κωλζαρ · Answer 4 · 2019-02-23T15:17:26.290

First, install this package:

sudo apt-get install html-xml-utils

There are 31 tools in this package. Here is a summary of what they do:

cexport - create a header file of exported declarations from a C file
hxaddid - add IDs to selected elements
hxcite - replace bibliographic references by hyperlinks
hxcite-mkbib - expand references and create bibliography
hxcopy - copy an HTML file while preserving relative links
hxcount - count elements and attributes in HTML or XML files
hxextract - extract selected elements
hxclean - apply heuristics to correct an HTML file
hxprune - remove marked elements from an HTML file
hxincl - expand included HTML or XML files
hxindex - create an alphabetically sorted index
hxmkbib - create bibliography from a template
hxmultitoc - create a table of contents for a set of HTML files
hxname2id - move some ID= or NAME= from A elements to their parents
hxnormalize - pretty-print an HTML file
hxnum - number section headings in an HTML file
hxpipe - convert XML to a format easier to parse with Perl or AWK
hxprintlinks - number links and add table of URLs at end of an HTML file
hxremove - remove selected elements from an XML file
hxtabletrans - transpose an HTML or XHTML table
hxtoc - insert a table of contents in an HTML file
hxuncdata - replace CDATA sections by character entities
hxunent - replace HTML predefined character entities by UTF-8
hxunpipe - convert output of pipe back to XML format
hxunxmlns - replace "global names" by XML Namespace prefixes
hxwls - list links in an HTML file
hxxmlns - replace XML Namespace prefixes by "global names"
asc2xml, xml2asc - convert between UTF-8 and entities
hxref - generate cross-references
hxselect - extract elements that match a (CSS) selector

These are the tools you need to manipulate an HTML or XML file.

Example hxprune:

hxprune -c container index.html > index2.html

You select the elements to prune by class, in this case with "-c container", and then pass the name of the file you want to manipulate. The ">" operator redirects the output of hxprune to another file. In that output, the .container branch is cut out of the HTML tree.
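
The effect described here, cutting out every element with a given class along with its children, can be sketched in Python with the standard `html.parser` module. This is only an illustration of the idea, not how hxprune is implemented; the names `Pruner` and `prune` are hypothetical, the sketch assumes properly nested markup, and it drops comments and doctype declarations for brevity.

```python
from html.parser import HTMLParser

# Void elements have no end tag, so they must not change the nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Pruner(HTMLParser):
    def __init__(self, css_class):
        super().__init__(convert_charrefs=False)
        self.css_class = css_class
        self.skip_depth = 0  # greater than 0 while inside a pruned element
        self.out = []

    def _matches(self, attrs):
        return self.css_class in (dict(attrs).get("class") or "").split()

    def handle_starttag(self, tag, attrs):
        if self.skip_depth or self._matches(attrs):
            if tag not in VOID:
                self.skip_depth += 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.skip_depth and not self._matches(attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID:
                self.skip_depth -= 1
            return
        self.out.append("</" + tag + ">")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append("&" + name + ";")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append("&#" + name + ";")


def prune(markup: str, css_class: str) -> str:
    parser = Pruner(css_class)
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


print(prune('<div id="a"><div class="container"><p>x</p></div><p>kept</p></div>',
            "container"))
```

As the comments on this answer point out, pruning alone removes content; inserting a new element into .container still needs a different tool.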

Could you explain? `hxprune` only seems to target classes, and doesn't seem powerful enough to perform the tasks required. An example of use would be useful. — wizzwizz4, Feb 23 '19 at 14:42
This still isn't enough to do what's described in the question. That'll just prune the matched element, not insert content into it. — wizzwizz4, Feb 23 '19 at 15:21
