3

I am looking for a way to process HTML code from the command line (probably using XPath).

For example, I want to remove the <h2> inside the .container element, or add a new <div> right after the opening tag of .container.

Input:

<div class="bg-detail2" id="geometry">
    <div class="container">
        <h2>Title</h2>
        <div class="line"></div>
        <div class="fix"></div>
        <div class="col50">
            Content
        </div>
        <div class="col50">
            Another Content
        </div>
    </div>
</div>

Output:

<div class="bg-detail2" id="geometry">
    <div class="container">
      <div class="newdiv">
        <div class="line"></div>
        <div class="fix"></div>
        <div class="col50">
            Content
        </div>
        <div class="col50">
            Another Content
        </div>
      </div>
    </div>
</div>

My first idea is to use sed, but it is not a bulletproof method. I know xmllint, but it can only read HTML files.

Is there any other tool available for the command line?

Tad Lispy
Adrian

4 Answers

2

I couldn't find a program to do what you wanted. So I made one. And now it works!

#!/usr/bin/env python3

from html.parser import HTMLParser

class HTMLPass(HTMLParser):
    def __init__(self, *a, convert_charrefs=False, **k):
        super().__init__(*a, convert_charrefs=convert_charrefs, **k)

    def handle_starttag(self, tag, attrs):
        print(end=self.get_starttag_text())

    @staticmethod
    def handle_endtag(tag):
        print(end="</" + tag + ">")

    handle_startendtag = handle_starttag

    @staticmethod
    def handle_data(data):
        print(end=data)

    @staticmethod
    def handle_entityref(name):
        print(end="&"+name+";")

    @staticmethod
    def handle_charref(name):
        print(end="&#"+name+";")

    @staticmethod
    def handle_comment(data):
        print(end="<!--"+data+"-->")

    @staticmethod
    def handle_decl(decl):
        print(end="<!"+decl+">")

    @staticmethod
    def handle_pi(data):
        print(end="<?"+data+">")

    unknown_decl = handle_decl

class HTMLPassMod(HTMLPass):
    def __init__(self, *a, argv=None, **k):
        super().__init__(*a, **k)
        self.stack = []
        self.args = []
        if argv is None:
            import sys
            argv = sys.argv[1:]
        for arg in argv:
            # Horrible string parsing
            # Should turn "/a#link-1.external/d" into
            # [d, ['a', ('id', 'link-1'), ('class', 'external')]]
            sel, act = arg[1:].split(arg[0])
            self.args.append([act])
            for selector in sel.split(">"):
                self.args[-1].append([])
                selector = selector.strip()
                if "." not in selector and "#" not in selector:
                    self.args[-1][-1].append(selector)
                    continue
                if "." not in selector:
                    self.args[-1][-1][:] = selector.split("#")
                    self.args[-1][-1][1:] = zip(["id"]*(len(self.args[-1][-1])-1), self.args[-1][-1][1:])
                    continue
                if "#" not in selector:
                    self.args[-1][-1][:] = selector.split(".")
                    self.args[-1][-1][1:] = zip(["class"]*(len(self.args[-1][-1])-1), self.args[-1][-1][1:])
                    continue
                if selector.index(".") < selector.index("#"):
                    tag, selector = selector.split(".", maxsplit=1)
                    selector = "." + selector
                else:
                    tag, selector = selector.split("#", maxsplit=1)
                    selector = "#" + selector
                self.args[-1][-1].append(tag)
                # Consume the remaining ".class" / "#id" parts one at a time
                while selector:
                    kind = "class" if selector[0] == "." else "id"
                    rest = selector[1:]
                    # The next part starts at the nearest "." or "#", if any
                    cuts = [i for i in (rest.find("."), rest.find("#")) if i != -1]
                    cut = min(cuts) if cuts else len(rest)
                    self.args[-1][-1].append((kind, rest[:cut]))
                    selector = rest[cut:]

    def handle_starttag(self, tag, attrs):
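        if self.stack and self.stack[-1][2] is not None and self.stack[-1][2][0] == 'k':
            # kill means kill: inherit the parent's instruction so that
            # deeper descendants and their end tags are dropped too
            self.stack.append((tag, attrs, self.stack[-1][2]))
            return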
        self.stack.append((tag, attrs, None))
        for arg in self.args:
            for frame, a in zip(self.stack[::-1], arg[:0:-1]):
                a_tag = a[0].replace("*", "").strip()
                if a_tag and frame[0] != a_tag:
                    break
                for attr, val in frame[1]:
                    if attr == "class":
                        frame_classes = val.split()
                        break
                else:
                    frame_classes = []
                for attr, val in a[1:]:
                    if attr == "class":
                        if val not in frame_classes:
                            break
                    else:
                        for a, v in frame[1]:
                            if a == attr and v == val:
                                break
                        else:
                            break
                else:
                    continue
                break
            else:
                self.stack[-1] = (tag, attrs, arg[0])
                if arg[0][0] in "drk":  # delete / replace / kill
                    if arg[0][0] == "r":
                        print(end=arg[0][1:])
                    return
                if arg[0][0] == "i":  # insert (inside / after)
                    super().handle_starttag(tag, attrs)
                    print(end=arg[0][2:].split(arg[0][1])[0])
                break
        else:
            super().handle_starttag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)
        self.stack.pop()

    def handle_endtag(self, tag):
        if self.stack[-1][0] != tag:
            # TODO: Implement proper HTML-isn't-XML behaviour
            pass
        frame = self.stack.pop()
        if frame[2] is None:
            return super().handle_endtag(tag)
        if frame[2][0] in "drk":  # delete / replace / kill
            return
        if frame[2][0] == "i":
            super().handle_endtag(tag)
            print(end=frame[2][2:].split(frame[2][1])[1])

    def handle_data(self, data):
        if self.stack and self.stack[-1][2] is not None and self.stack[-1][2][0] == 'k':
            return
        super().handle_data(data)

    def handle_entityref(self, name):
        if self.stack and self.stack[-1][2] is not None and self.stack[-1][2][0] == 'k':
            return
        super().handle_entityref(name)

    def handle_charref(self, name):
        if self.stack and self.stack[-1][2] is not None and self.stack[-1][2][0] == 'k':
            return
        super().handle_charref(name)

    def handle_comment(self, data):
        if self.stack and self.stack[-1][2] is not None and self.stack[-1][2][0] == 'k':
            return
        super().handle_comment(data)

    def handle_decl(self, decl):
        if self.stack and self.stack[-1][2] is not None and self.stack[-1][2][0] == 'k':
            return
        super().handle_decl(decl)

    def handle_pi(self, data):
        if self.stack and self.stack[-1][2] is not None and self.stack[-1][2][0] == 'k':
            return
        super().handle_pi(data)

    def unknown_decl(self, data):
        if self.stack and self.stack[-1][2] is not None and self.stack[-1][2][0] == 'k':
            return
        super().unknown_decl(data)

def run(pass_through=HTMLPassMod):
    x = pass_through()
    while True:
        try:
            i = input()
        except EOFError:
            break
        x.feed(i + '\n')
    x.close()

if __name__ == "__main__":
    run()

This code is terrible, but it does work, including in many edge cases.

Example usage:

wizzwizz4@wizzwizz4Laptop:~$ cat example_input.html
<div class="bg-detail2" id="geometry">
    <div class="container">
        <h2>Title</h2>
        <div class="line"></div>
        <div class="fix"></div>
        <div class="col50">
            Content
        </div>
        <div class="col50">
            Another Content
        </div>
    </div>
</div>
wizzwizz4@wizzwizz4Laptop:~$ <example_input.html ./rubbish_program.py '~div.newdiv~r<h2>Title</h2>'
<div class="bg-detail2" id="geometry">
    <div class="container">
        <h2>Title</h2>
        <div class="line"></div>
        <div class="fix"></div>
        <div class="col50">
            Content
        </div>
        <div class="col50">
            Another Content
        </div>
    </div>
</div>
wizzwizz4@wizzwizz4Laptop:~$ cat example_input_2.html
<div class="bg-detail2" id="geometry">
    <div class="container">
        <h2>Title</h2>
        <div class="line"></div>
        <div class="fix"></div>
        <div class="col50">
            Content
        </div>
        <div class="col50">
            Another Content
        </div>
    </div>
</div>
wizzwizz4@wizzwizz4Laptop:~$ <example_input_2.html ./rubbish_program.py 'Jdiv.containerJi~<div class="newdiv">~</div>' '\.container > h2\k'
<div class="bg-detail2" id="geometry">
    <div class="container"><div class="newdiv">

        <div class="line"></div>
        <div class="fix"></div>
        <div class="col50">
            Content
        </div>
        <div class="col50">
            Another Content
        </div>
    </div></div>
</div>

Syntax

./rubbish_program.py [argument...]

where argument is of the form:

<separator><selector><separator><instruction>

where:

  • separator is a single character that must not appear in selector or instruction.
  • selector is a series of compound selectors of the form tag.class1.class2#id.class3, separated by >. In each one, tag is optional, there can be at most one #id, and there can be any number of .class parts. Example: div#geometry > .container > h2.
  • instruction is an instruction of the form:

    <command><parameters>
    

    where command is one of the following:

    • d – removes the element without removing its children. Takes no parameters.
    • r – replaces the start tag with parameters, and removes the end tag, without removing the element's children.
    • i – has two separate behaviours, depending on whether the tag is self-closing.

      • If it is not self-closing, inserts the first parameter immediately after the start tag and the second parameter immediately after the end tag.
      • If it is self-closing, inserts the first parameter immediately after the tag, and ignores subsequent parameters.

      parameters is of the form:

      <separator2><first parameter><separator2><second parameter>[<separator2>discarded]
      

      separator2 must not occur in either parameter, and must be different from separator. It can have different values in separate invocations.

    • k – removes the element and its children. Takes no parameters.
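
To make the syntax above concrete, here is a minimal sketch of how one argument could be split into its parts. It is not the parser from the program above: the names parse_simple_selector and parse_argument are hypothetical, and the sketch assumes the argument is well-formed according to the rules listed.

```python
import re

def parse_simple_selector(text):
    """Split e.g. 'div#geometry.container' into (tag, id, classes)."""
    text = text.strip()
    match = re.match(r"[^.#]*", text)      # optional tag name comes first
    tag = match.group(0)
    if tag in ("", "*"):                   # no tag given: match any element
        tag = None
    element_id = None
    classes = []
    # Every remaining part is introduced by "." (class) or "#" (id)
    for kind, name in re.findall(r"([.#])([^.#]+)", text[match.end():]):
        if kind == "#":
            element_id = name
        else:
            classes.append(name)
    return tag, element_id, classes

def parse_argument(arg):
    """Split '<separator><selector><separator><instruction>' into its parts."""
    separator, rest = arg[0], arg[1:]
    selector, instruction = rest.split(separator, 1)
    steps = [parse_simple_selector(part) for part in selector.split(">")]
    command, parameters = instruction[0], instruction[1:]
    if command == "i":
        # i takes <separator2><first><separator2><second>[<separator2>discarded]
        separator2 = parameters[0]
        parameters = parameters[1:].split(separator2)[:2]
    else:
        # d and k take nothing; r takes the rest of the string as one parameter
        parameters = [parameters] if parameters else []
    return steps, command, parameters
```

For instance, parse_argument('Jdiv.containerJi~<div class="newdiv">~</div>') yields one selector step for div.container, the command i, and the two parameters <div class="newdiv"> and </div>.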
wizzwizz4
1

If avoidable, do not parse HTML with regular expressions.

Instead, try an HTML parser with node, Python etc.
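
As a rough illustration of the parser route in Python, here is a minimal sketch using only the standard library. It assumes the input is well-formed XML, which the example in the question happens to be but real-world HTML often is not; the function name wrap_container and the shortened SOURCE sample are made up for this sketch.

```python
import xml.etree.ElementTree as ET

SOURCE = """<div class="bg-detail2" id="geometry">
    <div class="container">
        <h2>Title</h2>
        <div class="line"></div>
        <div class="fix"></div>
    </div>
</div>"""

def wrap_container(markup):
    """Drop <h2> children of .container and wrap the rest in div.newdiv."""
    root = ET.fromstring(markup)
    # Collect the matches first so the tree is not modified while iterating
    containers = [
        el for el in root.iter("div")
        if "container" in el.get("class", "").split()
    ]
    for container in containers:
        for h2 in container.findall("h2"):   # direct <h2> children only
            container.remove(h2)
        children = list(container)
        for child in children:
            container.remove(child)
        wrapper = ET.Element("div", {"class": "newdiv"})
        wrapper.extend(children)
        container.append(wrapper)
    # short_empty_elements=False keeps <div></div> instead of <div />
    return ET.tostring(root, encoding="unicode", short_empty_elements=False)
```

Calling print(wrap_container(SOURCE)) writes the modified markup to standard output; whitespace between elements is not tidied up.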

If you have docker installed, you can try this simple script:

docker run --rm -i phil294/jquery-jsdom '$("#geometry h2").remove(); $("#geometry").append("<div class=\"newdiv\"/>"); $("#geometry").prop("outerHTML")' <<< '
<div class="bg-detail2" id="geometry">
    <h2>Title</h2>
</div>

'

This demonstrates a simple remove and append, with the full power of jQuery at your disposal. It uses jsdom with eval(). I host the image as phil294/jquery-jsdom.

phil294
0
sed 's/<div class="container">/&\n      <div class="newdiv">/g' file_input.html

This will work with sed, but as you say it may not be bulletproof. This may also cause issues with your indentation, but if it is consistent throughout, you could use it...

Ribtips
0

First, install this package:

sudo apt-get install html-xml-utils

There are 31 tools in this package. Here is a summary of what they can do:

  • cexport – create header file of exported declarations from a C file

  • hxaddid – add IDs to selected elements

  • hxcite – replace bibliographic references by hyperlinks

  • hxcite-mkbib – expand references and create bibliography

  • hxcopy – copy an HTML file while preserving relative links

  • hxcount – count elements and attributes in HTML or XML files

  • hxextract – extract selected elements

  • hxclean – apply heuristics to correct an HTML file

  • hxprune – remove marked elements from an HTML file

  • hxincl – expand included HTML or XML files

  • hxindex – create an alphabetically sorted index

  • hxmkbib – create bibliography from a template

  • hxmultitoc – create a table of contents for a set of HTML files

  • hxname2id – move some ID= or NAME= from A elements to their parents

  • hxnormalize – pretty-print an HTML file

  • hxnum – number section headings in an HTML file

  • hxpipe – convert XML to a format easier to parse with Perl or AWK

  • hxprintlinks – number links & add table of URLs at end of an HTML file

  • hxremove – remove selected elements from an XML file

  • hxtabletrans – transpose an HTML or XHTML table

  • hxtoc – insert a table of contents in an HTML file

  • hxuncdata – replace CDATA sections by character entities

  • hxunent – replace HTML predefined character entities by UTF-8

  • hxunpipe – convert output of pipe back to XML format

  • hxunxmlns – replace “global names” by XML Namespace prefixes

  • hxwls – list links in an HTML file

  • hxxmlns – replace XML Namespace prefixes by “global names”

  • asc2xml, xml2asc – convert between UTF-8 and entities

  • hxref – generate cross-references

  • hxselect – extract elements that match a (CSS) selector

These are all the tools you need to manipulate an HTML or XML file as you wish.

Example hxprune:

hxprune -c container index.html > index2.html

You choose the elements to prune by class, here with "-c container", then pass the name of the file you want to manipulate. Finally, the ">" operator redirects the output of hxprune to another file. In the output, the .container branch of the HTML tree is cut out.

Κωλζαρ
  • Could you explain? `hxprune` only seems to target classes, and doesn't seem powerful enough to perform the tasks required. An example of use would be useful. – wizzwizz4 Feb 23 '19 at 14:42
  • Check if it's enough for you. Tell me I'm available. – Κωλζαρ Feb 23 '19 at 15:20
  • This still isn't enough to do what's described in the question. That'll just prune the `<div class="container">`, not insert content into it. – wizzwizz4 Feb 23 '19 at 15:21