How to grab all raw HTML within a certain XPath from a local file in Python

Question

I am trying to grab the raw html from a bunch of local html files. I had some help from this post in getting the raw file to read in:

Get all text inside a tag lxml

But the code I currently have produces the entire file instead of a subset. I seem to be missing the step where I choose the XPath I want to grab.

Here is the code I currently have:

def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    parts = ([node.text] +
            list(chain(*([c.text, tostring(c), c.tail] for c
            in node.getchildren()))) +
            [node.tail])
    # filter removes possible Nones in texts and tails
    return ''.join(filter(None, parts))

import os
from lxml import html

for filename in os.listdir('../news/article/'):
    if filename.endswith('.html') and not filename.startswith('._'):
        print(filename)
        with open('../news/article/' + filename, "r") as f:
            page = f.read()
        tree = html.fromstring(page)
        maincontent = stringify_children(tree)
        print(maincontent)

My end goal is to get that content as a string and write it to a local file that contains only the contents of that div.

Here is a sample file:

<html>

<head>
    <title>Title</title>
    <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/css/bootstrap.min.css">
</head>

<body>
    <div class="container">
        <div class="row">
            <div class="col-xs-4">
                <div class="left-bar"></div>
            </div>
            <div class="col-xs-4">
                <div class="middle-bar"></div>
            </div>
            <div class="col-xs-4">
                <div class="right-bar"></div>
            </div>
        </div>
        <div class="row">
            <div class="col-xs-3">
                <div class="navigation"></div>
            </div>
            <div class="col-xs-9">
                <div class="main-content">
                    Hello
                    <br>
                    <br><a href="http://www.stackexchange.com">Click here to visit stack exchange</a>
                    <h1>This is an introduction</h1>
                    <h3>This is the third header</h3>
                    <p>Lorem ipsum dolor sit amet.....</p>
                    <p>Lorem ipsum dolor sit amet.....</p>
                    <p>Lorem ipsum dolor sit amet.....</p>
                    <ul>
                        <li>list text</li>
                        <li>list text</li>
                        <li>list text</li>
                        <li>list text</li>
                    </ul>
                    <div class="row">
                        <div class="col-xs-4"><img src="#">More content 1</div>
                        <div class="col-xs-4"><img src="#">More content 2</div>
                        <div class="col-xs-4"><img src="#">More content 3</div>
                    </div>

                </div>
            </div>
        </div>
    </div>

</body>

</html>

I want to grab all of the content underneath the main-content class. Here is the XPath of that element in this file:

XPath: /html/body/div/div[2]/div[2]/div

The program should output the following:

                    Hello
                    <br>
                    <br><a href="http://www.stackexchange.com">Click here to visit stack exchange</a>
                    <h1>This is an introduction</h1>
                    <h3>This is the third header</h3>
                    <p>Lorem ipsum dolor sit amet.....</p>
                    <p>Lorem ipsum dolor sit amet.....</p>
                    <p>Lorem ipsum dolor sit amet.....</p>
                    <ul>
                        <li>list text</li>
                        <li>list text</li>
                        <li>list text</li>
                        <li>list text</li>
                    </ul>
                    <div class="row">
                        <div class="col-xs-4"><img src="#">More content 1</div>
                        <div class="col-xs-4"><img src="#">More content 2</div>
                        <div class="col-xs-4"><img src="#">More content 3</div>
                    </div>

So you don't want the div itself? That will give you broken html are you sure you want that? — Padraic Cunningham, Jul 06 '16 at 21:57
Yes. I am sure because I will be importing the data into a new html document that already has that tag created. — Paul Loach, Jul 07 '16 at 13:25

Padraic Cunningham · Answer 1 · 2016-07-06T22:24:36.330

3

Using lxml:

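from lxml import html

# h is the HTML source as a string, e.g. the contents of one local file
xm = html.fromstring(h)
div = xm.xpath("//div[@class='main-content']")[0]
# div.text is the text before the first child; tostring() also emits each child's tail
print((div.text or "") + "".join(html.tostring(el, encoding="unicode") for el in div.xpath("./*")))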

Or:

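from lxml import html

xm = html.fromstring(h)
# The union returns the div's text nodes and child elements in document order
eles = xm.xpath("//div[@class='main-content']/text() | //div[@class='main-content']/*")
# with_tail=False: the tails are already in eles as text() nodes, so don't emit them twice
print("".join(ele if isinstance(ele, str) else html.tostring(ele, encoding="unicode", with_tail=False) for ele in eles))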

edited Jul 06 '16 at 22:24

answered Jul 06 '16 at 22:15


score -1 · Accepted Answer · answered Jul 06 '16 at 21:38

You could try using BeautifulSoup. I'm not well versed in it, but you can do something like this (or something cleaner, if you read up on BeautifulSoup):

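from bs4 import BeautifulSoup

# "html.parser" is the parser that ships with Python
with open("input.html") as f:
    soup = BeautifulSoup(f, "html.parser")
# find_all returns every element whose class is "main-content"
x = soup.find_all(class_="main-content")
for line in x[0].contents:
    print(line, end=" ")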

You'll get output like this:

        Hello
         <br/>
<br/> <a href="http://www.stackexchange.com">Click here to visit stack exchange</a>
<h1>This is an introduction</h1>
<h3>This is the third header</h3>
<p>Lorem ipsum dolor sit amet.....</p>
<p>Lorem ipsum dolor sit amet.....</p>
<p>Lorem ipsum dolor sit amet.....</p>
<ul>
<li>list text</li>
<li>list text</li>
<li>list text</li>
<li>list text</li>
</ul>
<div class="row">
<div class="col-xs-4"><img src="#"/>More content 1</div>
<div class="col-xs-4"><img src="#"/>More content 2</div>
<div class="col-xs-4"><img src="#"/>More content 3</div>
</div>

BeautifulSoup will "fix" the HTML syntax, like the change from <br> to <br/>, and it'll keep the spacing inside of the elements. See the docs on it at: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
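If you want the contents back as a single string rather than printed piece by piece, a possibly cleaner version is to let BeautifulSoup serialize the children itself. This is only a sketch: main_content_html is a name I made up, and it assumes the first element with that class is the one you want.

```python
from bs4 import BeautifulSoup


def main_content_html(markup, css_class="main-content"):
    """Return the inner HTML of the first element with css_class, or None."""
    soup = BeautifulSoup(markup, "html.parser")
    node = soup.find(class_=css_class)
    if node is None:
        return None
    # decode_contents() renders only the children, not the element's own tag.
    return node.decode_contents()
```

The returned string can then be written to a file with an ordinary open(..., "w"). Recent versions of BeautifulSoup also accept formatter="html5" in decode_contents(), which I believe leaves off the closing slash on void tags like <br>; check the docs for the version you have installed.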
