python HTML parsing issue

Question

Given an HTML page, I would like to extract only a list of pairs like (id1, value1), (id2, value2), and so on. The file looks like this:

    <div class="col m3 s12 col_title"><div class="font-small grey-text truncate content" title="value1">value1</div></div>
    <div class="col m7 s12 col_id"><div class="content wrap">id1</div></div>

Every value is followed by a "content wrap" div that holds its id. I was thinking of something like:

match = re.compile('title="(.+?)".+?wrap"(.+?)"').findall(source)

This is an example:

<li class="collection-item Ids ">
<div class="row">
    <div class="col m3 s12 col_title"><div class="font-small grey-text truncate content" title="filename1">filename1</div></div>
    <div class="col m7 s12 col_id"><div class="content wrap">6000bc3211af43d7</div></div>
    <div></div>
    <div class="col m2 s12 col_time">
        <div class="content">
            <a href="http://test.com/test.php" target="_blank" class="secondary-content pull-right">
                <span class="font-small grey-text" title="filex">test</span>
                <i class="fa fa-external-link" aria-hidden="true" title="filey"></i>
            </a>
        </div>
    </div>
</div>

Example of the expected output, with filenames and values that look like an MD5 or another hash:

    filename1
    6000bc3211af43d7
    filename2
    32475af45c6bc432

– K. H., Feb 10 '18 at 15:32

TheWildHealer · Answer 1 · 2018-02-10T14:54:02.123

You can try Beautiful Soup; it should have everything you need for parsing HTML.

For example, you could use:

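# Open the HTML from the website or from a file (see the Beautiful Soup docs)
from urllib.request import urlopen
from bs4 import BeautifulSoup
soup = BeautifulSoup(urlopen(yoururl), "lxml")
# find_all returns a list of tags, so get_text() must be called on each one
result = [tag.get_text() for tag in soup.find_all(class_="content wrap")]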

Here, result would be a list containing the text content of every element whose class attribute is exactly "content wrap".
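
To also get the matching title for each id, one option is to collect both lists and pair them by position. The following is only a minimal sketch: the `html` string is a two-row sample based on the question's markup, `html.parser` is swapped in for `lxml` so no extra parser is needed, and it assumes every title cell has exactly one id cell after it.

```python
from bs4 import BeautifulSoup

# Sample markup based on the question; a real page would come from a file or URL.
html = """
<div class="row">
    <div class="col m3 s12 col_title"><div class="font-small grey-text truncate content" title="filename1">filename1</div></div>
    <div class="col m7 s12 col_id"><div class="content wrap">6000bc3211af43d7</div></div>
</div>
<div class="row">
    <div class="col m3 s12 col_title"><div class="font-small grey-text truncate content" title="filename2">filename2</div></div>
    <div class="col m7 s12 col_id"><div class="content wrap">32475af45c6bc432</div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Text of every element whose class attribute is exactly "content wrap".
ids = [tag.get_text(strip=True) for tag in soup.find_all(class_="content wrap")]

# Text of every title cell (any element carrying the "col_title" class).
titles = [tag.get_text(strip=True) for tag in soup.find_all(class_="col_title")]

# Pairing by position only works if both lists line up one-to-one.
pairs = list(zip(ids, titles))
print(pairs)
```

If a row can be missing one of the two cells, pairing by position will silently shift the results, so walking the rows one at a time is the safer choice.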

edited Feb 10 '18 at 14:54

answered Feb 10 '18 at 14:47

score 0 · Answer 2 · answered Feb 10 '18 at 14:51

Can you show an example for id1 and value1? I have an idea:

    \w{1,}\d{1,}<

Match this pattern and take the results from index 1 to len(match)-1. It may not be correct, though.
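
For comparison, here is a regex-only sketch run against markup shaped like the question's example. The pattern is an illustrative assumption, not the one from this answer: it anchors on `col_title` so that unrelated `title` attributes elsewhere in a row (such as `title="filex"`) are not picked up. Regexes like this tend to break as soon as the markup changes, so an HTML parser is usually the more robust route.

```python
import re

# Sample markup based on the question, including an unrelated title attribute.
source = """
<div class="row">
    <div class="col m3 s12 col_title"><div class="font-small grey-text truncate content" title="filename1">filename1</div></div>
    <div class="col m7 s12 col_id"><div class="content wrap">6000bc3211af43d7</div></div>
    <span class="font-small grey-text" title="filex">test</span>
</div>
<div class="row">
    <div class="col m3 s12 col_title"><div class="font-small grey-text truncate content" title="filename2">filename2</div></div>
    <div class="col m7 s12 col_id"><div class="content wrap">32475af45c6bc432</div></div>
</div>
"""

# Start at a col_title cell, capture its title attribute, then capture the
# text of the next "content wrap" div. DOTALL lets ".*?" span line breaks.
pattern = re.compile(
    r'col_title">.*?title="([^"]+)".*?class="content wrap">([^<]+)<',
    re.DOTALL,
)

# findall yields (title, id) tuples; swap them into (id, title) order.
pairs = [(id_, title) for title, id_ in pattern.findall(source)]
print(pairs)
```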

AJackTi

An example would look like this:

    filename1
    6000bc3211af43d7
    filename2
    32475af45c6bc432

– K. H. Feb 10 '18 at 15:31

yenter · Answer 3 · 2018-02-11T08:53:06.500

Building on TheWildHealer's answer, you can use the following:

from bs4 import BeautifulSoup
import requests

# Fetch the page and parse it (replace the URL with the real one)
soup = BeautifulSoup(requests.get("http://websitehere.com").text, "lxml")
results = []
# Each "row" element holds one title cell and one id cell
for row in soup.find_all(class_="row"):
    title_text = row.find(class_="col_title").get_text(strip=True)
    id_text = row.find(class_="col_id").get_text(strip=True)
    results.append((id_text, title_text))
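
The same row-by-row idea can be tried offline against markup shaped like the question's example. This is a sketch under a few assumptions: `extract_pairs` is a hypothetical helper name, the `html` string is a trimmed sample rather than the real page, and `html.parser` replaces `lxml` so nothing beyond Beautiful Soup needs to be installed. It also skips rows that lack either cell, since `row.find(...)` returns None in that case and calling `get_text()` on None would raise an AttributeError.

```python
from bs4 import BeautifulSoup

# Trimmed sample based on the question, plus one row without the expected cells.
html = """
<li class="collection-item Ids ">
<div class="row">
    <div class="col m3 s12 col_title"><div class="font-small grey-text truncate content" title="filename1">filename1</div></div>
    <div class="col m7 s12 col_id"><div class="content wrap">6000bc3211af43d7</div></div>
</div>
</li>
<li class="collection-item Ids ">
<div class="row">
    <div class="col m3 s12 col_title"><div class="font-small grey-text truncate content" title="filename2">filename2</div></div>
    <div class="col m7 s12 col_id"><div class="content wrap">32475af45c6bc432</div></div>
</div>
</li>
<li class="collection-item">
<div class="row"><div class="col s12">no data</div></div>
</li>
"""


def extract_pairs(markup):
    """Return a list of (id, title) tuples, one per complete row."""
    soup = BeautifulSoup(markup, "html.parser")
    pairs = []
    for row in soup.find_all(class_="row"):
        title_cell = row.find(class_="col_title")
        id_cell = row.find(class_="col_id")
        # Skip rows that do not contain both cells.
        if title_cell is None or id_cell is None:
            continue
        pairs.append((id_cell.get_text(strip=True), title_cell.get_text(strip=True)))
    return pairs


print(extract_pairs(html))
```

Looking up both cells inside the same row keeps each id tied to its own title, even when some rows are incomplete.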

edited Feb 11 '18 at 08:53

answered Feb 10 '18 at 15:59

Can you please check that it works with the example I just added, so that it returns filename1 and the corresponding content wrap value, 6000bc3211af43d7? Thanks. – K. H. Feb 10 '18 at 19:41
I've updated the answer. You can (and should) read more about BeautifulSoup as well, which is linked in the other answer. – yenter Feb 11 '18 at 08:53
