Given an HTML page, I would like to extract a list of pairs like (id1, value1), (id2, value2), .... The page is structured like this:

    <div class="col m3 s12 col_title"><div class="font-small grey-text truncate content" title="value1">value1</div></div>
    <div class="col m7 s12 col_id"><div class="content wrap">id1</div></div>

Every value is followed by a "content wrap" element that holds its id. I was thinking of something like:

match = re.compile('title="(.+?)".+?wrap"(.+?)"').findall(source)


This is an example:

<li class="collection-item Ids ">
<div class="row">
    <div class="col m3 s12 col_title"><div class="font-small grey-text truncate content" title="filename1">filename1</div></div>
    <div class="col m7 s12 col_id"><div class="content wrap">6000bc3211af43d7</div></div>
    <div></div>
    <div class="col m2 s12 col_time">
        <div class="content">
            <a href="http://test.com/test.php" target="_blank" class="secondary-content pull-right">
                <span class="font-small grey-text" title="filex">test</span>
                <i class="fa fa-external-link" aria-hidden="true" title="filey"></i>
            </a>
        </div>
    </div>
</div>

K. H.

3 Answers

You can try Beautiful Soup; it should have everything you need for parsing HTML.

For example, you could use:

from urllib.request import urlopen
from bs4 import BeautifulSoup
# open the html from the website or from a file, check the doc
soup = BeautifulSoup(urlopen(yoururl), "lxml")
# find_all returns a list of tags, so call get_text() on each one
result = [tag.get_text() for tag in soup.find_all(class_="content wrap")]

Here, result would be a list containing the text of every element whose class attribute is "content wrap".
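
To get (id, value) pairs rather than only the ids, one option is to collect the title attributes as well and zip the two lists. This is a minimal sketch, assuming the markup always looks like the snippet in the question; the inline `html` string and the use of the stdlib `"html.parser"` backend are choices made for this illustration, not something the page requires.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question, trimmed to the two relevant columns.
html = """
<div class="row">
    <div class="col m3 s12 col_title"><div class="font-small grey-text truncate content" title="filename1">filename1</div></div>
    <div class="col m7 s12 col_id"><div class="content wrap">6000bc3211af43d7</div></div>
</div>
"""

# "html.parser" ships with Python; "lxml" also works if it is installed.
soup = BeautifulSoup(html, "html.parser")

# Values: the title attribute of the div inside each col_title column.
values = [tag["title"] for tag in soup.select("div.col_title div[title]")]
# Ids: the text of the "content wrap" div inside each col_id column.
ids = [tag.get_text(strip=True) for tag in soup.select("div.col_id div.content.wrap")]

# Pair them up in document order.
pairs = list(zip(ids, values))
print(pairs)
```

Note that `zip` silently pairs by position, so it only gives correct results when every row has both a title and an id.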

TheWildHealer

Can you show an example for id1 and value1? I have an idea:

Match with \w{1,}\d{1,}< and then take the matches from index 1 to len(match)-1. It may not be correct, though.
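
If a regular expression is preferred, the sketch below shows one way it might look on the question's markup. It is only a sketch under assumptions: the pattern is anchored on the `col_title` and `col_id` wrappers exactly as they appear in the sample, and the second row (`filename2` and its id) is made up here to show that unrelated `title` attributes such as `filex` are skipped. Regexes on HTML are brittle, so any change in attribute order or whitespace would break it.

```python
import re

# First row is taken from the question; the second row is hypothetical.
html = '''
<div class="row">
    <div class="col m3 s12 col_title"><div class="font-small grey-text truncate content" title="filename1">filename1</div></div>
    <div class="col m7 s12 col_id"><div class="content wrap">6000bc3211af43d7</div></div>
    <span class="font-small grey-text" title="filex">test</span>
</div>
<div class="row">
    <div class="col m3 s12 col_title"><div class="font-small grey-text truncate content" title="filename2">filename2</div></div>
    <div class="col m7 s12 col_id"><div class="content wrap">0000000000000002</div></div>
</div>
'''

# Anchor on the column wrappers so that other title="..." attributes are ignored.
# re.S lets "." cross the newline between the title line and the id line.
pattern = re.compile(
    r'col_title"><div[^>]*\btitle="([^"]*)"[^>]*>'
    r'.*?col_id"><div class="content wrap">([^<]*)<',
    re.S,
)

# findall yields (value, id) tuples; swap them to get (id, value).
pairs = [(id_, value) for value, id_ in pattern.findall(html)]
print(pairs)
```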

AJackTi

Building on TheWildHealer's answer, you can use the following:

from bs4 import BeautifulSoup
import requests

# download the page and parse it
soup = BeautifulSoup(requests.get("http://websitehere.com").text, "lxml")
results = []
# each "row" element holds one title column and one id column
for row in soup.find_all(class_="row"):
    titleText = row.find(class_="col_title").get_text()
    idText = row.find(class_="col_id").get_text()
    results.append((idText, titleText))

yenter
  • Can you please check that it works with the example I just added to get filename1 and corresponding content wrap: 6000bc3211af43d7 ? thanks. – K. H. Feb 10 '18 at 19:41
  • I've updated the answer. You can (and should) read more about BeautifulSoup as well, which is linked in the other answer. – yenter Feb 11 '18 at 08:53