0

I am having trouble parsing HTML in Python. I'm specifically looking for a way to do this with regular expressions, not for reasons why I shouldn't. Other approaches might solve this better, but my requirements unfortunately rule out other modules or libraries. Thanks for the help.

I have the following HTML:

<tbody ID='archive'>
    <tr><td valign="top">Type / Path</td>
        <td colspan=2>CIFS / 10.5.0.5:/selva</td>
    </tr>
    <tr><td valign="top">Last availability</td>
        <td colspan=2>1970-01-01 05:30:00</td>
    </tr>
    <tr><td valign="top">Capacity Internal / Archive</td>
        <td colspan=2>3.7 / 10.0 GByte</td>
    </tr>
    <tr><td valign="top">Blocks To sync / Transferred / Lost</td>
        <td colspan=2>951 / 0 / 15 (last 24 hours)</td>
    </tr>
    <tr><td valign="top">Bandwidth Available / Total usage</td>
        <td colspan=2>0 kB/s / 0 kB/s</td>
    </tr>
    <tr><td valign="top">Buffer Usage / Capacity left</td>
        <td colspan=2>100 % / 0 m</td>
    </tr>
</tbody>
<tr bgcolor="#CCCCCC"><th onclick="showhide(this,'events')" align=left colspan=3 width="style: auto;">&#x25BD;&#x0020;Event and Action Setup</th></tr>
<tbody ID='events'>
    <tr>
        <td>Arming</td>
        <td>Enabled</td>
    </tr>
    <tr>
        <td>Events</td>
        <td colspan=2>PI MI AS UC TimeSync </td>
    </tr>
    <tr>
        <td>Actions</td>
        <td colspan=2>(IP) REC FR</td>
    </tr>
</tbody>

I need to get the number which comes after the Buffer Usage element (line 17 in the code above); in this case it is 100% (line 18 in the code above), and this number can have 1 to 3 digits.

How do I get this number extracted from the code above in Python?

The reason I need to do this is so I can send out an email if the buffer is above 10%. I can code that part, but I don't know how to extract the information from the HTML above.

The code will be run on a NAS box, so it would be ideal if the solution used only the Python standard library.

Anand Davis
  • Split by your string, take the first element of the list, go through the string character by character until you get something different from 0123456789. – Martin Thoma May 20 '15 at 12:10
  • As you can see from the discussion at the answers, this is a question that a) has been asked many times before, and b) people can't agree on an answer. So even though there are solutions that technically work, the bad spirit which is provoked is bad for SO as a site of Good Answers to Good Questions. I will vote to close your question because of this: "Many good questions generate some degree of opinion based on expert experience, but answers to this question will tend to be almost entirely based on opinions, rather than facts, references, or specific expertise." –  May 20 '15 at 13:27
  • @Lutz It was never my intention to provoke bad spirit; I was only looking for an answer to a question that I did not have experience in. The question was very specific to the problem I had. I did not ask for an opinion and at the same time cannot stop someone from stating their opinion. I'm sure there are moderators that are fair in deciding on your motion to close the question. In good spirit I sincerely thank you for your efforts in helping me, as well as everyone that answered my question. – Anand Davis May 20 '15 at 15:02
  • @Lutz If you must know why I'm trying to avoid Beautiful Soup for my solution: it would be used in an environment where the entire Beautiful Soup source code would have to be verified by my client's change control, which would take months, and that is a problem for reasons too long to state in this comment. In no way do I imply that your solution is incorrect; sometimes we need to do things that appease management as well. – Anand Davis May 20 '15 at 15:04
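
The split-and-scan idea from the first comment can be sketched with plain string methods, which stays inside the standard library. This is only a rough sketch: the function name `buffer_usage_by_split` is hypothetical, it scans the part after the label rather than before it, and it assumes the value sits in the first `<td>` that follows the label.

```python
def buffer_usage_by_split(html, label="Buffer Usage"):
    """Return the first run of digits in the cell after `label`, or None."""
    parts = html.split(label, 1)
    if len(parts) < 2:
        return None  # label not present in the page
    rest = parts[1]
    # Find the next opening <td ...> tag; "</td>" does not match "<td".
    cell_start = rest.find("<td")
    if cell_start == -1:
        return None
    tag_end = rest.find(">", cell_start)
    if tag_end == -1:
        return None
    digits = ""
    for ch in rest[tag_end + 1:]:
        if ch in "0123456789":
            digits += ch
        elif digits or not ch.isspace():
            break  # stop at the first non-digit after the number
    return int(digits) if digits else None
```

Called on the HTML from the question, this should return the integer 100, which can then be compared against the 10% threshold.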

4 Answers

3

Anand Davis, please try this for a start:

from bs4 import BeautifulSoup

html = """<tbody ID='archive'>
<tr><td valign="top">Type / Path</td>
<td colspan=2>CIFS / 10.5.0.5:/selva</td>
</tr>
<tr><td valign="top">Last availability</td>
<td colspan=2>1970-01-01 05:30:00</td>
</tr>
<tr><td valign="top">Capacity Internal / Archive</td>
<td colspan=2>3.7 / 10.0 GByte</td>
</tr>
<tr><td valign="top">Blocks To sync / Transferred / Lost</td>
<td colspan=2>951 / 0 / 15 (last 24 hours)</td>
</tr>
<tr><td valign="top">Bandwidth Available / Total usage</td>
<td colspan=2>0 kB/s / 0 kB/s</td>
</tr>
<tr><td valign="top">Buffer Usage / Capacity left</td>
<td colspan=2>100 % / 0 m</td>
</tr>
</tbody>
<tr bgcolor="#CCCCCC"><th onclick="showhide(this,'events')" align=left colspan=3 width="style: auto;">&#x25BD;&#x0020;Event and Action Setup</th></tr>
<tbody ID='events'>
<tr><td>Arming</td>
<td>Enabled</td>
</tr>
<tr><td>Events</td>
<td colspan=2>PI MI AS UC TimeSync </td>
</tr>
<tr><td>Actions</td>
<td colspan=2>(IP) REC FR</td>
</tr>
</tbody>"""

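html = BeautifulSoup(html, "html.parser")  # name the parser explicitly
trs = html.find_all('tr')
for tr in trs:
    # Pick the row by its label, then read the first token of the second cell.
    if "Buffer Usage / Capacity left" in tr.text:
        print(tr.find_all("td")[1].text.split(" ")[0])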

Output: 100

The trs variable holds a list of all the rows, each containing its individual cells. You can apply further operations to this list as your requirements dictate. Please refer to the Beautiful Soup documentation for details.

Jatin Bansal
3

You can pass text=re.compile("Buffer Usage") to find the td that contains the text Buffer Usage, then get the next td tag and extract the usage with re.

import re
from bs4 import BeautifulSoup

# html is the page source shown in the question
soup = BeautifulSoup(html, "html.parser")
txt = soup.find("td", text=re.compile("Buffer Usage")).find_next("td").text
print(re.search(r"\d+", txt).group())
100

If there is always a space you can split:

print(txt.split(None,1)[0])

Or, if other numbers can come first, search for the number directly before the %:

print(re.search(r"(\d+)\s+%", txt).group(1))
Padraic Cunningham
2

Using BeautifulSoup you can access the parts of your HTML.

The following code snippet extracts the usage as an integer, but assumes that the structure of the page is always the same. It takes the 2nd column in the 6th row (index 5) and parses it using a regex.

from bs4 import BeautifulSoup # A library with which to parse HTML (fragments)
import re

s = '''<tbody ID='archive'>
<tr><td valign="top">Type / Path</td>
<td colspan=2>CIFS / 10.5.0.5:/selva</td>
</tr>
<tr><td valign="top">Last availability</td>
<td colspan=2>1970-01-01 05:30:00</td>
</tr>
<tr><td valign="top">Capacity Internal / Archive</td>
<td colspan=2>3.7 / 10.0 GByte</td>
</tr>
<tr><td valign="top">Blocks To sync / Transferred / Lost</td>
<td colspan=2>951 / 0 / 15 (last 24 hours)</td>
</tr>
<tr><td valign="top">Bandwidth Available / Total usage</td>
<td colspan=2>0 kB/s / 0 kB/s</td>
</tr>
<tr><td valign="top">Buffer Usage / Capacity left</td>
<td colspan=2>100 % / 0 m</td>
</tr>
</tbody>
<tr bgcolor="#CCCCCC"><th onclick="showhide(this,'events')" align=left colspan=3 width="style: auto;">&#x25BD;&#x0020;Event and Action Setup</th></tr>
<tbody ID='events'>
<tr><td>Arming</td>
<td>Enabled</td>
</tr>
<tr><td>Events</td>
<td colspan=2>PI MI AS UC TimeSync </td>
</tr>
<tr><td>Actions</td>
<td colspan=2>(IP) REC FR</td>
</tr>
</tbody>'''

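doc = BeautifulSoup(s, "html.parser")
row = doc.find_all('tr')[5]        # sixth row: "Buffer Usage / Capacity left"
column = row.find_all('td')[1]     # second cell holds the value
usage_string = column.get_text()   # "100 % / 0 m"

# Require at least one digit so int() never receives an empty string.
r = re.match(r'(\d{1,3}) % .+', usage_string)
usage = int(r.group(1))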

If the page content is a bit more dynamic, you need to write code that finds the correct row instead of picking it out by index like this.

The BeautifulSoup documentation should give you all information you need to refine the code if necessary.

A possibility would be to check for the "archive" ID and then scan the rows, checking the first TD for the "Buffer Usage" string.
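
The row-scanning idea above can also be sketched without BeautifulSoup. The version below swaps in the standard library's `html.parser.HTMLParser`, which is not what this answer uses; the names `BufferUsageParser` and `find_buffer_usage` are hypothetical, and the sketch assumes the value cell starts with one to three digits followed by `%`.

```python
import re
from html.parser import HTMLParser


class BufferUsageParser(HTMLParser):
    """Collect the cell texts of every <tr> inside <tbody ID='archive'>."""

    def __init__(self):
        super().__init__()
        self.in_archive = False
        self.in_cell = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names, so ID= arrives as "id".
        if tag == "tbody":
            self.in_archive = dict(attrs).get("id") == "archive"
        elif tag == "tr" and self.in_archive:
            self.rows.append([])
        elif tag == "td" and self.in_archive and self.rows:
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "tbody":
            self.in_archive = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data


def find_buffer_usage(html):
    """Return the buffer usage as an int, or None if no matching row exists."""
    parser = BufferUsageParser()
    parser.feed(html)
    parser.close()
    for cells in parser.rows:
        # Locate the row by its label instead of by a fixed index.
        if len(cells) >= 2 and "Buffer Usage" in cells[0]:
            match = re.match(r"\s*(\d{1,3})\s*%", cells[1])
            if match:
                return int(match.group(1))
    return None
```

Because the row is found by its label, the result does not depend on the row's position in the table.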

wonderb0lt
  • What do I do if "Buffer Usage" is not always on the 6th row? The page can be quite dynamic. – Anand Davis May 20 '15 at 12:24
  • Instead of just picking out the row at index 5, iterate over the result of `doc.find_all('tr')` and check if the string 'Buffer Usage' appears in the first column. If so, continue as above. – wonderb0lt May 20 '15 at 12:30
1

As the other answers point out, regexes are not suited to parsing HTML. See this answer. However, if you cannot install a proper parsing library like Beautiful Soup, regexes are your best bet. A regex that solves the problem as stated is:

import re

text = """<tr><td valign="top">Buffer Usage / Capacity left</td>
<td colspan=2>100 % / 0 m</td>"""
# Skip to the end of the label line, then past the next tag's ">" to the digits.
result = re.search(r"Buffer Usage.*\n.*?>(\d{1,3}) % .+", text).group(1)
print(result)  # 100
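
A slightly more tolerant variant might look like the sketch below. It assumes the value always sits in the `<td>` immediately after the label cell, and it tolerates arbitrary whitespace between the tags. The names `BUFFER_RE`, `buffer_usage` and `should_alert` are hypothetical; the default threshold of 10 comes from the question.

```python
import re

# Label cell, its closing </td>, the next <td ...> tag, then 1-3 digits and "%".
BUFFER_RE = re.compile(
    r"Buffer\s+Usage[^<]*</td>\s*<td[^>]*>\s*(\d{1,3})\s*%",
    re.IGNORECASE,
)


def buffer_usage(html):
    """Return the buffer usage percentage as an int, or None if not found."""
    match = BUFFER_RE.search(html)
    return int(match.group(1)) if match else None


def should_alert(html, threshold=10):
    """True when a usage value was found and it exceeds the threshold."""
    usage = buffer_usage(html)
    return usage is not None and usage > threshold
```

Returning None instead of raising when nothing matches lets the caller decide whether a missing value should itself trigger a notification.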
Sebastian Wozny
  • 4
    [Oh my god!](http://stackoverflow.com/a/1732454/1907906) –  May 20 '15 at 12:11
  • 1
    **Moderator Note**: If this is a bad idea; your *answer* should include what you think is the 'right' idea, the library or code needed to solve the OP's issue 'the right way', and the code that will solve the OP's issue. – George Stocker May 20 '15 at 14:41
  • @GeorgeStocker As you with enough privileges can see below, I proposed an answer that answers the question in what I think is the correct way. Sadly this answer was downvoted because of the discussion here that you've now removed. Answers that started out almost useless now are still here with upvotes. All this is very sad. –  May 20 '15 at 15:06
  • 1
    I agree with Lutz. His final answer was of higher quality than mine for the generic case. However, the author specifically asked for something that will run with packages from the standard library only, so I think my solution is correct. – Sebastian Wozny May 20 '15 at 15:09