
I am looking for a way to extract certain information from HTML in a Linux shell environment.

This is the bit that I'm interested in:

<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
  <tr valign="top">
    <th>Tests</th>
    <th>Failures</th>
    <th>Success Rate</th>
    <th>Average Time</th>
    <th>Min Time</th>
    <th>Max Time</th>
  </tr>
  <tr valign="top" class="Failure">
    <td>103</td>
    <td>24</td>
    <td>76.70%</td>
    <td>71 ms</td>
    <td>0 ms</td>
    <td>829 ms</td>
  </tr>
</table>

I want to store these values in shell variables, or echo them as key-value pairs extracted from the HTML above. Example:

Tests         : 103
Failures      : 24
Success Rate  : 76.70 %
and so on..

What I can do at the moment is write a Java program that uses a SAX parser or an HTML parser such as jsoup to extract this info.

But using Java here seems like overhead, since the runnable jar would have to be bundled with the "wrapper" script that I want to execute.

I'm sure there must be scripting languages that can do the same, e.g. Perl, Python, Bash.

My problem is that I have zero experience with these. Can somebody help me resolve this "fairly easy" issue?

Quick update:

I forgot to mention that the .html document contains more tables and more rows. Sorry about that (early morning).

Update #2:

I tried to install BeautifulSoup like this, since I don't have root access:

$ wget http://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/beautifulsoup4-4.1.0.tar.gz
$ tar -zxvf beautifulsoup4-4.1.0.tar.gz
$ cp -r beautifulsoup4-4.1.0/bs4 .
$ vi htmlParse.py       # pasted the code from Tichodromas' answer; just in case, this (http://pastebin.com/4Je11Y9q) is what I pasted
$ python htmlParse.py   # run the file

error:

$ python htmlParse.py
Traceback (most recent call last):
  File "htmlParse.py", line 1, in ?
    from bs4 import BeautifulSoup
  File "/home/gdd/setup/py/bs4/__init__.py", line 29
    from .builder import builder_registry
         ^
SyntaxError: invalid syntax

Update #3 :

Running Tichodromas' answer gives this error:

Traceback (most recent call last):
  File "test.py", line 27, in ?
    headings = [th.get_text() for th in table.find("tr").find_all("th")]
TypeError: 'NoneType' object is not callable

Any ideas?

Gandalf StormCrow
  • 2
    There is a nice library for python that might help: BeautifulSoup -> http://www.crummy.com/software/BeautifulSoup/bs4/doc/ . – Jakob S. Aug 03 '12 at 06:53
  • @Jakob S. thank you for the comment. As I told you, I'm a newbie, so I downloaded the tarball and tried to install it with `python setup.py install`, but I get this permission error: `error: could not create '/usr/lib/python2.4/site-packages/bs4': Permission denied`. How do I specify which directory to install it in? Is there something similar to `-prefix` when installing other commands? – Gandalf StormCrow Aug 03 '12 at 07:06
  • I have to admit I am not sure how to achieve this if you don't have root access, and I don't have Linux here at the moment to try. In principle it should be possible to simply copy the package to the correct directory relative to your source .py file, so that it can be found by the interpreter. – Jakob S. Aug 03 '12 at 07:14
  • See the doc: "If all else fails, the license for Beautiful Soup allows you to package the entire library with your application. You can download the tarball, copy its bs4 directory into your application’s codebase, and use Beautiful Soup without installing it at all." ( http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup ) – Jakob S. Aug 03 '12 at 07:16
  • How can the `table` you are interested in be recognized? By position, by ID, ...? –  Aug 03 '12 at 07:23
  • 1
    You could/should install bs4 in a separate virtualenv. You'll have pseudo root privileges in it. – Balthazar Rouberol Aug 03 '12 at 07:29
  • I don't have any privileges, only user ones. Is there still something I could do? – Gandalf StormCrow Aug 03 '12 at 07:34
  • @GandalfStormCrow Try this: `$ virtualenv bs4ve; cd bs4ve; source bin/activate; pip install bs4`. Does this work? –  Aug 03 '12 at 07:37
  • unfortunately `-bash: mkvirtualenv: command not found` – Gandalf StormCrow Aug 03 '12 at 07:50
  • @GandalfStormCrow Then ask your admin to install virtualenv for you. How can one work with crippled tools :( –  Aug 03 '12 at 07:51
  • Please see my update, maybe I did something wrong, my admin is 7hrs away in other time zone :D – Gandalf StormCrow Aug 03 '12 at 08:00
  • @GandalfStormCrow This won't work without using `setup.py`. –  Aug 03 '12 at 08:01
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/14838/discussion-between-tichodroma-and-gandalf-stormcrow) –  Aug 03 '12 at 08:02
  • I'm in there don't know how it works – Gandalf StormCrow Aug 03 '12 at 08:17
  • @Tichodroma I did manage to install the older version of bsoup – Gandalf StormCrow Aug 03 '12 at 08:54

7 Answers

57

A Python solution using BeautifulSoup4 (Edit: with proper skipping. Edit3: Using class="details" to select the table):

from bs4 import BeautifulSoup

html = """
  <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
    <tr valign="top">
      <th>Tests</th>
      <th>Failures</th>
      <th>Success Rate</th>
      <th>Average Time</th>
      <th>Min Time</th>
      <th>Max Time</th>
   </tr>
   <tr valign="top" class="Failure">
     <td>103</td>
     <td>24</td>
     <td>76.70%</td>
     <td>71 ms</td>
     <td>0 ms</td>
     <td>829 ms</td>
  </tr>
</table>"""

soup = BeautifulSoup(html)
table = soup.find("table", attrs={"class":"details"})

# The first tr contains the field names.
headings = [th.get_text() for th in table.find("tr").find_all("th")]

datasets = []
for row in table.find_all("tr")[1:]:
    dataset = zip(headings, (td.get_text() for td in row.find_all("td")))
    datasets.append(dataset)

print datasets

The result looks like this:

[[(u'Tests', u'103'),
  (u'Failures', u'24'),
  (u'Success Rate', u'76.70%'),
  (u'Average Time', u'71 ms'),
  (u'Min Time', u'0 ms'),
  (u'Max Time', u'829 ms')]]

Edit2: To produce the desired output, use something like this:

for dataset in datasets:
    for field in dataset:
        print "{0:<16}: {1}".format(field[0], field[1])

Result:

Tests           : 103
Failures        : 24
Success Rate    : 76.70%
Average Time    : 71 ms
Min Time        : 0 ms
Max Time        : 829 ms
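
If installing bs4 is not an option (no root access, as in the question), roughly the same extraction can be sketched with the standard library's html.parser alone. This is a different technique from BeautifulSoup, not a drop-in replacement. It assumes Python 3 and tables that are not nested, and the names DetailsTableParser, extract_pairs and SAMPLE are made up for illustration:

```python
from html.parser import HTMLParser


class DetailsTableParser(HTMLParser):
    """Collect rows of cell text from every <table class="details">."""

    def __init__(self):
        super().__init__()
        self.tables = []        # each table is a list of rows (lists of str)
        self._in_table = False  # True while inside a matching table
        self._row = None        # cells of the current row, or None
        self._cell = None       # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # Assumption: tables are not nested inside each other.
            classes = (dict(attrs).get("class") or "").split()
            self._in_table = "details" in classes
            if self._in_table:
                self.tables.append([])
        elif self._in_table and tag == "tr":
            self._row = []
        elif self._row is not None and tag in ("th", "td"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None
        elif tag == "table":
            self._in_table = False


def extract_pairs(html):
    """Return one list of (heading, value) pairs per data row."""
    parser = DetailsTableParser()
    parser.feed(html)
    results = []
    for rows in parser.tables:
        if not rows:
            continue
        headings = rows[0]  # the first row holds the field names
        for row in rows[1:]:
            results.append(list(zip(headings, row)))
    return results


SAMPLE = """
<table class="details" border="0">
  <tr valign="top">
    <th>Tests</th><th>Failures</th><th>Success Rate</th>
    <th>Average Time</th><th>Min Time</th><th>Max Time</th>
  </tr>
  <tr valign="top" class="Failure">
    <td>103</td><td>24</td><td>76.70%</td>
    <td>71 ms</td><td>0 ms</td><td>829 ms</td>
  </tr>
</table>"""

for pairs in extract_pairs(SAMPLE):
    for heading, value in pairs:
        print("{0:<16}: {1}".format(heading, value))
```

To read from a file instead of a string, the contents of the file can be passed to extract_pairs in the same way.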
kmonsoor
  • thank you for your answer, answer to your comment above. can I use the class as identifier, I don't have ID ? class would be `details` – Gandalf StormCrow Aug 03 '12 at 07:41
  • @GandalfStormCrow Yes, this can be done. I've edited my answer. –  Aug 03 '12 at 07:46
  • Is it certain that this answer actually works in Python 2.4? @Gandalf, you said in a comment that you installed "the older version of bsoup" (BeautifulSoup 3, I presume). And the line saying "I'm using Python 2.4.3" is gone. So this is a bit confusing. – mzjn Aug 03 '12 at 11:18
  • Python 2.4.3 was [released](http://www.python.org/download/releases/2.4.3/NEWS.txt) on 29-MAR-2006! I think an update would be advisable. –  Aug 03 '12 at 14:00
  • Just a tiny note for people that I think you're going to need a "soup = BeautifulSoup(html)" in there before the table = soup.find... – Ezekiel Kruglick Dec 16 '13 at 22:02
  • 2
    I've got: print(datasets) [, ] while headings are ok. – Peter.k Nov 15 '17 at 21:46
  • For anyone with @Peter.k's issue, add `tuple()` to the `datasets.append` line, like this: `datasets.append(tuple(dataset))`. More discussion [here](https://stackoverflow.com/questions/63380598/using-beautifulsoup-to-extract-a-table-in-python-3). – Roxanne Ready Aug 12 '20 at 17:38
17

Use pandas.read_html:

import pandas as pd
html_tables = pd.read_html('resources/test.html')
df = html_tables[0]
df.T  # transpose to align

Output:

                   0
Tests            103
Failures          24
Success Rate  76.70%
Average Time   71 ms
Jordan Valansi
7

Here is the top answer, adapted for Python 3 compatibility and improved by stripping whitespace in cells:

from bs4 import BeautifulSoup

html = """
  <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
    <tr valign="top">
      <th>Tests</th>
      <th>Failures</th>
      <th>Success Rate</th>
      <th>Average Time</th>
      <th>Min Time</th>
      <th>Max Time</th>
   </tr>
   <tr valign="top" class="Failure">
     <td>103</td>
     <td>24</td>
     <td>76.70%</td>
     <td>71 ms</td>
     <td>0 ms</td>
     <td>829 ms</td>
  </tr>
</table>"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find("table")

# The first tr contains the field names.
headings = [th.get_text().strip() for th in table.find("tr").find_all("th")]

print(headings)

datasets = []
for row in table.find_all("tr")[1:]:
    dataset = dict(zip(headings, (td.get_text().strip() for td in row.find_all("td"))))
    datasets.append(dataset)

print(datasets)
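
To turn one of those dicts into the aligned "key : value" lines the question asks for, something like the following might do. It is only a sketch: format_pairs is a hypothetical helper name, the sample dict is hard-coded from the table above, and it relies on dicts keeping insertion order (Python 3.7+):

```python
def format_pairs(dataset, width=16):
    """Return 'key : value' lines for one row, keys left-aligned to `width`."""
    return ["{0:<{1}}: {2}".format(key, width, value)
            for key, value in dataset.items()]


# Sample row, copied by hand from the table in the question.
row = {"Tests": "103", "Failures": "24", "Success Rate": "76.70%"}

for line in format_pairs(row):
    print(line)
```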
Michel Müller
3

Assuming your html code is stored in a mycode.html file, here is a bash way:

paste -d: <(grep '<th>' mycode.html | sed -e 's,</*th>,,g') <(grep '<td>' mycode.html | sed -e 's,</*td>,,g')

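Note: the output is not perfectly aligned, and this relies on every <th> and <td> sitting on its own line without attributes, as in the sample.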
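
If the goal is shell variables rather than printed text, one possibility is to let a small Python script emit NAME=value lines that the calling shell script can evaluate. This is a sketch under assumptions: to_shell_assignments is a made-up helper, the pairs are hard-coded here instead of parsed, and headings are assumed to start with a letter so that they make valid variable names:

```python
import re
import shlex


def to_shell_assignments(pairs):
    """Turn (heading, value) pairs into shell-safe NAME=value lines."""
    lines = []
    for key, value in pairs:
        # "Success Rate" -> "SUCCESS_RATE"; anything non-alphanumeric becomes "_".
        name = re.sub(r"[^A-Za-z0-9]+", "_", key.strip()).strip("_").upper()
        # shlex.quote adds quotes only where the shell needs them.
        lines.append("{0}={1}".format(name, shlex.quote(value)))
    return lines


# Hard-coded sample; in practice the pairs would come from one of the parsers.
pairs = [("Tests", "103"), ("Failures", "24"), ("Average Time", "71 ms")]

for line in to_shell_assignments(pairs):
    print(line)
```

Assuming the script is saved as extract.py (a hypothetical name), the wrapper script could then run `eval "$(python3 extract.py)"` and use "$TESTS" or "$AVERAGE_TIME" afterwards.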

Stephane Rouberol
2

Below is a Python regex-based solution that I have tested on Python 2.7. It doesn't rely on an XML module, so it also works when the HTML is not well-formed XML.

import re
# input args: html string
# output: tables as a list, column max length
def extract_html_tables(html):
  tables=[]
  maxlen=0
  rex1=r'<table.*?/table>'
  rex2=r'<tr.*?/tr>'
  rex3=r'<(td|th).*?/(td|th)>'
  s = re.search(rex1,html,re.DOTALL)
  while s:
    t = s.group()  # the table
    s2 = re.search(rex2,t,re.DOTALL)
    table = []
    while s2:
      r = s2.group() # the row 
      s3 = re.search(rex3,r,re.DOTALL)
      row=[]
      while s3:
        d = s3.group() # the cell
        #row.append(strip_tags(d).strip() )
        row.append(d.strip() )

        r = re.sub(rex3,'',r,1,re.DOTALL)
        s3 = re.search(rex3,r,re.DOTALL)

      table.append( row )
      if maxlen<len(row):
        maxlen = len(row)

      t = re.sub(rex2,'',t,1,re.DOTALL)
      s2 = re.search(rex2,t,re.DOTALL)

    html = re.sub(rex1,'',html,1,re.DOTALL)
    tables.append(table)
    s = re.search(rex1,html,re.DOTALL)
  return tables, maxlen

html = """
  <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
    <tr valign="top">
      <th>Tests</th>
      <th>Failures</th>
      <th>Success Rate</th>
      <th>Average Time</th>
      <th>Min Time</th>
      <th>Max Time</th>
   </tr>
   <tr valign="top" class="Failure">
     <td>103</td>
     <td>24</td>
     <td>76.70%</td>
     <td>71 ms</td>
     <td>0 ms</td>
     <td>829 ms</td>
  </tr>
</table>"""
print extract_html_tables(html)
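
Note that the cells returned above still include their <td>/<th> tags, because the strip_tags call is commented out and no such function is defined in the answer. A minimal stand-in could look like this (written for Python 3; strip_tags here is a guess at what was intended, not the author's code, and a regex like this is only adequate for simple cells):

```python
import re
from html import unescape


def strip_tags(fragment):
    """Remove anything that looks like a tag, decode entities, trim whitespace."""
    text = re.sub(r"<[^>]*>", "", fragment)
    return unescape(text).strip()


print(strip_tags("<td>71 ms</td>"))
print(strip_tags('<th class="x"> Success Rate </th>'))
```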
paolov
1
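undef $/;           # slurp mode: read all of DATA in one go
$text = <DATA>;

@tabs = $text =~ m!<table.*?>(.*?)</table>!gms;
for (@tabs) {
    @th = m!<th>(.*?)</th>!gms;
    @td = m!<td>(.*?)</td>!gms;
    # Print inside the loop so that every table is reported, not only the last one.
    for $i (0..$#th) {
        printf "%-16s\t: %s\n", $th[$i], $td[$i];
    }
}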

__DATA__
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th>Tests</th>
<th>Failures</th>
<th>Success Rate</th>
<th>Average Time</th>
<th>Min Time</th>
<th>Max Time</th>
</tr>
<tr valign="top" class="Failure">
<td>103</td>
<td>24</td>
<td>76.70%</td>
<td>71 ms</td>
<td>0 ms</td>
<td>829 ms</td>
</tr>
</table>

output as follows:

Tests               : 103
Failures            : 24
Success Rate        : 76.70%
Average Time        : 71 ms
Min Time            : 0 ms
Max Time            : 829 ms
cdtits
1

A Python solution that uses only the standard library (takes advantage of the fact that the HTML happens to be well-formed XML). More than one row of data can be handled.

(Tested with Python 2.6 and 2.7. The question was updated saying that the OP uses Python 2.4, so this answer may not be very useful in this case. ElementTree was added in Python 2.5)

from xml.etree.ElementTree import fromstring

HTML = """
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
  <tr valign="top">
    <th>Tests</th>
    <th>Failures</th>
    <th>Success Rate</th>
    <th>Average Time</th>
    <th>Min Time</th>
    <th>Max Time</th>
  </tr>
  <tr valign="top" class="Failure">
    <td>103</td>
    <td>24</td>
    <td>76.70%</td>
    <td>71 ms</td>
    <td>0 ms</td>
    <td>829 ms</td>
  </tr>
  <tr valign="top" class="whatever">
    <td>A</td>
    <td>B</td>
    <td>C</td>
    <td>D</td>
    <td>E</td>
    <td>F</td>
  </tr>
</table>"""

tree = fromstring(HTML)
rows = tree.findall("tr")
headrow = rows[0]
datarows = rows[1:]

for num, h in enumerate(headrow):
    data = ", ".join([row[num].text for row in datarows])
    print "{0:<16}: {1}".format(h.text, data)

Output:

Tests           : 103, A
Failures        : 24, B
Success Rate    : 76.70%, C
Average Time    : 71 ms, D
Min Time        : 0 ms, E
Max Time        : 829 ms, F
mzjn
  • thank you for your answer. Instead of reading from a particular html string, can I specify like this : get me a table with `class="details"` from this html file and do what you've just done? – Gandalf StormCrow Aug 03 '12 at 07:42
  • Now it works with more than one data row. I have tested this with Python 2.6 and 2.7, but now I see that you use 2.4.3 (which I don't have). So it may not help you. Anyway, I wanted to show that it is possible to do this kind of thing without extra libraries. – mzjn Aug 03 '12 at 08:56
  • 1
    The string formatting syntax that I (and @Tichodroma) use will not work in 2.4. – mzjn Aug 03 '12 at 09:02
  • *get me a table with class="details" from this html file*. Yes, that can be done using ElementTree (but not with Python 2.4). ElementTree was added in Python 2.5. – mzjn Aug 03 '12 at 09:09
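
As a rough illustration of the class="details" selection mentioned in the comments, ElementTree's limited XPath support can filter on an attribute. This sketch assumes Python 3 (or 2.7) and a document that is well-formed XML; DOC, details_tables and table_to_rows are names invented for the example:

```python
import xml.etree.ElementTree as ET

# Made-up document with two tables; only the second has class="details".
DOC = """<html><body>
<table class="summary"><tr><th>Name</th></tr><tr><td>x</td></tr></table>
<table class="details">
  <tr><th>Tests</th><th>Failures</th></tr>
  <tr><td>103</td><td>24</td></tr>
  <tr><td>A</td><td>B</td></tr>
</table>
</body></html>"""


def details_tables(root):
    """Find all <table class="details"> elements below `root`."""
    # Note: ".//" searches descendants only, so `root` itself is never matched.
    return root.findall(".//table[@class='details']")


def table_to_rows(table):
    """Return one dict per data row, keyed by the headings in the first row."""
    rows = table.findall("tr")
    headings = [th.text for th in rows[0]]
    return [dict(zip(headings, [td.text for td in row])) for row in rows[1:]]


root = ET.fromstring(DOC)
for table in details_tables(root):
    for row in table_to_rows(table):
        print(row)
```

For a file on disk, ET.parse(path).getroot() would take the place of ET.fromstring(DOC), with the same caveat that real-world HTML is often not well-formed XML.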