Python 3: extract HTML information from a page

Question

I have been googling, but I can't find a good Python 3 solution to my problem. Given the following HTML, how do I extract 2019, 0.7 and 4.50% using Python 3?

<td rowspan='2' style='vertical-align:middle'>2019</td><td rowspan='2' style='vertical-align:middle;font-weight:bold;'>4.50%</td><td rowspan='2' style='vertical-align:middle;font-weight:bold;'>SGD 0.7</td>   <td>SGD0.2      </td>

Do you know about `BeautifulSoup`? If not, check https://www.crummy.com/software/BeautifulSoup/bs4/doc/ – JenilDave, Jun 09 '20 at 11:43

score 1 · Accepted Answer · answered Jun 09 '20 at 11:43


A solution using BeautifulSoup:

from bs4 import BeautifulSoup

txt = '''<td rowspan='2' style='vertical-align:middle'>2019</td><td rowspan='2' style='vertical-align:middle;font-weight:bold;'>4.50%</td><td rowspan='2' style='vertical-align:middle;font-weight:bold;'>SGD 0.7</td>   <td>SGD0.2      </td>'''

soup = BeautifulSoup(txt, 'html.parser')

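# Take the first three <td> cells; *_ absorbs any remaining ones
info_1, info_2, info_3, *_ = soup.select('td')

info_1 = info_1.get_text(strip=True)
info_2 = info_2.get_text(strip=True)
# "SGD 0.7" -> "0.7": keep only the last whitespace-separated token
info_3 = info_3.get_text(strip=True).split()[-1]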

print(info_1, info_2, info_3)

Prints:

2019 4.50% 0.7
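
One caveat: `split()[-1]` relies on a space between the currency code and the amount, so a cell such as `SGD0.2` would come back unchanged. A minimal sketch of a more tolerant approach, assuming the amount is always a plain decimal number; the helper name `extract_amount` is hypothetical and not part of the answer:

```python
import re

def extract_amount(cell_text):
    """Return the first decimal number found in cell_text, or None."""
    match = re.search(r"\d+(?:\.\d+)?", cell_text)
    return match.group(0) if match else None

print(extract_amount("SGD 0.7"))       # 0.7
print(extract_amount("SGD0.2      "))  # 0.2
```

The result is still a string; wrap it in `float()` if a number is needed.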

answered Jun 09 '20 at 11:43

Andrej Kesely


Thank you. Could you tell me what *_ is? I don't really understand how this code works – user3702643 Jun 09 '20 at 12:05
@user3702643 `a, b, *rest = [1, 2, 3, 4]` is standard Python syntax for unpacking iterables (lists, tuples, ...) into variables. After this, `a` will be `1`, `b` will be `2` and `rest` will be `[3, 4]`. More here: https://stackoverflow.com/questions/34308337/unpack-list-to-variables – Andrej Kesely Jun 09 '20 at 12:08
What happens if there are other cells before the required ones? This is quite possible if this is a table. @user3702643 needs to take care of this when implementing. – Prateek Jun 09 '20 at 12:16
@Prateek It varies from page to page; each page has a different structure. The OP has to modify the CSS selector(s) accordingly. – Andrej Kesely Jun 09 '20 at 12:18
@AndrejKesely I think he might be right. It is a huge HTML document from which I only need to extract the above information – user3702643 Jun 09 '20 at 13:46
@user3702643 Does this HTML document have any unique selector? I can help improve the answer. – Prateek Jun 09 '20 at 14:52
@user3702643 The correct way is to open a new question. Accept one of the answers here to close this question, and open a new question where you specify the URL of the document (or include a short sample from it) and the expected output. – Andrej Kesely Jun 09 '20 at 15:03

score -1 · Answer 2 · answered Jun 09 '20 at 11:45


I think this might be helpful even if it does not exactly answer your question:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        print(data)

parser = MyHTMLParser()
parser.feed("<Your HTML here>")

For your particular case, this prints each text node on its own line, including the whitespace between cells: 2019, 4.50%, SGD 0.7 and SGD0.2
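
To keep only the cell contents instead of printing every text node, the parser can collect data while it is inside a `<td>`. A rough sketch using only the standard library; the class name `CellTextParser` is hypothetical, and it assumes the cells are not nested:

```python
from html.parser import HTMLParser

class CellTextParser(HTMLParser):
    """Collect the text of each <td> element into self.cells."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")  # start a new cell

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Ignore text that sits between cells
        if self.in_cell:
            self.cells[-1] += data

html = ("<td rowspan='2' style='vertical-align:middle'>2019</td>"
        "<td rowspan='2' style='vertical-align:middle;font-weight:bold;'>4.50%</td>"
        "<td rowspan='2' style='vertical-align:middle;font-weight:bold;'>SGD 0.7</td>"
        "   <td>SGD0.2      </td>")

parser = CellTextParser()
parser.feed(html)
parser.close()

cells = [c.strip() for c in parser.cells]
print(cells)  # ['2019', '4.50%', 'SGD 0.7', 'SGD0.2']
```

The currency prefix still has to be removed from the third cell, for example with `split()` as in the other answer.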

answered Jun 09 '20 at 11:45

Bogdan Androne


If it doesn't answer the question, how can it be helpful? – Sfili_81 Jun 09 '20 at 11:57
It is helpful because it provides something very close to the actual data that he needs. Please read the question and then my answer :) – Bogdan Androne Jun 09 '20 at 12:02
Please focus on quality answers :) – Prateek Jun 09 '20 at 12:19
