0

I have been googling but can't find a good Python 3 solution to my problem. Given the following HTML code, how do I extract 2019, 0.7 and 4.50% using Python 3?

<td rowspan='2' style='vertical-align:middle'>2019</td><td rowspan='2' style='vertical-align:middle;font-weight:bold;'>4.50%</td><td rowspan='2' style='vertical-align:middle;font-weight:bold;'>SGD 0.7</td>   <td>SGD0.2      </td>
user3702643
  • Do you know about `BeautifulSoup`? If not, check: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ – JenilDave Jun 09 '20 at 11:43

2 Answers

1

A solution using BeautifulSoup:

from bs4 import BeautifulSoup

txt = '''<td rowspan='2' style='vertical-align:middle'>2019</td><td rowspan='2' style='vertical-align:middle;font-weight:bold;'>4.50%</td><td rowspan='2' style='vertical-align:middle;font-weight:bold;'>SGD 0.7</td>   <td>SGD0.2      </td>'''

soup = BeautifulSoup(txt, 'html.parser')

info_1, info_2, info_3, *_ = soup.select('td')

info_1 = info_1.get_text(strip=True)
info_2 = info_2.get_text(strip=True)
info_3 = info_3.get_text(strip=True).split()[-1]

print(info_1, info_2, info_3)

Prints:

2019 4.50% 0.7
Andrej Kesely
  • Thank you. Could you tell me what *_ is? I don't really understand how this code works – user3702643 Jun 09 '20 at 12:05
  • @user3702643 `a, b, *rest = [1, 2, 3, 4]` is standard Python syntax for unpacking iterables (lists, tuples, ...) into variables. After this, `a` will be `1`, `b` will be `2` and `rest` will be `[3, 4]`. More here: https://stackoverflow.com/questions/34308337/unpack-list-to-variables – Andrej Kesely Jun 09 '20 at 12:08
  • What happens if there are other `<td>` cells before the required ones? That is quite possible if this is a table. @user3702643 needs to take care of this when implementing. – Prateek Jun 09 '20 at 12:16
  • @Prateek It varies from page to page; each page has its own structure. The OP has to modify the CSS selector(s) accordingly. – Andrej Kesely Jun 09 '20 at 12:18
  • @AndrejKesely I think he might be right. It is a huge HTML document from which I only need to extract the above information – user3702643 Jun 09 '20 at 13:46
  • @user3702643 Does this HTML document have any unique selector? I can help improve the answer. – Prateek Jun 09 '20 at 14:52
  • @user3702643 The correct way is to open a new question. Accept one of the answers here to close this question, then open a new question where you specify the URL of the document (or include a short sample from it) and the expected output. – Andrej Kesely Jun 09 '20 at 15:03
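
For readers who are also unsure about `*_`, here is a minimal sketch of the extended unpacking discussed in the comments. It uses a plain list of strings as a stand-in for the four `<td>` elements; the variable names are illustrative only and not taken from the answer.

```python
# Stand-in for the text of the four <td> cells (illustrative data).
cells = ["2019", "4.50%", "SGD 0.7", "SGD0.2"]

# A starred name collects whatever is left over into a list.
a, b, *rest = cells
print(a, b, rest)

# `_` is just a conventional name for "values I do not care about".
year, rate, amount, *_ = cells
print(year, rate, amount)

# With fewer items than non-starred names, unpacking raises ValueError.
too_short = False
try:
    x, y, z, *_ = cells[:2]
except ValueError:
    too_short = True
print("too few cells:", too_short)
```

This also shows why the approach depends on position: if the page has other cells before the ones you want, the first three names pick up the wrong values, so the selector has to be narrowed first.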
-1

I think this might be helpful, even if it does not exactly answer your question:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        print(data)

parser = MyHTMLParser()
parser.feed("<Your HTML here>")

For your particular case this prints each text fragment on its own line: 2019, 4.50%, SGD 0.7 and SGD0.2 (the whitespace between the cells is printed as well).
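
To get from printing to the three values asked for, one possible extension is to collect the text of each `<td>` instead of printing it. This is only a sketch using the standard library: the class name `CellTextParser` and its attributes are made up for illustration, and it assumes the wanted values are the first three cells, as in the sample from the question.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the stripped text of every <td> cell (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.cells = []       # text of each completed <td>
        self._in_td = False   # True while inside <td>...</td>
        self._parts = []      # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self._parts = []

    def handle_data(self, data):
        # Ignore text outside cells, e.g. whitespace between <td> tags.
        if self._in_td:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._in_td:
            self.cells.append("".join(self._parts).strip())
            self._in_td = False


# The sample HTML from the question, split across lines for readability.
html = (
    "<td rowspan='2' style='vertical-align:middle'>2019</td>"
    "<td rowspan='2' style='vertical-align:middle;font-weight:bold;'>4.50%</td>"
    "<td rowspan='2' style='vertical-align:middle;font-weight:bold;'>SGD 0.7</td>"
    "   <td>SGD0.2      </td>"
)

parser = CellTextParser()
parser.feed(html)
parser.close()

# Assumption: the first three cells are the ones of interest.
year, rate, amount, *_ = parser.cells
amount = amount.split()[-1]  # "SGD 0.7" -> "0.7"

print(year, rate, amount)
```

For a large document you would still need some way to pick out the right row, for example by checking `attrs` in `handle_starttag`; that depends on the actual page structure.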