I am trying to scrape the companies from the li elements in the ul list under "Final Result". The source code looks like this:

import string
import re
import urllib2
import datetime
import bs4
from bs4 import BeautifulSoup

class AJSpider(object):

    def __init__(self):
        print("initializing")
        self.date = str(datetime.date.today())
        self.cur_url = "https://youinvest.moneyam.com/modules/forward-diary/?date={date}&period=month"
        self.datas = []
        print("initialization done")


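    def get_page(self, cur_date):
        # Download the diary page for cur_date and parse it with BeautifulSoup.
        url = self.cur_url.format(date=cur_date)
        try:
            my_page = urllib2.urlopen(url).read().decode("utf-8")
        except urllib2.URLError:
            # Report the failure and re-raise: there is no page to parse.
            print("Failed")
            raise
        my_soup = BeautifulSoup(my_page, "html.parser")
        return my_soup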

    def get_final(self, soup_page):
        temp_data = []
        final_result_section = soup_page.find("h3", text="Final Result")
        print(final_result_section)

    def start_spider(self):
        my_page = self.get_page(self.date)
        self.get_final(my_page)

def main():

    my_spider = AJSpider()
    my_spider.start_spider()

if __name__ == '__main__':
    main()

I found a similar question on Stack Overflow, "Beautiful Soup: Accessing <li> elements from <ul> with no id", but the ul in that question does have a class, which makes things a lot easier.

In my scenario, how can I extract the li elements from the ul? The only identifier is the content of the h3 tag, "Final Result", but it is not an id, so I have no idea how to make use of it.

Victor

1 Answer

Find the h3 element by text and get the following ul list:

ul = soup.find("h3", text="Final Result").find_next_sibling("ul")
for li in ul.find_all("li"):
    print(li.span.get_text(), li.a.get_text())

Note that in recent versions of BeautifulSoup (4.4.0 and later), the text argument was renamed to string, but both work for backwards compatibility.
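To try the lookup without hitting the live site, here is a minimal sketch run against an inline HTML fragment. The fragment, the "Interim Result" section, and the helper name `final_results` are assumptions made up for illustration; the real page may be structured differently, and the li layout (a span for the date, an a for the company) is only inferred from the one-liner above.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the real page markup may differ.
HTML = """
<h3>Interim Result</h3>
<ul>
  <li><span>02 Sep 16</span> <a href="#">Example PLC [EXA]</a></li>
</ul>
<h3>Final Result</h3>
<ul>
  <li><span>01 Sep 16</span> <a href="#">Hays PLC [HAS]</a></li>
  <li><span>01 Sep 16</span> <a href="#">Alumasc Group PLC [ALU]</a></li>
</ul>
"""


def final_results(html):
    """Return (date, company) pairs from the ul that follows the 'Final Result' h3."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h3", string="Final Result")
    if heading is None:
        return []  # no such heading on this page
    ul = heading.find_next_sibling("ul")
    if ul is None:
        return []  # heading present, but no list after it
    return [
        (li.span.get_text(strip=True), li.a.get_text(strip=True))
        for li in ul.find_all("li")
    ]


rows = final_results(HTML)
for date, company in rows:
    print(date, company)
```

The `None` checks are there because `find` returns `None` when nothing matches, which would otherwise surface as an `AttributeError` on days when the page has no "Final Result" section.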

alecxe
  • For the benefit of the person that asked the question, here's the two-line output from this code, for the fragment of HTML in the question: `01 Sep 16 Hays PLC [HAS]` and `01 Sep 16 Alumasc Group PLC [ALU]` – Bill Bell Sep 05 '16 at 21:47
  • Many thanks. I tried it, but I got `TypeError: find() takes no keyword arguments` with both `text` and `string`. Why is this error raised? – Victor Sep 05 '16 at 21:49
  • @Victor looks like you are using `BeautifulSoup` 3 - if you have the following import: `from BeautifulSoup import BeautifulSoup`, you should upgrade! Install BeautifulSoup 4 via: `pip install beautifulsoup4` and change your import to `from bs4 import BeautifulSoup`. – alecxe Sep 05 '16 at 21:51
  • I do use bs4: `import bs4` and `from bs4 import BeautifulSoup` – Victor Sep 05 '16 at 21:53
  • @Victor okay, edit the question and post your complete code please. – alecxe Sep 05 '16 at 21:54
  • @Victor sure, you are returning `my_page` from the `get_page` function, but `my_page` is a string, not a `BeautifulSoup` instance, so `find` is the string method, which takes no keyword arguments. I think you meant to return `my_soup`. – alecxe Sep 05 '16 at 21:58
  • Wonderful!! Silly me, I was experimenting and forgot to change this. Many thanks! – Victor Sep 05 '16 at 22:01
  • @Victor glad to help, see if you can accept the answer to resolve the topic. Thanks. – alecxe Sep 05 '16 at 22:02
  • @alecxe In fact, may I ask a further question? Is there a quick way to extract the text between [ ]? – Victor Sep 05 '16 at 22:08
  • @Victor you would probably need some splitting, or apply a regular expression. Please consider creating a separate question if you need help with it. Thanks! – alecxe Sep 05 '16 at 22:14
  • @alecxe ok, many thanks, I will first spend some time looking for a solution and may create another question if necessary – Victor Sep 05 '16 at 22:17
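For the follow-up about the text between [ ], the regular-expression route suggested in the comments could look roughly like this. It is only a sketch: the helper name `extract_ticker` and the pattern are assumptions, and it takes the first bracketed group in a string such as `Hays PLC [HAS]`.

```python
import re

# Capture one or more characters that are not "]" between "[" and "]".
TICKER_RE = re.compile(r"\[([^\]]+)\]")


def extract_ticker(text):
    """Return the text inside the first [...] in text, or None if there is none."""
    match = TICKER_RE.search(text)
    return match.group(1) if match else None


for name in ("Hays PLC [HAS]", "Alumasc Group PLC [ALU]"):
    print(extract_ticker(name))
```

Plain splitting, for example `text.split("[")[1].split("]")[0]`, also works when every entry is known to contain brackets, but it raises an `IndexError` on entries without them, whereas the helper above returns `None`.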