Problem in web scraping with Beautiful Soup and urllib

Question

Hello!

I'm doing some scraping on the Premier League website and I'm running into the following problem. When I run this:

from urllib.request import urlopen

my_url = 'https://www.premierleague.com/match/{}'.format(i)  # i is the match id from a for loop
client = urlopen(my_url)
page_html = client.read()

this specific part of the page_html is returned like this:

<div class="matchDate renderMatchDateContainer" data-kickoff="1583784000000"></div>

when it should look like this, as I see in the browser:

<div class="matchDate renderMatchDateContainer" data-kickoff="1583784000000">Mon 9 Mar 2020</div>

As a result, I cannot scrape the date 'Mon 9 Mar 2020'.

Can anyone help? Thanks!

What are you formatting with `.format(i)`? Can you post more code? – 0m3r, Apr 15 '20 at 02:06
If the page uses JavaScript to add data, then you need Selenium instead of BS and urllib, because BS and urllib can't run JavaScript. – furas, Apr 15 '20 at 02:26
Does this answer your question? [How to convert integer timestamp to Python datetime](https://stackoverflow.com/questions/9744775/how-to-convert-integer-timestamp-to-python-datetime) – Joe, Apr 15 '20 at 04:53
Hi @0m3r. The `.format(i)` is just so I can access multiple pages with a for loop. Each URL would be like this: https://www.premierleague.com/match/46605 – otavios, Apr 15 '20 at 11:43
Hi @furas. I really do not know if the page uses JavaScript, but I was able to scrape lots of data using BS. I'm only struggling with the dates and times. – otavios, Apr 15 '20 at 11:46
Hello Joe. It helps, yes. But I still do not know how to extract the number "1583784000000". – otavios, Apr 15 '20 at 11:47
If the HTML has a `div` with a `data-kickoff` attribute, then use something like `find('div', {'data-kickoff': True})["data-kickoff"]` – furas, Apr 15 '20 at 14:29

score 0 · Answer 1 · answered Apr 15 '20 at 03:55

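The data-kickoff attribute holds a timestamp in milliseconds: the 1583685000 in data-kickoff="1583685000000" is a Unix timestamp in seconds that represents 2020/03/09. Is the page doing that conversion with JavaScript? Why not convert the value yourself?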

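import datetime

num = 1583685000000  # data-kickoff value, in milliseconds
seconds = num // 1000  # drop the last three digits to get seconds
d = datetime.date.fromtimestamp(seconds)  # interpreted in the local time zone
d.strftime('%d/%m/%y')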

'09/03/20'
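
Because `date.fromtimestamp` uses the machine's local time zone, the resulting day can shift depending on where the code runs. Here is a minimal sketch of a time-zone-aware conversion, assuming the `data-kickoff` value is milliseconds since the Unix epoch and that UTC is the zone you want. The function name `kickoff_to_text` is hypothetical, and the input is the value from the question.

```python
from datetime import datetime, timezone


def kickoff_to_text(kickoff_ms):
    """Convert a millisecond Unix timestamp to text like 'Mon 9 Mar 2020'."""
    # Assumption: the value is milliseconds since the epoch, read as UTC.
    dt = datetime.fromtimestamp(int(kickoff_ms) / 1000, tz=timezone.utc)
    # dt.day avoids the zero padding that %d would add ("09" instead of "9").
    return f"{dt:%a} {dt.day} {dt:%b %Y}"


# The attribute value arrives as a string when read from the HTML.
print(kickoff_to_text("1583784000000"))  # Mon 9 Mar 2020
```

If the kickoff should be shown in UK time instead, a different `tz` argument would be needed; that choice is not something the page's HTML tells you.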


r-beginners


Hello. That is actually helpful, I had not looked at it that way, so thanks! But I do not know how to extract this number from the tag, as it's not the tag's text. Can you help me with that? – otavios Apr 15 '20 at 11:50
```
dates = soup.find_all('div', class_='matchDate renderMatchDateContainer')
for d in dates:
    dd = d['data-kickoff']
    print(dd)
```
– r-beginners Apr 15 '20 at 13:24
It worked like this: `date = page_soup.find('div', class_='matchDate renderMatchDateContainer')['data-kickoff']`, thanks a lot! But do you know why `page_html` doesn't come with the text like I showed in the question? When I inspect the page in the browser I can see the text (as I showed in the image), and every other part of the HTML comes with its text. I don't understand that. But thanks a lot! – otavios Apr 15 '20 at 13:49
I'm also not sure why it doesn't have a value. I'm happy to help in any way I can. – r-beginners Apr 15 '20 at 14:17
@OtávioSimõesSilveira in the browser you see the HTML after JavaScript has changed it. But BS can't run JavaScript, so it gives you the HTML without those changes. – furas Apr 15 '20 at 14:35
All right, I'm making progress with Selenium now, but it's still not perfect. When I get the HTML element with Selenium it still has the value inside the element; however, when I parse it using BS it loses the value. Can I parse it with Selenium? – otavios Apr 15 '20 at 18:23
You can also get it with selenium. If you want to do more advanced scraping, selenium is a must. – r-beginners Apr 16 '20 at 02:55
It worked with this: `date = driver.find_element_by_xpath("//div[@class='matchInfo']//div[@class='matchDate renderMatchDateContainer']").text`. Thanks guys! – otavios Apr 16 '20 at 18:22
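
As the comments above point out, the date text is added by JavaScript, but the `data-kickoff` attribute is already in the HTML that `urlopen` returns, so the timestamp can be read without a browser. Here is a minimal sketch using only the standard library's `html.parser`, swapped in for Beautiful Soup so that it runs without extra installs. The class name `KickoffParser` is hypothetical, and the sample HTML is the snippet from the question.

```python
from html.parser import HTMLParser


class KickoffParser(HTMLParser):
    """Collect data-kickoff values from div tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.kickoffs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        attr_map = dict(attrs)
        if tag == "div" and attr_map.get("data-kickoff") is not None:
            self.kickoffs.append(attr_map["data-kickoff"])


# The div as urlopen returned it in the question: attribute present, no text.
page_html = (
    '<div class="matchDate renderMatchDateContainer" '
    'data-kickoff="1583784000000"></div>'
)

parser = KickoffParser()
parser.feed(page_html)
print(parser.kickoffs)  # ['1583784000000']
```

With Beautiful Soup, the equivalent lookup is the `find(...)['data-kickoff']` call quoted in the comments above.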
