
Hello!

I'm doing some scraping on the Premier League website and I'm running into the following problem. When I run this:

from urllib.request import urlopen

# i is the match id taken from a for loop, e.g. 46605
my_url = 'https://www.premierleague.com/match/{}'.format(i)
client = urlopen(my_url)
page_html = client.read()

this specific part of the page_html is returned like this:

<div class="matchDate renderMatchDateContainer" data-kickoff="1583784000000"></div>

when it was supposed to be like this, as I see on the browser:

<div class="matchDate renderMatchDateContainer" data-kickoff="1583784000000">Mon 9 Mar 2020</div>

You can also see it here

As a result, I cannot scrape the date 'Mon 9 Mar 2020'.

Can anyone help? Thanks!

otavios
  • What are you formatting with `.format(i)`? Can you post more code? – 0m3r Apr 15 '20 at 02:06
  • If the page uses JavaScript to add data, then you need Selenium instead of BS and urllib, because BS and urllib can't run JavaScript. – furas Apr 15 '20 at 02:26
  • Does this answer your question? [How to convert integer timestamp to Python datetime](https://stackoverflow.com/questions/9744775/how-to-convert-integer-timestamp-to-python-datetime) – Joe Apr 15 '20 at 04:53
  • Hi @0m3r. The `.format(i)` is just so I can access multiple pages with a for loop. Each URL would look like this: https://www.premierleague.com/match/46605 – otavios Apr 15 '20 at 11:43
  • Hi @furas. I really do not know if the page uses JavaScript, but I was able to scrape lots of data using BS. I'm only struggling with the dates and times. – otavios Apr 15 '20 at 11:46
  • Hello Joe. It helps, yes. But I still do not know how to extract the number "1583784000000". – otavios Apr 15 '20 at 11:47
  • If the HTML contains a `div` with a `data-kickoff` attribute, then use something like `find('div', {'data-kickoff': True})["data-kickoff"]` – furas Apr 15 '20 at 14:29
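
For reference, the attribute lookup discussed in the comments can also be sketched without BeautifulSoup. This is only a minimal illustration that swaps in the standard library's `html.parser`. The class name `KickoffParser` is made up for this example, and the HTML string is the snippet from the question:

```python
from html.parser import HTMLParser


class KickoffParser(HTMLParser):
    """Collects the data-kickoff value of every tag that carries one."""

    def __init__(self):
        super().__init__()
        self.kickoffs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        for name, value in attrs:
            if name == "data-kickoff" and value is not None:
                self.kickoffs.append(int(value))


# The div exactly as urlopen returned it in the question (no text inside)
html = '<div class="matchDate renderMatchDateContainer" data-kickoff="1583784000000"></div>'

parser = KickoffParser()
parser.feed(html)
print(parser.kickoffs)  # [1583784000000]
```

The point is that the timestamp sits in the attribute, so it is available even though the element's text is empty in the downloaded HTML.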

1 Answer


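The `data-kickoff` value is a Unix timestamp in milliseconds. The 1583685000 in data-kickoff=1583685000000 is the same timestamp in seconds and represents 2020/03/09. Is the page computing the displayed date from it with JavaScript? Why don't you try to convert this value yourself?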

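import datetime

num = 1583685000000      # data-kickoff value, in milliseconds
seconds = num // 1000    # drop the milliseconds to get a Unix timestamp in seconds
d = datetime.date.fromtimestamp(seconds)  # interpreted in the local timezone
d.strftime('%d/%m/%y')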

'09/03/20'
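
Since `date.fromtimestamp` uses the machine's local timezone, the printed day can shift depending on where the code runs (1583685000 is 8 March 2020, 16:30 in UTC). A minimal sketch of a timezone-explicit variant follows, assuming UTC is an acceptable reference. The helper name `kickoff_to_utc` is made up for this example, and the input is the value from the question:

```python
from datetime import datetime, timezone


def kickoff_to_utc(ms):
    """Convert a millisecond Unix timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


kickoff = kickoff_to_utc(1583784000000)  # data-kickoff value from the question

print(kickoff.isoformat())           # 2020-03-09T20:00:00+00:00
print(kickoff.strftime('%d/%m/%y'))  # 09/03/20

# Rebuild the text the browser shows; %a and %b follow the default (C) locale here
label = f"{kickoff:%a} {kickoff.day} {kickoff:%b %Y}"
print(label)                         # Mon 9 Mar 2020
```

Passing `tz=` explicitly makes the result the same on every machine. Convert to another timezone afterwards if the local kickoff time is what you need.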

r-beginners
  • Hello. That is actually helpful. I had not looked at it this way, so thanks! But I do not know how to extract this number from the tag, as it's not the tag's text. Can you help me with that? – otavios Apr 15 '20 at 11:50
  • ```
    dates = soup.find_all('div', class_='matchDate renderMatchDateContainer')
    for d in dates:
        dd = d['data-kickoff']
        print(dd)
    ```
    – r-beginners Apr 15 '20 at 13:24
  • It worked like this: `date = page_soup.find('div', class_='matchDate renderMatchDateContainer')['data-kickoff']`, thanks a lot! But do you know why the `page_html` doesn't come with the text like I showed in the question? When I inspect the page in the browser I can see the text (as I showed in the image), and every other part of the HTML comes with its text. I don't understand that. But thanks a lot! – otavios Apr 15 '20 at 13:49
  • I'm also not sure why it doesn't have a value. I'm happy to help in any way I can. – r-beginners Apr 15 '20 at 14:17
  • @OtávioSimõesSilveira In the browser you see the HTML with changes made by JavaScript. But BS can't run JavaScript, so it gives you the HTML without those changes. – furas Apr 15 '20 at 14:35
  • All right, I'm making progress with Selenium now, but it's still not perfect. When I get the HTML element with Selenium it still has the value inside the tag; however, when I parse it using BS it loses the value. Can I parse it with Selenium? – otavios Apr 15 '20 at 18:23
  • You can also get it with selenium. If you want to do more advanced scraping, selenium is a must. – r-beginners Apr 16 '20 at 02:55
  • It worked with this: `date = driver.find_element_by_xpath("//div[@class='matchInfo']//div[@class='matchDate renderMatchDateContainer']").text`. Thanks guys! – otavios Apr 16 '20 at 18:22