I am new to web scraping. My code tries to get the time shown on a website. I located the element and am using XPath to get its text(), but my code always returns "[]". Did I miss anything?

# -*- coding: utf-8 -*-
import requests
from lxml import etree

# Send a browser-like User-Agent header (the value must not repeat the header name)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

# Download the raw page bytes
page = requests.get('https://www.time.gov/', headers=headers).content

# Parse the HTML and look for the element that shows the time
doc_tree = etree.HTML(page)
links = doc_tree.xpath('//div[@id="lzTextSizeCache"]/div[@class="lzswftext"]/text()')

print(links)

The HTML at that location is:

<div class="lzswftext" style="padding: 0px; overflow: visible; width: auto; height: auto; font-weight: bold; font-style: normal; font-family: Arial, Verdana; font-size: 50px; white-space: pre; display: none;">09:37:26 a.m. </div>
panda001
  • could you maybe provide the relevant HTML snippet? – dheiberg Jan 07 '19 at 14:16
  • That means it couldn't find the pattern you gave in the page. What's the desired output? If it's the time displayed on the site, note that it is not hardcoded in the *HTML* but generated by *JavaScript*. – CristiFati Jan 07 '19 at 14:18

2 Answers

Your item is generated asynchronously

  • It takes some time for the page to generate the item you're looking for. The page source contains calls such as setTimeout("updatexearthImage()", 10000);
  • The item is also not part of the initial page source, which you can see by fetching the page with curl, for example.
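
A quick way to confirm this before writing any XPath is to check whether the target class appears in the raw markup at all. The sketch below uses only the standard library; `ClassCollector`, `has_class`, and the two sample strings are hypothetical stand-ins (not taken from the real page) for what the server sends and what the browser shows after JavaScript has run.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class name that appears in the parsed markup."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def has_class(html_text, css_class):
    """Return True if any tag in html_text carries css_class."""
    collector = ClassCollector()
    collector.feed(html_text)
    return css_class in collector.classes


# Hypothetical samples: the server response vs. the DOM after scripts ran.
static_html = '<html><body><script src="app.js"></script></body></html>'
rendered_html = '<html><body><div class="lzswftext">09:37:26 a.m. </div></body></html>'

print(has_class(static_html, "lzswftext"))    # False
print(has_class(rendered_html, "lzswftext"))  # True
```

If the same check on the text returned by requests gives False, no XPath expression will find the element, because it simply is not in that response.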

Solution

Try using a headless browser that runs JavaScript, such as Puppeteer or Selenium. You may also need to add some delays to your code so the page can fully render.
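
A minimal sketch of the Selenium route, assuming Selenium and a matching Chrome driver are installed. The function name `fetch_rendered_text` and the selector `div.lzswftext` are assumptions based on the snippet in the question; the page may have changed since, so treat this as an illustration rather than a tested scraper for that site.

```python
TIME_SELECTOR = "div.lzswftext"  # assumption: class taken from the question's snippet


def fetch_rendered_text(url, selector=TIME_SELECTOR, timeout=15):
    """Load url in headless Chrome and return the text of the first match."""
    # Imported here so this module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Explicit wait instead of a fixed sleep: block until the element
        # has been added to the DOM, or raise after `timeout` seconds.
        element = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, selector))
        )
        # The element in the question has display: none, so .text would be
        # empty; textContent also returns the text of hidden elements.
        return element.get_attribute("textContent").strip()
    finally:
        driver.quit()
```

Calling fetch_rendered_text('https://www.time.gov/') would then return whatever text the browser has placed in that element once it exists. The explicit wait replaces a hard-coded delay, so the code proceeds as soon as the element appears.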

molamk

You aren't getting the time because the response to that request doesn't contain it.

That's because the page makes another request to obtain the time. In this particular case, the request is "https://www.time.gov/actualtime.cgi?disablecache=1546870424051&__lzbc__=wr1d55", which returns this response:

<timestamp time="1546870996756222" delay="1545324126332171"/>

Some JavaScript code then converts that timestamp to a date. You can simulate it with Python:

In [28]: import requests

In [29]: from datetime import datetime

In [30]: res = requests.get('https://www.time.gov/actualtime.cgi?disablecache=1546870424051&__lzbc__=wr1d55')
2019-01-07 09:34:15 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): www.time.gov:443
2019-01-07 09:34:16 [urllib3.connectionpool] DEBUG: https://www.time.gov:443 "GET /actualtime.cgi?disablecache=1546870424051&__lzbc__=wr1d55 HTTP/1.1" 200 None

In [31]: from bs4 import BeautifulSoup

In [32]: soup = BeautifulSoup(res.text, 'html.parser')

In [34]: soup.timestamp['time']
Out[34]: '1546871656757021'

In [35]: ts = soup.timestamp['time']

In [38]: ts = int(soup.timestamp['time'])

In [39]: ts /= 1000000     # because the timestamp is in microseconds

In [40]: print(datetime.utcfromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S'))
2019-01-07 14:34:16
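
The conversion in the session above can also be checked offline, without a network call. This is a sketch using only the standard library; `parse_timestamp` and `EPOCH` are hypothetical names, the sample string is the response quoted earlier in this answer, and it assumes the `time` attribute counts microseconds since the Unix epoch, as the session does.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_timestamp(xml_text):
    """Return the 'time' attribute of a <timestamp/> element as a UTC datetime."""
    element = ET.fromstring(xml_text)
    micros = int(element.attrib["time"])
    # Integer arithmetic via timedelta avoids float rounding on microseconds.
    return EPOCH + timedelta(microseconds=micros)


sample = '<timestamp time="1546870996756222" delay="1545324126332171"/>'
utc_time = parse_timestamp(sample)
print(utc_time.strftime('%Y-%m-%d %H:%M:%S'))  # 2019-01-07 14:23:16
```

Returning a timezone-aware datetime makes the later conversion to a local zone unambiguous.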

To get the time in your local time zone, see: Convert UTC datetime string to local datetime with Python.
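
As a rough sketch of that last step, a timezone-aware UTC datetime can be converted with astimezone(). The value of `ts_us` is the timestamp from the session above; the fixed UTC-5 offset is an assumption chosen only for illustration and is not something the site reports.

```python
from datetime import datetime, timedelta, timezone

ts_us = 1546871656757021  # microseconds since the epoch, from the session above

# Build an aware UTC datetime from the integer microsecond count.
utc_time = datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(microseconds=ts_us)

# With no argument, astimezone() uses the system's local time zone.
local_time = utc_time.astimezone()

# Assumed fixed offset (UTC-5) to show a reproducible result.
eastern = timezone(timedelta(hours=-5), "EST")
eastern_time = utc_time.astimezone(eastern)

print(utc_time.strftime('%Y-%m-%d %H:%M:%S'))      # 2019-01-07 14:34:16
print(eastern_time.strftime('%Y-%m-%d %H:%M:%S'))  # 2019-01-07 09:34:16
```

The value of local_time depends on the machine running the code, which is why the fixed offset is used for the printed example.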

This may be an overcomplicated solution; you could also just use something like Selenium or Scrapy + Splash, which gives you the same content you see in the browser.

Joaquin