
I'll tell you my problem. (Sorry for my English.)

I have to connect to a server every day to retrieve content.

The page on which I am connecting is in this form:

<tr><td><a href='https://www.test.com/thing1.xlsx' target='_blank'>thing1.xlsx</a><td>01 September 2019 10:02:03 /td><td>1 KB</td></tr>
<tr><td><a href='https://www.test.com/thing2.pdf' target='_blank'>thing2.pdf</a><td>02 September 2019 10:02:03 /td><td>1 KB</td></tr>
<tr><td><a href='https://www.test.com/thing test 3.pdf' target='_blank'>thing test 3.pdf</a><td>04 September 2019 10:02:03 /td><td>1 KB</td></tr>
<tr><td><a href='https://www.test.com/thing test 4.pdf' target='_blank'>thing test 4.pdf</a><td>04 September 2019 10:02:04 /td><td>1 KB</td></tr>
<tr><td><a href='https://www.test.com/thing test 5.pdf' target='_blank'>thing test 5.pdf</a><td>04 September 2019 10:02:05 /td><td>1 KB</td></tr>

From this page (content is added continuously) I must retrieve the URLs (from the href attributes) of the files dated with the current date. For example, if today is September 04, I have to get 3 files: "thing test 3.pdf", "thing test 4.pdf" and "thing test 5.pdf" (notice that some URLs contain spaces).

I started writing a script in Python (with lxml), but I'm a beginner and could use some help.

# coding: utf-8

from lxml import etree, html
parser = etree.HTMLParser()
tree   = etree.parse("test.html", parser)

# Collect every link URL and every link text in the page
URL = tree.xpath('//a/@href')
NAMEFILE = tree.xpath('//a/text()')

print(URL)

I am able to get my URLs, but I can't filter them by today's date. Any ideas?

Prodiguy

2 Answers


For me, the best way is to use the Beautiful Soup library; you can find detailed info at: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In your specific case (getting the dates out of the HTML), this code should work:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open('test.html'), 'html.parser')

# Print the date cell (the second <td>) of every row
for tag in soup.find_all('tr'):
    print(tag.find_all('td')[1].text)
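To keep only today's links, that loop could be extended along these lines. This is just a sketch: I hard-code the target date so it matches your sample rows; in your daily job you would build it with datetime.today().strftime('%d %B %Y'), which assumes an English locale for the month names.

```python
from bs4 import BeautifulSoup

# Inline copy of two of your sample rows (including the stray "/td>" typo)
rows = """<tr><td><a href='https://www.test.com/thing1.xlsx' target='_blank'>thing1.xlsx</a><td>01 September 2019 10:02:03 /td><td>1 KB</td></tr>
<tr><td><a href='https://www.test.com/thing test 3.pdf' target='_blank'>thing test 3.pdf</a><td>04 September 2019 10:02:03 /td><td>1 KB</td></tr>"""

soup = BeautifulSoup(rows, 'html.parser')

# Hard-coded to match the sample; in the real daily job use
# datetime.today().strftime('%d %B %Y') instead (English locale assumed)
target_date = "04 September 2019"

urls = []
for tag in soup.find_all('tr'):
    # The second <td> carries the date; startswith ignores the time
    # and the stray "/td>" text that follows it
    date_text = tag.find_all('td')[1].get_text()
    if date_text.startswith(target_date):
        urls.append(tag.find('a')['href'])

print(urls)  # ['https://www.test.com/thing test 3.pdf']
```

Because the match is done with startswith on the cell's text, the malformed "/td>" in your sample markup does not get in the way.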
  • Unfortunately on the server it is impossible to find Beautiful Soup (I am on redhat) and it is cut off from the internet (proxy) so I can't download it. Do you have any other ideas? – Prodiguy Sep 11 '19 at 09:11
  • @Prodiguy check this, maybe helps: https://stackoverflow.com/questions/18701464/beautifulsoup-installation-or-alternative-without-easy-install – Alvaro Cuervo Sep 13 '19 at 15:10
  • @Prodiguy You can find the last source files of Beautiful Soup on https://pypi.org/project/beautifulsoup4/#files – Alvaro Cuervo Sep 13 '19 at 15:13
from lxml import etree, html
parser = etree.HTMLParser()
tree   = etree.parse("test.html", parser)

URL = tree.xpath('//a/@href')
NAMEFILE = tree.xpath('//a/text()')

print(URL)

dates = []
# A <td> containing a date has the same length as this example string
example = "01 September 2019 10:02:03 /td>"
date_tds = tree.findall('.//td')
for i in date_tds:
    if len(str(i.text)) == len(example):
        # Keep only the "01 September 2019" part, dropping the time
        # (this relies on every time in the sample starting with " 10")
        dates.append(str(i.text).split(" 10")[0])

# Print every URL whose row carries the wanted date
for index, i in enumerate(dates):
    if "01 September 2019" in i:
        print(URL[index])

If you want to check whether it is from today's date, the best way would be to convert the date string to a datetime object and compare it with today's date.
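For example, a minimal sketch of that conversion (the "%d %B %Y" format matches your sample, and %B assumes an English locale for the month names):

```python
from datetime import datetime

# Date string as produced by the split(" 10") step above
date_str = "04 September 2019"

parsed = datetime.strptime(date_str, "%d %B %Y").date()
print(parsed)  # 2019-09-04

# In the daily job, compare against today's date instead of a literal:
if parsed == datetime.today().date():
    print("this file was added today")
```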

PySeeker
  • I tried your solution but unfortunately it doesn't work :( I have no result – Prodiguy Sep 11 '19 at 09:09
  • Are you sure? My solution with the input data from you is: https://www.test.com/thing1.xlsx The list dates contains all date tds. – PySeeker Sep 11 '19 at 12:28