
I have an XML file that looks like this:

<?xml version="1.0" encoding="UTF-8"?>
  <url>
    <lastmod>2020-02-04T16:21:00+01:00</lastmod>
    <loc>https://www.h.com</loc>
  </url>
  <url>
    <lastmod>2020-01-31T17:17:00+01:00</lastmod>
    <loc>https://www.h.com</loc>
  </url>
  <url>
    <lastmod>2020-01-27T13:53:00+01:00</lastmod>
    <loc>https://www.h.coml</loc>
  </url>

And a datetime.date that looks like this:

datetime.date(2020, 2, 1)

Is it possible to use BeautifulSoup to delete/ignore the content of a <url> tag if the date in its <lastmod> tag is older than the given datetime.date?

With a result like this:

<?xml version="1.0" encoding="UTF-8"?>
  <url>
    <lastmod>2020-02-04T16:21:00+01:00</lastmod>
    <loc>https://www.h.com</loc>
  </url>

Can somebody help?

gython
  • You need to compare times; [Python time comparison](https://stackoverflow.com/questions/1831410/python-time-comparison) should be helpful. – CC7052 Feb 06 '20 at 16:56

2 Answers


Is this all right?

import time
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''
<?xml version="1.0" encoding="UTF-8"?>
<url>
  <lastmod>2020-02-04T16:21:00+01:00</lastmod>
  <loc>https://www.h.com</loc>
</url>
<url>
  <lastmod>2020-01-31T17:17:00+01:00</lastmod>
  <loc>https://www.h.com</loc>
</url>
<url>
  <lastmod>2020-01-27T13:53:00+01:00</lastmod>
  <loc>https://www.h.coml</loc>
</url>
'''
doc = SimplifiedDoc(html)
urls = doc.urls
startTime = time.strptime("2020-2-1", "%Y-%m-%d")
removeList=[]
for url in urls:
  lastmod = url.lastmod.html # Get lastmod
  tm = time.strptime(lastmod[0:lastmod.find('+')], "%Y-%m-%dT%H:%M:%S")
  if tm<startTime:
    removeList.append(url)
n = len(removeList)
html = doc.html
while n>0: # Delete data in reverse order
  n-=1
  url = removeList[n]
  html = html[0:url._start]+html[url._end:] # Delete url data
print (html.strip())

Result:

<?xml version="1.0" encoding="UTF-8"?>
<url>
  <lastmod>2020-02-04T16:21:00+01:00</lastmod>
  <loc>https://www.h.com</loc>
</url>
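
Since the question asks about BeautifulSoup specifically, the same filtering can be sketched with it. This is only a minimal sketch under a few assumptions: the bs4 package is installed, the built-in "html.parser" is good enough for this small tag-only snippet, and the comparison uses the calendar date exactly as written in <lastmod> (the UTC offset is ignored). The names xml, cutoff and remaining are made up for the example.

```python
from datetime import date, datetime

from bs4 import BeautifulSoup

xml = """<?xml version="1.0" encoding="UTF-8"?>
<url>
  <lastmod>2020-02-04T16:21:00+01:00</lastmod>
  <loc>https://www.h.com</loc>
</url>
<url>
  <lastmod>2020-01-31T17:17:00+01:00</lastmod>
  <loc>https://www.h.com</loc>
</url>
<url>
  <lastmod>2020-01-27T13:53:00+01:00</lastmod>
  <loc>https://www.h.coml</loc>
</url>
"""

cutoff = date(2020, 2, 1)  # the datetime.date from the question

soup = BeautifulSoup(xml, "html.parser")
for url in soup.find_all("url"):
    stamp = url.lastmod.get_text(strip=True)
    # The first 19 characters are 'YYYY-MM-DDTHH:MM:SS'; the offset is dropped.
    modified = datetime.strptime(stamp[:19], "%Y-%m-%dT%H:%M:%S").date()
    if modified < cutoff:
        url.decompose()  # remove the whole <url> element from the tree

remaining = [u.lastmod.get_text(strip=True) for u in soup.find_all("url")]
print(remaining)
print(str(soup).strip())
```

decompose() removes the tag and its children in place, so str(soup) afterwards contains only the <url> entries that are not older than the cutoff.
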
dabingsou

If you are using Python >= 3.7, you can convert the time string (called your_date_string below for convenience) to a datetime in the following way:

datetime.strptime(your_date_string, '%Y-%m-%dT%H:%M:%S%z')

On an older Python version, you first need to remove the last colon from the timezone offset:

if your_date_string[-3] == ':': 
    your_date_string = your_date_string[:-3]+ your_date_string[-2:]
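
Putting the two pieces together, the conversion and the comparison against the date from the question could look roughly like this. It is a sketch, not a definitive implementation: parse_lastmod and cutoff are names invented for the example, and the comparison uses the date in the timestamp's own offset rather than converting to UTC first.

```python
from datetime import date, datetime


def parse_lastmod(value):
    """Parse a string such as '2020-02-04T16:21:00+01:00' into an aware datetime."""
    # Drop the colon in the UTC offset so that '%z' also works before Python 3.7.
    if value[-3] == ":":
        value = value[:-3] + value[-2:]
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S%z")


cutoff = date(2020, 2, 1)  # the datetime.date from the question

for stamp in ("2020-02-04T16:21:00+01:00", "2020-01-31T17:17:00+01:00"):
    is_older = parse_lastmod(stamp).date() < cutoff
    print(stamp, "older than cutoff:", is_older)
```

Calling .date() on the parsed value gives a plain datetime.date, which can be compared directly with the given date.
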
magma