-1

I have the following XML from a sitemap.xml URL (image attached).

I am trying to extract a URL from the XML by checking whether both of these substrings appear in it:

  1. "Average-Weather-in-"
  2. "-Year-Round"

Then return the matching URL.

So in this case I'd expect the URL https://weatherspark.com/y/8004/Average-Weather-in-Austin-Texas-United-States-Year-Round

WQureshi

3 Answers

1

To find a specific part of HTML or XML documents you can parse them into a DOM data structure and then traverse it programmatically. Even better, you can use XPath to find specific parts efficiently.
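
As a rough sketch of the DOM-traversal approach, here is one way it could look with Python's standard xml.etree.ElementTree. ElementTree supports only a subset of XPath (it has no contains() function), so the path expression selects the loc elements and the substring test is done in Python. The sample XML, the example.com URL, and the helper name find_matching_urls are assumptions for illustration, since the real sitemap was only shown as an image.

```python
import xml.etree.ElementTree as ET

# Sitemap files declare this default namespace, so element lookups need it.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Assumed sample data standing in for the real sitemap.xml.
SAMPLE_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://weatherspark.com/y/8004/Average-Weather-in-Austin-Texas-United-States-Year-Round</loc></url>
  <url><loc>https://example.com/some/other/page</loc></url>
</urlset>"""


def find_matching_urls(xml_text, required=("Average-Weather-in-", "-Year-Round")):
    """Return every <loc> value that contains all of the required substrings."""
    root = ET.fromstring(xml_text)
    urls = []
    # Limited XPath: select each <loc> under each <url> of the root element.
    for loc in root.findall("./sm:url/sm:loc", SITEMAP_NS):
        text = (loc.text or "").strip()
        if all(part in text for part in required):
            urls.append(text)
    return urls


matching_urls = find_matching_urls(SAMPLE_XML)
print(matching_urls)
```

For a full XPath 1.0 engine, where the filter can be written inside the expression with contains(), a third-party library such as lxml would be needed.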


Queeg
1

One solution is to use an XML/HTML parser such as BeautifulSoup:

import requests
from bs4 import BeautifulSoup

# change to your URL:
url = 'https://weatherspark.com/sitemap-271.xml'

# the 'xml' parser requires the lxml package to be installed
soup = BeautifulSoup(requests.get(url).content, 'xml')

for loc in soup.select('loc'):
    text = loc.text
    if 'Average-Weather-in-' in text and '-Year-Round' in text:
        print(text)

Prints:

...

https://weatherspark.com/y/138371/Average-Weather-in-Mulanay-Philippines-Year-Round
https://weatherspark.com/y/138372/Average-Weather-in-Mangero-Philippines-Year-Round
https://weatherspark.com/y/138373/Average-Weather-in-Malibago-Philippines-Year-Round
https://weatherspark.com/y/138374/Average-Weather-in-Madulao-Philippines-Year-Round
https://weatherspark.com/y/138375/Average-Weather-in-Macalelon-Philippines-Year-Round

...
Andrej Kesely
  • I just learned about using CSS selectors with the soup.select() method. However, I can't find 'loc' in the list of CSS selectors. Care to explain? https://www.w3schools.com/cssref/css_selectors.php – WQureshi Mar 15 '23 at 04:21
0

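If you are using XSLT to transform the XML, you could use contains(), which takes two parameters: the string to search (here, the value of the node selected by the XPath) and the text that you are looking for. Note that contains() itself returns a boolean, so it belongs in a predicate on the node, and the path must not be quoted or it is treated as a literal string.

<xsl:value-of select="sm:url/sm:loc[contains(., 'Average-Weather-in-') and contains(., '-Year-Round')]" />

This will return the url, assuming the sm prefix is bound to the sitemap namespace (http://www.sitemaps.org/schemas/sitemap/0.9) and the context node is the urlset element. In XSLT 1.0, xsl:value-of outputs only the first match; use xsl:for-each to output all of them.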

Bryn Lewis