
I am trying to use a regex to extract the URLs from a text file. The file contains XML but is saved with a .txt extension; it is named locations.txt. This is its content:

This XML file does not appear to have any style information associated with it. The document tree is shown below. 
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

<sitemap> 
<loc>https://www.apple.com/jp/shop/sitemap-index.xml</loc>  </sitemap>
 <sitemap>
 <loc>https://www.apple.com/ph/shop/sitemap-index.xml</loc>
</sitemap> 
<sitemap>
 <loc>https://www.apple.com/hk-zh/shop/sitemap-index.xml</loc>
 </sitemap> <sitemap> <loc>https://www.apple.com/kr/shop/sitemap-      index.xml</loc> </sitemap> <sitemap> <loc>https://www.apple.com/nz/shop/sitemap-index.xml</loc> </sitemap> <sitemap> <loc>https://www.apple.com/th/shop/sitemap-index.xml</loc> </sitemap> <sitemap> <loc>https://www.apple.com/sg/shop/sitemap-index.xml</loc> </sitemap> <sitemap> <loc>https://www.apple.com/au/shop/sitemap-index.xml</loc> </sitemap> <sitemap> <loc>https://www.apple.com/my/shop/sitemap-index.xml</loc> </sitemap> <sitemap> <loc>https://www.apple.com/tw/shop/sitemap-index.xml</loc> </sitemap> <sitemap> <loc>https://www.apple.com/cn/shop/sitemap-index.xml</loc> </sitemap> <sitemap> <loc>https://www.apple.com/hk/shop/sitemap-index.xml</loc> </sitemap> <sitemap> <loc>https://www.apple.com/uk/shop/sitemap-index.xml</loc> </sitemap> <sitemap> <loc>https://www.apple.com/be-nl/shop/sitemap-index.xml</loc> </sitemap> <sitemap> <loc>https://www.apple.com/it/shop/sitemap-index.xml</loc> </sitemap> <sitemap> <loc>https://www.apple.com/lu/shop/sitemap-index.xml</loc> </sitemap> <sitemap> <loc>https://www.apple.com/hu/shop/sitemap-index.xml</loc> </sitemap> <sitemap> <loc>https://www.apple.com/at/shop/sitemap-index.xml</loc> </sitemap> <sitemap> <loc>https://www.apple.com/cz/shop/sitemap-index.xml</loc> </sitemap> <sitemap> <loc>https://www.apple.com/fi/shop/sitemap-index.xml</loc> </sitemap> <sitemap> <loc>https://www.apple.com/tr/shop/sitemap-index.xml</loc> </sitemap> <sitemap> <loc>https://www.apple.com/de/shop/sitemap-index.xml</loc> </sitemap> <sitemap> <loc>https://www.apple.com/es/shop/sitemap-index.xml</loc> </sitemap> <sitemap> <loc>https://www.apple.com/ie/shop/sitemap-index.xml</loc> </sitemap> <sitemap> <loc>https://www.apple.com/pl/shop/sitemap-index.xml</loc> </sitemap> <sitemap> <loc>https://www.apple.com/se/shop/sitemap-index.xml</loc> </sitemap> <sitemap> <loc>https://www.apple.com/ae/shop/sitemap-index.xml</loc> </sitemap> <sitemap> 
<loc>https://www.apple.com/be-fr/shop/sitemap-index.xml</loc> </sitemap> <sitemap> <loc>https://www.apple.com/dk/shop/sitemap-index.xml</loc> </sitemap> <sitemap>

The script I am using:

import re
re.findall('<(loc)>(https?://)([^\s]+)(</\1>)', open('locations.txt', 'r').read())

But there is no output.

emon
  • **1.** "I am trying to use regex to get the url from the text file" should be fixed to "I am trying to use regex to get the url from XML file", and **2.** Don't use regex to parse XML files. Use proper XML parsers. – DeepSpace Jul 23 '17 at 11:38
  • @DeepSpace What if it is in text format and not in XML format? – emon Jul 23 '17 at 11:39
  • @emon what do you mean? What is the difference between those except the file extension? Take a look if the [`ElementTree`](https://docs.python.org/3/library/xml.etree.elementtree.html) module can help you. – EsotericVoid Jul 23 '17 at 12:00
  • @Bit Okay, that's cool. But what if I want to use regex? Can anything be done here? – emon Jul 23 '17 at 12:03

1 Answer

  1. Do not use regular expressions to parse XML files (or any hierarchical/nested structures)

  2. DO NOT USE REGULAR EXPRESSIONS TO PARSE XML FILES

  3. If you insist on picking up the content between the nearest <loc> and </loc> tags with a regex:

import re

with open("locations.txt", 'r') as f:
    locations = re.findall(r"<loc>(https?\S+)</loc>", f.read())
    # ['https://www.apple.com/jp/shop/sitemap-index.xml',
    #  'https://www.apple.com/ph/shop/sitemap-index.xml',
    #  'https://www.apple.com/hk-zh/shop/sitemap-index.xml', ...]

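Yours fails mostly because you're not escaping your backslashes: the pattern is not a raw string, so `\1` is read as the character `\x01` instead of a backreference to the first group, and the closing tag never matches. Even if you fixed that, you'd be getting four groups per match (three more than you need) instead of just the URL.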
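To make both points concrete, here is a minimal sketch. It first shows what the unescaped `\1` does to the original pattern, then extracts the same URLs with `xml.etree.ElementTree`, as suggested in the comments. `SAMPLE` is a hypothetical two-entry stand-in for locations.txt, not the real file. The parser part assumes well-formed XML, so the browser notice ("This XML file does not appear to have...") and any truncated trailing tag would have to be removed first, or the sitemap downloaded directly rather than copied from the browser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the contents of locations.txt.
SAMPLE = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap> <loc>https://www.apple.com/jp/shop/sitemap-index.xml</loc> </sitemap>
<sitemap> <loc>https://www.apple.com/ph/shop/sitemap-index.xml</loc> </sitemap>
</sitemapindex>"""

# In a normal (non-raw) string literal, '\1' is an octal escape for chr(1),
# so the regex engine never sees a backreference.
assert '\1' == '\x01'

# The question's pattern, non-raw. '\\s' is written with a doubled backslash
# here only to avoid Python's invalid-escape warning; '\1' is left as in the
# question, which is what breaks the match.
broken = re.findall('<(loc)>(https?://)([^\\s]+)(</\1>)', SAMPLE)

# Same pattern as a raw string: it matches, but yields a 4-tuple per URL.
fixed = re.findall(r'<(loc)>(https?://)([^\s]+)(</\1>)', SAMPLE)

# Parser-based alternative: map a prefix to the sitemap namespace and
# collect the text of every <loc> element.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SAMPLE)
urls = [loc.text.strip() for loc in root.findall(".//sm:loc", NS)]

print(broken)
print(fixed[0])
print(urls)
```

The namespace mapping is needed because the `<sitemapindex>` root declares a default `xmlns`; searching for a bare `loc` would find nothing.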

zwer