I have 30 sitemap files that look like this:
<?xml version="1.0" encoding="UTF-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.A.com/a</loc>
<lastmod>2013-08-01</lastmod>
<changefreq>weekly</changefreq>
<priority>0.6</priority>
</url>
<url>
<loc>http://www.A.com/b</loc>
<lastmod>2013-08-01</lastmod>
<changefreq>weekly</changefreq>
<priority>0.6</priority>
</url>
...
</urlset>
The output I want is one row per url tag with four columns, printed to the screen:
http://www.A.com/a 2013-08-01 weekly 0.6
http://www.A.com/b 2013-08-01 weekly 0.6
I am currently using Python's BeautifulSoup to parse out the tags, but the performance is horribly slow, since there are 30+ files with about 300,000 lines each. I am wondering whether a shell tool such as AWK or sed could do this, or whether I am simply using the wrong tools.
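For comparison, here is a minimal sketch of a streaming parse that uses only Python's standard library (`xml.etree.ElementTree.iterparse`) instead of BeautifulSoup. It assumes every file uses the namespace shown in the sample above; the function name `iter_sitemap_rows` is hypothetical, and whether it is fast enough for 30+ files of this size would need to be measured.

```python
import io
import xml.etree.ElementTree as ET

# Namespace copied from the sample sitemap; assumed identical in every file.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
FIELDS = ("loc", "lastmod", "changefreq", "priority")


def iter_sitemap_rows(source):
    """Yield one (loc, lastmod, changefreq, priority) tuple per <url> element.

    `source` may be a filename or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == NS + "url":
            # A missing child tag becomes an empty string.
            yield tuple(elem.findtext(NS + name, default="") for name in FIELDS)
            # Discard the children of the finished <url> to limit memory use.
            elem.clear()


if __name__ == "__main__":
    sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.A.com/a</loc>
<lastmod>2013-08-01</lastmod>
<changefreq>weekly</changefreq>
<priority>0.6</priority>
</url>
</urlset>"""
    for row in iter_sitemap_rows(io.BytesIO(sample)):
        print(" ".join(row))
```

Passing a file path instead of the `io.BytesIO` object would process one sitemap file at a time without loading the whole document into memory first.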
Since the sitemap is so well formatted, there might be some regular-expression trick that avoids a full parse.
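To illustrate the kind of regular-expression trick meant here, the sketch below matches one whole url block at a time. It assumes the four child tags always appear in the order shown in the sample and carry no attributes. It also does not decode XML entities such as `&amp;`. The names `URL_RE` and `rows_from_text` are hypothetical.

```python
import re

# One match per <url> block; relies on the fixed tag order of the sample.
URL_RE = re.compile(
    r"<url>\s*"
    r"<loc>(.*?)</loc>\s*"
    r"<lastmod>(.*?)</lastmod>\s*"
    r"<changefreq>(.*?)</changefreq>\s*"
    r"<priority>(.*?)</priority>\s*"
    r"</url>",
    re.DOTALL,
)


def rows_from_text(text):
    """Return a list of (loc, lastmod, changefreq, priority) tuples."""
    return URL_RE.findall(text)


if __name__ == "__main__":
    sample = """<url>
<loc>http://www.A.com/a</loc>
<lastmod>2013-08-01</lastmod>
<changefreq>weekly</changefreq>
<priority>0.6</priority>
</url>"""
    for row in rows_from_text(sample):
        print(" ".join(row))
```

This trades robustness for simplicity: any file that deviates from the sample layout would be silently skipped rather than reported as an error.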
Does anyone have experience splitting records in AWK or sed on a multi-line delimiter instead of the newline character?
Thanks a lot!