Parse Sitemap Quickly

Question

I have 30 sitemap files that look like this:

<?xml version="1.0" encoding="UTF-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
    <loc>http://www.A.com/a</loc>
    <lastmod>2013-08-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.6</priority>
</url>
<url>
    <loc>http://www.A.com/b</loc>
    <lastmod>2013-08-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.6</priority>
</url>
...
</urlset>

The output I want is one row per url tag, with four columns, printed to the screen:

http://www.A.com/a 2013-08-01 weekly 0.6
http://www.A.com/b 2013-08-01 weekly 0.6

I am currently using Python's BeautifulSoup to parse the tags out, but the performance is horribly slow, since there are 30+ files with about 300,000 lines each. I am wondering whether a shell tool such as AWK or sed could do this, or whether I am simply using the wrong tool for the job.

Since the sitemaps are so well formatted, there might be some regular expression trick that would handle them.

Does anyone have experience splitting records/rows in AWK or sed on multi-line blocks instead of the newline character?

Thanks a lot!

score 2 · Accepted Answer · edited May 23 '17 at 10:31

I definitely wouldn't suggest regular expressions as a general way of parsing arbitrary XML or HTML, but since you said the files are so consistently formatted, the usual warning can probably be ignored in this case:

sed -n '/^<url>$/{n;N;N;N;s/\n//g;s/ *<[a-z]*>//g;s/<\/[a-z]*>/ /g;p}'

Here is a commented version that explains what is going on:

sed -n '/^<url>$/ {  # if this line contains only <url>
  n;N;N;N              # read the next 4 lines into the pattern space
  s/\n//g              # remove newlines
  s/ *<[a-z]*>//g      # remove opening tags and the spaces before them
  s/<\/[a-z]*>/ /g     # replace closing tags with a space
  p                    # print the pattern space
}' test.txt

The -n option suppresses the automatic printing of the pattern space, so only the lines printed explicitly by p appear in the output.
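
Since sed treats several input files as one continuous stream, the same script could probably be pointed at all of the sitemaps in a single call. A minimal sketch, assuming GNU sed; the scratch directory, the file names sitemap_a.xml and sitemap_b.xml, and the sample data are hypothetical and created only for illustration. The extra s/ $// is an addition that drops the trailing space left by the last closing tag:

```shell
# Build two tiny sample sitemaps in a scratch directory (illustrative data only).
dir=$(mktemp -d)
for name in a b; do
    cat > "$dir/sitemap_$name.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
    <loc>http://www.A.com/$name</loc>
    <lastmod>2013-08-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.6</priority>
</url>
</urlset>
EOF
done

# Same sed script as above, given every file at once.
# s/ $// (an addition) trims the trailing space before printing.
out=$(sed -n '/^<url>$/{n;N;N;N;s/\n//g;s/ *<[a-z]*>//g;s/<\/[a-z]*>/ /g;s/ $//;p}' "$dir"/sitemap_*.xml)
printf '%s\n' "$out"

# Clean up the scratch directory.
rm -r "$dir"
```

With real data, the glob would be replaced by whatever pattern matches the actual sitemap files.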

Dude you ROCK, can you explain a little bit of your crazy regular expression? — B.Mr.W., Aug 15 '13 at 22:22

Ed Morton · Answer 2 · 2013-08-16T11:57:31.920

score 1

sed is an excellent tool for simple substitutions on a single line; for anything else, just use awk:

$ awk -F'[<>]' '
    /^<\/url>/ { inUrl=0; print line }
    inUrl      { line = line (line?" ":"") $3 }
    /^<url>/   { inUrl=1; line="" }
' file
http://www.A.com/a 2013-08-01 weekly 0.6
http://www.A.com/b 2013-08-01 weekly 0.6
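
On the question of splitting records on something other than a newline: awk's RS variable can do that. A minimal sketch, assuming an awk that accepts a multi-character RS (gawk and mawk do; POSIX does not guarantee it). The scratch file and the choice to pick fields by tag name are illustrative assumptions, not part of the script above:

```shell
# Illustrative sample input, written to a scratch file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
    <loc>http://www.A.com/a</loc>
    <lastmod>2013-08-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.6</priority>
</url>
<url>
    <loc>http://www.A.com/b</loc>
    <lastmod>2013-08-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.6</priority>
</url>
</urlset>
EOF

# RS = "</url>" makes each record span one whole url block.
# FS = "[<>]" puts tag names and text into alternating fields,
# so the text of a wanted tag is the field right after its name.
records=$(awk 'BEGIN { RS = "</url>"; FS = "[<>]" }
{
    line = ""
    for (i = 1; i < NF; i++)
        if ($i ~ /^(loc|lastmod|changefreq|priority)$/)
            line = line (line == "" ? "" : " ") $(i + 1)
    if (line != "") print line   # skips the trailing </urlset> remainder
}' "$sample")
printf '%s\n' "$records"

rm -f "$sample"
```

Selecting by tag name means the header line and any unexpected child tags are ignored rather than printed.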

edited Aug 16 '13 at 11:57

answered Aug 16 '13 at 01:39

Ed Morton


potong · Answer 3 · 2013-08-16T07:23:41.110

score 1

This might work for you (GNU sed):

sed '/^<url>/!d;:a;N;/<\/url>/!ba;s/<[^>]*>\s*<[^>]*>/ /g;s/^ \| $//g' file

This gathers each url block into the pattern space, replaces each pair of adjacent tags (and any whitespace between them) with a single space, and removes leading and trailing spaces. All other lines are deleted.

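If you know there will always be exactly 4 lines between the url tags (N is issued five times so that the closing </url> line is read in as well; otherwise the final </priority> tag has no partner and is left in the output):

sed '/^<url>/!d;N;N;N;N;N;s/<[^>]*>\s*<[^>]*>/ /g;s/^ \| $//g' file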

edited Aug 16 '13 at 07:23

answered Aug 16 '13 at 07:17

potong
