How to parse extra attribute from rss xml with read

Question

I want to parse data from Jackett. Initially I tried FlexGet, but I need to extract data that its plug-ins do not expose, so I started this little script to try to parse the extra data. My RSS looks something like this:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="1.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:torznab="http://torznab.com/schemas/2015/feed">
  <channel>
    <atom:link href="http://jackett:9117/" rel="self" type="application/rss+xml" />
    <title>site description</title>
    <description>A general italian tracker</description>
    <link>https://site.some/</link>
    <language>en-us</language>
    <category>search</category>
    <image>
      <url>http://jackett:9117/logos/site.png</url>
      <title>site description</title>
      <link>https://site.some/</link>
      <description>site description</description>
    </image>
    <item>
      <title>Pinnacle Studio Ultimate v23 0 1 177 64 Bit Content Pack</title>
      <guid>https://site.some/index.php?page=torrent-details&amp;id=id</guid>
      <jackettindexer id="site">site description</jackettindexer>
      <comments>https://site.some/index.php?page=torrent-details&amp;id=id</comments>
      <pubDate>Mon, 26 Aug 2019 18:47:48 +0200</pubDate>
      <size>4778150912</size>
      <grabs>4</grabs>
      <description />
      <link>http://jackett:9117/dl/site/?jackett_apikey=apikey&amp;path=Q2ZESjhIOTlRbnNBaTlsTXBueG41dVNtYWFqVjlsbTFockNDVXRieE5OYXRQYTdnclc4Zmc2dGJVNlFiQ01SVW9Wbm9yblJaZnhWXy0wSnVocHRISGxkYmNQLVQ5aWh6S1RORWtqMmwzMTlvTUFNZHlrV1c2czBlbjhNczlFa3VuQ1RxVjRsTkM0UGxRc2RUYzllR0tJaTBVMFFtMWc0UHIybnl0eFVkbGZqcUxuR1BPRDN0MGYwWUNNcVZ5d3NWazgta0Z0SkdrUUZIYnpZZWpUOTA1V2F5b1JGMEpTWlZVSzN0bVkzYzFMU09BLTlBck54bERpRU0yZ3lNTzkwcDU3amhNWE1MOXZmWFhLSEJaa1gwWEpWMHFYUFRfMFMtSlJQX05oalRMNmtpTlc4S0NueDF6c1VZazZfTkg0bE1IZFF5cEE&amp;file=Pinnacle+Studio+Ultimate+v23+0+1+177+64+Bit+Content+Pack</link>
      <category>4010</category>
      <category>100007</category>
      <enclosure url="http://jackett:9117/dl/site/?jackett_apikey=apikey&amp;path=Q2ZESjhIOTlRbnNBaTlsTXBueG41dVNtYWFqVjlsbTFockNDVXRieE5OYXRQYTdnclc4Zmc2dGJVNlFiQ01SVW9Wbm9yblJaZnhWXy0wSnVocHRISGxkYmNQLVQ5aWh6S1RORWtqMmwzMTlvTUFNZHlrV1c2czBlbjhNczlFa3VuQ1RxVjRsTkM0UGxRc2RUYzllR0tJaTBVMFFtMWc0UHIybnl0eFVkbGZqcUxuR1BPRDN0MGYwWUNNcVZ5d3NWazgta0Z0SkdrUUZIYnpZZWpUOTA1V2F5b1JGMEpTWlZVSzN0bVkzYzFMU09BLTlBck54bERpRU0yZ3lNTzkwcDU3amhNWE1MOXZmWFhLSEJaa1gwWEpWMHFYUFRfMFMtSlJQX05oalRMNmtpTlc4S0NueDF6c1VZazZfTkg0bE1IZFF5cEE&amp;file=Pinnacle+Studio+Ultimate+v23+0+1+177+64+Bit+Content+Pack" length="4778150912" type="application/x-bittorrent" />
      <torznab:attr name="category" value="4010" />
      <torznab:attr name="category" value="100007" />
      <torznab:attr name="seeders" value="4" />
      <torznab:attr name="peers" value="6" />
      <torznab:attr name="minimumratio" value="1" />
      <torznab:attr name="minimumseedtime" value="172800" />
      <torznab:attr name="downloadvolumefactor" value="1" />
      <torznab:attr name="uploadvolumefactor" value="1" />
    </item>
  </channel>
</rss>

My first idea was to parse each element in turn to extract the info, so I came up with this:

#!/bin/bash

xmlgetnext () {
   local IFS='>'
   read -d '<' TAG VALUE
}

# /data/Varie/Scripts/mmm


cat /data/Varie/Scripts/mmm | while xmlgetnext ; do
   case $TAG in
      'item')
         title=''
         link=''
         description=''
         downloadvolumefactor=''
         ;;
      'title')
         title="$VALUE"
         ;;
      'link')
         link="$VALUE"
         ;;
      'downloadvolumefactor')
         downloadvolumefactor="$VALUE"
         ;;
      '/item')
         cat<<EOF
------------------------------
Title: $title
Link: $link
Custom value: $downloadvolumefactor
------------------------------
EOF
         ;;
      esac
done

So read starts after the first < and reads up to the next <, then sets TAG and VALUE.

Up to here it works for me. The problem is that I can't find a way to extract downloadvolumefactor, because its value is stored in an attribute of a torznab:attr element instead of being the text of its own element.
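To make the mismatch concrete, here is a minimal sketch of what the xmlgetnext function hands to the case statement for one ordinary element and one torznab:attr element. The sample string is trimmed from the feed, show_tags is a made-up helper, and -r is added to read so backslashes are kept literally.

```bash
#!/usr/bin/env bash
# Reader from the script above: everything up to the next '<' is split
# at the first '>' into TAG and VALUE.
xmlgetnext () {
   local IFS='>'
   read -r -d '<' TAG VALUE
}

# Hypothetical helper: print what each read produced.
show_tags () {
   while xmlgetnext ; do
      printf 'TAG=[%s] VALUE=[%s]\n' "$TAG" "$VALUE"
   done
}

# The trailing </item> matters: a chunk is only processed when another '<' follows it.
printf '%s' '<grabs>4</grabs><torznab:attr name="downloadvolumefactor" value="1" /></item>' | show_tags
```

For the torznab:attr element, TAG ends up holding the whole tag text (element name plus both attributes) and VALUE is empty, so a case branch written as 'downloadvolumefactor' never matches.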

My first idea is to modify the RSS before parsing it, so maybe I can use a regex replacement to transform

<torznab:attr name="uploadvolumefactor" value="1" />

into

<downloadvolumefactor>1</downloadvolumefactor>

Do you have a better idea?

Some call it [summoning the daemon](https://www.metafilter.com/86689/), others refer to it as [the Call for Cthulhu](https://blog.codinghorror.com/parsing-html-the-cthulhu-way/) and few just [turned mad and met the Pony](https://stackoverflow.com/a/1732454/8344060). In short, never parse XML or HTML with a regex! Did you try an XML parser such as `xmlstarlet`, `xmllint` or `xsltproc`? — kvantour, Aug 27 '19 at 07:57

score 1 · Accepted Answer · answered Aug 27 '19 at 09:46

Here is a simple awk script that solves the problem by scanning the input file as plain text. It needs GNU awk (gawk), because the three-argument form of match() is a gawk extension.

script.awk

match($0, /<title>[^<]*/, arr) { title = substr(arr[0], 8) }   # text after "<title>"
match($0, /<link>[^<]*/, arr)  { link = substr(arr[0], 7) }    # text after "<link>"
# [^"]* captures the whole attribute value, not only its first character
match($0, /uploadvolumefactor" value="[^"]*/, arr) { valueFactor = substr(arr[0], 28) }
/<\/item>/ { # output values on item element termination
    print "------------------------------";
    print "Title: " title;
    print "Link: " link;
    print "Custom value: " valueFactor;
    print "------------------------------";
}

running:

awk -f script.awk input.xml

Here input.xml is the feed from the question.

output:

------------------------------
Title: Pinnacle Studio Ultimate v23 0 1 177 64 Bit Content Pack
Link: http://jackett:9117/dl/site/?jackett_apikey=apikey&amp;path=Q2ZESjhIOTlRbnNBaTlsTXBueG41dVNtYWFqVjlsbTFockNDVXRieE5OYXRQYTdnclc4Zmc2dGJVNlFiQ01SVW9Wbm9yblJaZnhWXy0wSnVocHRISGxkYmNQLVQ5aWh6S1RORWtqMmwzMTlvTUFNZHlrV1c2czBlbjhNczlFa3VuQ1RxVjRsTkM0UGxRc2RUYzllR0tJaTBVMFFtMWc0UHIybnl0eFVkbGZqcUxuR1BPRDN0MGYwWUNNcVZ5d3NWazgta0Z0SkdrUUZIYnpZZWpUOTA1V2F5b1JGMEpTWlZVSzN0bVkzYzFMU09BLTlBck54bERpRU0yZ3lNTzkwcDU3amhNWE1MOXZmWFhLSEJaa1gwWEpWMHFYUFRfMFMtSlJQX05oalRMNmtpTlc4S0NueDF6c1VZazZfTkg0bE1IZFF5cEE&amp;file=Pinnacle+Studio+Ultimate+v23+0+1+177+64+Bit+Content+Pack
Custom value: 1
------------------------------

Thanks great tips, I'm searching a pure bash solution without install other software. — Jorman Franzini, Aug 27 '19 at 17:24
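For a pure bash route, one possible sketch keeps the read loop from the question and matches the whole torznab:attr tag text in the case statement, then cuts the value out with parameter expansion. The function name print_items and the inline sample feed are made up for illustration. The sketch assumes the attribute order name, then value, as Jackett emits it in the sample.

```bash
#!/usr/bin/env bash
# Reader from the question (with -r so backslashes are kept literally).
xmlgetnext () {
   local IFS='>'
   read -r -d '<' TAG VALUE
}

# Hypothetical helper: read a feed on stdin, print one block per item.
print_items () {
   local title link factor
   while xmlgetnext ; do
      case $TAG in
         'item')
            title='' link='' factor=''
            ;;
         'title')
            title=$VALUE
            ;;
         'link')
            link=$VALUE
            ;;
         'torznab:attr name="downloadvolumefactor" '*)
            # TAG holds the whole tag; keep what sits between value=" and the next quote.
            factor=${TAG#*value=\"}
            factor=${factor%%\"*}
            ;;
         '/item')
            printf '%s\n' '------------------------------' \
               "Title: $title" "Link: $link" "Custom value: $factor" \
               '------------------------------'
            ;;
      esac
   done
}

# Small inline sample; the closing </channel> is needed so that </item> is followed by a '<'.
print_items <<'XML'
<channel>
  <item>
    <title>Example title</title>
    <link>http://jackett:9117/dl/site/?file=example</link>
    <torznab:attr name="downloadvolumefactor" value="0.5" />
  </item>
</channel>
XML
```

With the real feed this would be called as print_items < /data/Varie/Scripts/mmm. Entities such as &amp; in the link are not decoded, and the loop will be noticeably slower than awk or sed on large feeds.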

KamilCuk · Answer 2 · 2019-08-27T07:41:36.203

Use XML-aware tools.

xmllint --xpath 'string(//*[name()="torznab:attr" and @name="downloadvolumefactor"]/@value)' /data/Varie/Scripts/mmm

For the feed in the question this will return:

1

Do not parse xml files using regexes.
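Staying with xmllint, a per-item report like the one the question asks for could be sketched as follows. The function name report_items is made up for illustration, and the sketch assumes an xmllint build that supports --xpath. It runs xmllint several times per item, which is fine for small feeds but not fast.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print title, link and downloadvolumefactor for every
# <item> of the feed file given as $1. Requires xmllint (libxml2).
report_items () {
   local file=$1 count i
   # Number of <item> elements in the document.
   count=$(xmllint --xpath 'count(//item)' "$file")
   for (( i = 1; i <= count; i++ )); do
      printf '%s\n' '------------------------------'
      printf 'Title: %s\n' "$(xmllint --xpath "string((//item)[$i]/title)" "$file")"
      printf 'Link: %s\n' "$(xmllint --xpath "string((//item)[$i]/link)" "$file")"
      # Same name() test as in the one-liner, restricted to the i-th item.
      printf 'Custom value: %s\n' "$(xmllint --xpath "string((//item)[$i]/*[name()='torznab:attr' and @name='downloadvolumefactor']/@value)" "$file")"
      printf '%s\n' '------------------------------'
   done
}
```

It would be called as report_items /data/Varie/Scripts/mmm. Because string() returns the parsed text, entities such as &amp; in the link come out decoded as &.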

If you really have to use regexes, it is easier to filter the file with awk, sed, or grep with cut and similar:

sed -nr '/.*<torznab:attr name="uploadvolumefactor" value="([^"]*).*/s//\1/p' /data/Varie/Scripts/mmm

Bash while read loops are extremely slow, so it's better to use other tools. If the file format is stable and you can't get xmllint or another XML-aware tool, I would preparse it with sed: read one line, extract information from it, append it to the hold space, and continue reading and parsing until </item> is encountered. But using XML-aware tools will be far more robust and less error-prone.
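The hold-space idea could look roughly like this. It is only a sketch: the function name items_with_sed is made up, it assumes every element sits on its own line as in the sample feed, and it relies on sed -E, which GNU and BSD sed both accept.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads the feed on stdin, prints one block per <item>.
items_with_sed () {
   sed -nE '
      # New item: empty the hold space and start the next cycle.
      /<item>/ { s/.*//; h; d; }
      # Rewrite the interesting lines, then append them to the hold space.
      s/^[[:space:]]*<title>([^<]*)<\/title>.*/Title: \1/; t keep
      s/^[[:space:]]*<link>([^<]*)<\/link>.*/Link: \1/; t keep
      s/.*<torznab:attr name="downloadvolumefactor" value="([^"]*)".*/Custom value: \1/; t keep
      # End of item: print separator, collected lines, separator.
      /<\/item>/ { s/.*/------------------------------/; p; x; s/^\n//; p; x; p; }
      b
      :keep
      H
   '
}
```

It would be used as items_with_sed < /data/Varie/Scripts/mmm. Channel-level title and link lines are collected too, but they are thrown away when the first <item> resets the hold space.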
