Bash script that downloads an RSS feed and saves each entry as a separate html file

Question

I'm trying to create a bash script that downloads an RSS feed and saves each entry as a separate html file. Here's what I've been able to create so far:

curl -L https://news.ycombinator.com//rss > hacke.txt

grep -oP '(?<=<description>).*?(?=</description>)' hacke.txt | sed 's/<description>/\n<description>/g' | grep '<description>' | sed 's/<description>//g' | sed 's/<\/description>//g' | while read description; do
  title=$(echo "$description" | grep -oP '(?<=<title>).*?(?=</title>)')
  if [ ! -f "$title.html" ]; then
    echo "$description" > "$title.html"
  fi
done

Unfortunately, it doesn't work at all :( Please point out where my mistakes are.

Please explain what you plan to do in the second part of your script. — Cyrus, Dec 17 '22 at 17:04
I want to extract everything contained in the variable from the feed, then save each occurrence to a new file whose name is also taken from the variable — zeddie, Dec 17 '22 at 19:26

Reino · Answer 1 · 2022-12-18T12:47:51.793

Please point out where my mistakes are.

Your single mistake is trying to parse XML with regular expressions. You can't parse XML/HTML with RegEx! Please use an XML/HTML-parser like xidel instead.
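To make the failure concrete, here is a minimal sketch that replays the relevant stages of the question's pipeline on a small inline feed. The two-item feed and the `count_lines` helper are made up for illustration, and GNU grep (for `-P`) and GNU sed are assumed, as in the question.

```shell
#!/usr/bin/env bash
# Hypothetical two-item feed, reduced to the elements that matter here.
feed='<rss><channel>
<item><title>First post</title><description>&lt;a href=&quot;https://example.com/1&quot;&gt;Comments&lt;/a&gt;</description></item>
<item><title>Second post</title><description>&lt;a href=&quot;https://example.com/2&quot;&gt;Comments&lt;/a&gt;</description></item>
</channel></rss>'

# Hypothetical helper: count non-empty lines on stdin (grep -c exits 1 on zero matches).
count_lines() { grep -c . || true; }

# Stage 1 of the question's pipeline: the lookarounds already strip the tags.
step1=$(printf '%s\n' "$feed" | grep -oP '(?<=<description>).*?(?=</description>)' || true)

# Stage 2: filter for a tag that stage 1 has just removed, so nothing survives.
step2=$(printf '%s\n' "$step1" | sed 's/<description>/\n<description>/g' | grep '<description>' || true)

# Even on the stage 1 output no <title> can be found:
# <title> is a sibling of <description>, not something inside it.
titles=$(printf '%s\n' "$step1" | grep -oP '(?<=<title>).*?(?=</title>)' || true)

printf 'after stage 1: %s line(s)\n' "$(printf '%s' "$step1" | count_lines)"
printf 'after stage 2: %s line(s)\n' "$(printf '%s' "$step2" | count_lines)"
printf 'titles found:  %s\n' "$(printf '%s' "$titles" | count_lines)"
```

On this sample the counts are 2, 0 and 0: the second `grep` discards everything, so the `while read` loop never runs, and even if it did, `$title` would always be empty.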

The first <item>-element-node (not "variable" as you call them):

$ xidel -s "https://news.ycombinator.com/rss" -e '//item[1]' \
  --output-node-format=xml --output-node-indent
<item>
  <title>Show HN: I made an Ethernet transceiver from logic gates</title>
  <link>https://imihajlov.tk/blog/posts/eth-to-spi/</link>
  <pubDate>Sun, 18 Dec 2022 07:00:52 +0000</pubDate>
  <comments>https://news.ycombinator.com/item?id=34035628</comments>
  <description>&lt;a href=&quot;https://news.ycombinator.com/item?id=34035628&quot;&gt;Comments&lt;/a&gt;</description>
</item>

$ xidel -s "https://news.ycombinator.com/rss" -e '//item[1]/description'
<a href="https://news.ycombinator.com/item?id=34035628">Comments</a>

Note that while the output of the first command is XML, the output of the second command is ordinary text!
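The difference comes from entity decoding: in the serialised XML the markup inside <description> is escaped, and the parser hands back the decoded string value. The sketch below only illustrates that step on the one string shown above. `decode_predefined_entities` is a hypothetical helper that handles nothing but the five predefined XML entities (no numeric character references, no CDATA), so it is not a substitute for the parser.

```shell
#!/usr/bin/env bash
# Content of <description> exactly as it is serialised in the feed shown above.
raw='&lt;a href=&quot;https://news.ycombinator.com/item?id=34035628&quot;&gt;Comments&lt;/a&gt;'

# Hypothetical helper: decode only the five predefined XML entities.
# &amp; is handled last so that "&amp;lt;" becomes "&lt;" and not "<".
decode_predefined_entities() {
  sed -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&quot;/"/g' \
      -e "s/&apos;/'/g" -e 's/&amp;/\&/g'
}

decoded=$(printf '%s\n' "$raw" | decode_predefined_entities)
printf '%s\n' "$decoded"
```

The printed line is the same `<a href=...>Comments</a>` text that the second xidel command returns.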

With the integrated EXPath File Module you could then save this text(!) to an HTML-file:

$ xidel -s "https://news.ycombinator.com/rss" -e '
  //item/file:write-text(
    replace(title,"[<>:&quot;/\\\|\?\*]",())||".html",   (: remove invalid characters :)
    description
  )
'
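For readers who want to see what the replace() call does to a title, here is a rough bash equivalent. `sanitize_filename` is a hypothetical helper, not part of xidel; it simply deletes the same nine characters with `tr`.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the replace() call above:
# delete the characters < > : " / \ | ? * from the given string.
sanitize_filename() {
  printf '%s' "$1" | tr -d '<>:"/\\|?*'
}

title='Show HN: I made an Ethernet transceiver from logic gates'
filename="$(sanitize_filename "$title").html"
printf '%s\n' "$filename"
```

For the title of the first item this gives `Show HN I made an Ethernet transceiver from logic gates.html`, the file name used further down.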

But you can also save it as proper HTML by parsing the <description>-element-node and using file:write() instead:

$ xidel -s "https://news.ycombinator.com/rss" -e '
  //item/file:write(
    replace(title,"[<>:&quot;/\\\|\?\*]",())||".html",
    parse-html(description),
    {"indent":true()}
  )
'

$ xidel -s "Show HN I made an Ethernet transceiver from logic gates.html" -e '$raw'
<html>
  <head/>
  <body>
    <a href="https://news.ycombinator.com/item?id=34035628">Comments</a>
  </body>
</html>

Nice +1 ! I didn't even try to create the files directly from XPath; in my answer I just used XPath for translating the XML to TSV — Fravadona, Dec 18 '22 at 12:46
