What is the best way to clean a dirty file?

Question

I picked a very messy XML file to practice a little sed. Here is a sample:

 <title><![CDATA[O BR-Linux está em pausa por tempo indeterminado]]></title>
 <title><![CDATA[Funçoes ZZ atinge maioridade: versão 18.3]]></title>
 <title><![CDATA[CloudFlare 1.1.1.1 e parceria com Firefox DoH]]></title>
 <title><![CDATA[Slint, Distro Baseada no Slackware e Acessível]]></title>
 <title><![CDATA[Utilização de CPU em sistemas Linux multi-thread]]></title>
 <title><![CDATA[Realidade Aumentada com 10 anos de idade e 10 linhas de código.]]></title>

I managed to strip the markup and keep only the text, but I'm not very happy with the solution. I would like to improve it, but I really don't know how. Here is the code:

#!/bin/bash

# Trauvin

URL=http://br-linux.org/feed/

lynx -source "$URL" |
    grep '<title><!' |             # get tag title
    sed 's/<[^!>]*>//g' |          # remove tag title           
    sed 's/<[^<]>*//g' |           # remove <!
    sed 's/CDATA/''/g' |           # remove CDATA 
    sed 's/[[^[]//g' |             # remove the square brackets start 
    sed 's/[]*]]//g' |             # remove the square brackets end
    sed 's/>*//g' |                # remove > end
    head -n 5

I used several separate sed calls to keep things readable, so that I could comment every line.

So you are looking to improve some working code? Or does it have any issues? — takendarkk, Aug 26 '20 at 10:30
Then stackoverflow is the wrong site for you since this site is for debugging problems. Try posting on [CodeReview](https://codereview.stackexchange.com/) for improvements to working code. Make sure to review their guidelines before posting. — takendarkk, Aug 26 '20 at 10:36
Don't use regex to parse XML. Use a true XML parser like `xidel` instead: `xidel -s "https://br-linux.org/feed/" -e '//item/title'` — Reino, Aug 29 '20 at 11:52

score 3 · Answer 1 · answered Aug 26 '20 at 11:01

With xmlstarlet:

URL='http://br-linux.org/feed/'
lynx -source "$URL" | xmlstarlet select --template --value-of '//item/title'

Output:

O BR-Linux está em pausa por tempo indeterminado
Funçoes ZZ atinge maioridade: versão 18.3
CloudFlare 1.1.1.1 e parceria com Firefox DoH
Slint, Distro Baseada no Slackware e Acessível
Utilização de CPU em sistemas Linux multi-thread
Realidade Aumentada com 10 anos de idade e 10 linhas de código.
Nova versão da plataforma livre para o mapeamento de iniciativas em agroecologia
Instalação do WordPress com Vagrant
DatabaseCast 82: Ciência e dados
Aplicando ferramentas open source para se dar bem no jogo Suikoden Tierkreis
Tchelinux 2018: Chamada de palestras para Rio Grande
Palestra on-line - conhecendo o Elastic Stack
Curso gratuito básico de linux - Online e ao-vivo
Aulas Particulares de Programação em Shell Script
Protoboard em quadrinhos: manual apresenta 10 circuitos divertidos e desafiadores  que você mesmo pode construir

I always forget about `xmlstarlet`... way nicer than writing a stylesheet by hand. — Shawn, Aug 26 '20 at 11:05

Shawn · Answer 2 · 2020-08-26T11:07:06.853

The best way to work with an XML file is to use XML-aware tools, not regular expressions.
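To illustrate why, here is a minimal sketch using a made-up feed fragment (not taken from the real feed). It contains three valid `<title>` elements, but the line-based `grep '<title><!'` filter from the question only recognises the one that happens to be CDATA-wrapped and on a single line. The helper names `sample_feed` and `count_matched_titles` are hypothetical.

```shell
#!/bin/bash

# Made-up feed fragment: three <title> elements, all of them valid XML.
sample_feed() {
    cat <<'EOF'
<item><title>Plain title, no CDATA</title></item>
<item><title>
<![CDATA[Title split across lines]]></title></item>
<item><title><![CDATA[Title in the expected layout]]></title></item>
EOF
}

# Hypothetical helper: count the lines the question's grep filter would keep.
count_matched_titles() {
    grep -c '<title><!'
}

# Prints 1: two of the three titles are silently dropped.
sample_feed | count_matched_titles
```

An XML-aware tool sees all three titles, because it works on elements rather than on lines of text.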

Example using XSLT to extract just the titles:

feed.xslt:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
  <xsl:for-each select="rss/channel/item">
    <xsl:value-of select="title"/><xsl:text>&#xA;</xsl:text>
  </xsl:for-each>
</xsl:template>
</xsl:stylesheet>

When applied to the RSS feed:

$ xsltproc feed.xslt <(curl -s https://br-linux.org/feed/)
O BR-Linux está em pausa por tempo indeterminado
Funçoes ZZ atinge maioridade: versão 18.3
CloudFlare 1.1.1.1 e parceria com Firefox DoH
Slint, Distro Baseada no Slackware e Acessível
Utilização de CPU em sistemas Linux multi-thread
Realidade Aumentada com 10 anos de idade e 10 linhas de código.
Nova versão da plataforma livre para o mapeamento de iniciativas em agroecologia
Instalação do WordPress com Vagrant
DatabaseCast 82: Ciência e dados
Aplicando ferramentas open source para se dar bem no jogo Suikoden Tierkreis
Tchelinux 2018: Chamada de palestras para Rio Grande
Palestra on-line - conhecendo o Elastic Stack
Curso gratuito básico de linux - Online e ao-vivo
Aulas Particulares de Programação em Shell Script
Protoboard em quadrinhos: manual apresenta 10 circuitos divertidos e desafiadores  que você mesmo pode construir

score 1 · Answer 3 · answered Aug 26 '20 at 10:59

You could 'unwrap' the content step by step, rather than removing the opening and closing delimiters separately:

$ lynx -source "$URL" |
sed 's/<title>\(.*\)<\/title>/\1/' | # <title>x</title> -> x
sed 's/<!\[\(.*\)\]>/\1/' |          # <![x]> -> x
sed 's/CDATA\[\(.*\)\]/\1/'          # CDATA[x] -> x
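
The three unwrapping steps above can also be folded into a single sed call. This is only a sketch: it assumes every title sits on one line and is CDATA-wrapped, as in the sample from the question. The function name `extract_titles` is hypothetical, and the here-document stands in for `lynx -source "$URL"`.

```shell
#!/bin/bash

# Hypothetical helper: print the text of every CDATA-wrapped <title> line.
# Assumes one <title>...</title> element per line, as in the question's sample.
extract_titles() {
    # -n suppresses the default output; the trailing "p" prints only the
    # lines on which the substitution succeeded, so no separate grep is needed.
    sed -n 's/^[[:space:]]*<title><!\[CDATA\[\(.*\)\]\]><\/title>[[:space:]]*$/\1/p'
}

# Sample input instead of the live feed; the <item> line is filler.
extract_titles <<'EOF'
 <title><![CDATA[CloudFlare 1.1.1.1 e parceria com Firefox DoH]]></title>
<item>
 <title><![CDATA[Slint, Distro Baseada no Slackware e Acessível]]></title>
EOF
```

Because the whole wrapper is matched at once, square brackets inside a title survive, which the bracket-stripping steps in the question would remove.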
