0

I have this code:

cat response_error.xml | sed  -ne  's#\s*<[^>]*>\s*##gp'  >> response_error.csv

but all the values that sed extracts from the XML are joined together, for example:

084521AntonioCallas 

I want to get this result:

084521,Antonio,Callas, 

Is it possible?

I have to write a script that collects the XML documents from the previous day, extracts only the data from them (without the <...> tags) and saves it to a CSV file with the values separated by commas, like this: 084521,Antonio,Callas. The XML looks like this:

<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/">
<s:Body xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<GenerarInformeResponse xmlns="http://experian.servicios.CAIS">
<GenerarInformeResult>
<InformeResumen xmlns="http://experian.servicios.CAIS.V2">
<IdSuscriptor>084521</IdSuscriptor>
<ReferenciaConsulta>Antonio Callas 00000000</ReferenciaConsulta>
<Error>
<Codigo>0000</Codigo>
<Descripcion>OK</Descripcion>
</Error>
<Documento>
<TipoDocumento>
<Codigo>01</Codigo>
<Descripcion>NIF</Descripcion>
</TipoDocumento>
<NumeroDocumento>000000000</NumeroDocumento>
<PaisDocumento>
<Codigo>000</Codigo>
<Descripcion>ESPAÑA</Descripcion>
</PaisDocumento>
</Documento>
<Resumen>
<Nombre>
<Nombre1>XXX</Nombre1>
<Nombre2>XXX</Nombre2>
<ApellidosRazonSocial>XXX</ApellidosRazonSocial>
</Nombre>
<Direccion>
<Direccion>XXX</Direccion>
<NombreLocalidad>XXX</NombreLocalidad>
<CodigoLocalidad/>
<Provincia>
<Codigo>39</Codigo>
<Descripcion>XXX</Descripcion>
</Provincia>
<CodigoPostal>39012</CodigoPostal>
</Direccion>
<NumeroTotalOperacionesImpagadas>1</NumeroTotalOperacionesImpagadas>
<NumeroTotalCuotasImpagadas>0</NumeroTotalCuotasImpagadas>
<PeorSituacionPago>
<Codigo>6</Codigo>
<Descripcion>XXX</Descripcion>
</PeorSituacionPago>
<PeorSituacionPagoHistorica>
<Codigo>6</Codigo>
<Descripcion>XXX</Descripcion>
</PeorSituacionPagoHistorica>
<ImporteTotalImpagado>88.92</ImporteTotalImpagado>
<MaximoImporteImpagado>88.92</MaximoImporteImpagado>
<FechaMaximoImporteImpagado>
<DD>27</DD>
<MM>03</MM>
<AAAA>2019</AAAA>
</FechaMaximoImporteImpagado>
<FechaPeorSituaiconPagoHistorica>
<DD>27</DD>
<MM>03</MM>
<AAAA>2019</AAAA>
</FechaPeorSituaiconPagoHistorica>
<FechaAltaOperacionMasAntigua>
<DD>16</DD>
<MM>12</MM>
<AAAA>2015</AAAA>
</FechaAltaOperacionMasAntigua>
<FechaUltimaActualizacion>
<DD>27</DD>
<MM>03</MM>
<AAAA>2019</AAAA>
</FechaUltimaActualizacion>
</Resumen>
</InformeResumen>
</GenerarInformeResult>
</GenerarInformeResponse>
</s:Body>
</s:Envelope>   
Bartek
  • The thing is that `sed` isn't matching `084521`, `Antonio` and `Callas`; it's matching the XML opening and closing tags around the strings you want and replacing them with nothing. You could have it replace them with `,` first and then replace multiple consecutive commas with a single one, but you'd be better off using an XML parser such as xmlstarlet – Aaron Apr 03 '19 at 08:40
  • Some call it [summoning the daemon](https://www.metafilter.com/86689/), others refer to it as the [Call for Cthulhu](https://blog.codinghorror.com/parsing-html-the-cthulhu-way/) and a few [just turned mad and met the Pony](https://stackoverflow.com/a/1732454/8344060). In short, never parse XML or HTML with a regex! Did you try an XML parser such as `xmlstarlet`, `xmllint` or `xsltproc`? – kvantour Apr 03 '19 at 08:43
  • As @Aaron mentioned, `xmlstarlet` is the tool for the job. As an aside, pattern-matching questions should contain a sample of the pattern and the expected output. – sjsam Apr 03 '19 at 08:44
  • Welcome to Stack Overflow. While we are willing to help, it is hard for us to understand what you really want. We need some input, such as what your `response_error.xml` looks like and what you really want to achieve. That is, an [mcve]. Also, have a look at [ask] and don't be afraid to take the [tour]. – kvantour Apr 03 '19 at 09:05
  • You need to use a code block to add XML to your question; it's the `{}` button in the editor (click it while your XML is selected). Otherwise you can do it by prefixing each XML line with 4 spaces – Aaron Apr 03 '19 at 09:30
  • @Aaron exactly, all the values are joined together, but I need them separated by commas. – Bartek Apr 03 '19 at 09:32
  • We get that, but a robust answer to your question would need to be based on an XML parser, and for that we'd need a sample of your XML. I can see in the edit history that you've tried to include one, but ideally we'd need to see the whole XML, or at least the tag hierarchy that leads to the data you need. – Aaron Apr 03 '19 at 09:34
  • @Aaron I have added the XML, thank you for helping me – Bartek Apr 03 '19 at 09:43
  • Thanks, this is great. Can you also check which xml parser you have installed on your system? The answer will be similar whatever parser you have available, but we might as well use the one you'll have. You can test that using `type xmlstarlet xmllint xsltproc`. – Aaron Apr 03 '19 at 09:46
  • @Aaron I have installed xmllint and xsltproc – Bartek Apr 03 '19 at 09:49

3 Answers

0

You can extract the IdSuscriptor using the following command:

xmllint --xpath '//*[local-name()="IdSuscriptor"]/text()' response_error.xml

And the ReferenciaConsulta using the following command:

xmllint --xpath '//*[local-name()="ReferenciaConsulta"]/text()' response_error.xml

To produce the desired IdSuscriptor,FirstName,LastName I would use the following script:

# Extract the two fields; local-name() sidesteps the XML namespaces
id_suscriptor=$(xmllint --xpath '//*[local-name()="IdSuscriptor"]/text()' response_error.xml)
referencia_consulta=$(xmllint --xpath '//*[local-name()="ReferenciaConsulta"]/text()' response_error.xml)
# ReferenciaConsulta looks like "Antonio Callas 00000000"; split it on spaces
first_name=$(echo "$referencia_consulta" | cut -d' ' -f1)
last_name=$(echo "$referencia_consulta" | cut -d' ' -f2)
echo "$id_suscriptor,$first_name,$last_name"

Note that this assumes the ReferenciaConsulta field will always contain a string starting with the first name and last name separated with a space.
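Since the script relies on that assumption, it may be worth checking it before writing a row. The following is only a sketch: `to_csv_row` is a hypothetical helper name, and it takes the two values as arguments so that it can be tried without `xmllint`. It assumes an sh-compatible shell with the default `IFS`.

```shell
# Hypothetical helper: build "id,First,Last" from the two extracted values.
to_csv_row() {
    row_id=$1
    # Unquoted on purpose so the shell splits the field on whitespace;
    # globbing is switched off while doing so.
    set -f
    set -- $2
    set +f
    # Refuse to guess if the field does not hold at least two words.
    if [ "$#" -lt 2 ]; then
        echo "unexpected ReferenciaConsulta format" >&2
        return 1
    fi
    printf '%s,%s,%s\n' "$row_id" "$1" "$2"
}

to_csv_row 084521 "Antonio Callas 00000000"
```

With the sample values this prints 084521,Antonio,Callas. In the script it could be called as `to_csv_row "$id_suscriptor" "$referencia_consulta"`.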

Aaron
  • I get an error when I run the code: response_error.xml:3: parser error : Extra content at the end of the document – Bartek Apr 03 '19 at 10:10
  • Ah, maybe your file is a sequence of `` tags? – Aaron Apr 03 '19 at 10:35
  • Hmm I don't think xmllint is equipped to handle that (technically it's an invalid XML document). I hope to find a more robust solution but I guess for now you can use `sed -nE '//N;s/([^<]*).*([^ ]*) ([^ ]*).*/\1,\2,\3/p' response_error.xml` – Aaron Apr 03 '19 at 10:54
0

If you want to parse XML, use a dedicated XML parser like Saxon.

If you want to parse a strange text file with some funny unrelated angle brackets, try this:

#! /bin/sed -nf

s/^<IdSuscriptor>\([0-9]\+\)<\/IdSuscriptor>/\1,/
t match1
b next

: match1
h
b

: next
s/^<ReferenciaConsulta>\([^ ]\+\) \([^ ]\+\) [0-9]\+<\/ReferenciaConsulta>/\1,\2,/
t match2
b

: match2
H
g
s/\n//
p

Explanation

`t` jumps to `match1` if the preceding `s` command made a substitution. Otherwise `b` jumps to `next`.

In case of a match, `h` copies the matching string into the hold space and `b` stops the processing of the current line.

The second `s` command works the same way, with the difference that in case of no match `b` continues with the next line.

In case of the second match, `H` appends the pattern space to the hold space, `g` copies the hold space to the pattern space, `s` removes the newline between the two matches and `p` prints the result.
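For comparison, the same hold-space idea can be condensed into two address blocks without labels. This is a sketch, not a replacement for the script: `extract_row` is a hypothetical name, it assumes each element sits on its own line starting in the first column (as in the sample XML), and unlike the script it emits no trailing comma.

```shell
# Hypothetical wrapper around a condensed version of the hold-space idea.
# 1st block: strip the tags from the IdSuscriptor line and keep it in the hold space.
# 2nd block: reduce ReferenciaConsulta to "First,Last", append it to the hold
#            space, join the two parts with a comma and print the row.
extract_row() {
    sed -n \
        -e '/^<IdSuscriptor>/{s/<[^>]*>//g;h;}' \
        -e '/^<ReferenciaConsulta>/{s/<[^>]*>//g;s/^\([^ ]*\) \([^ ]*\).*/\1,\2/;H;g;s/\n/,/;p;}' \
        "$@"
}

printf '%s\n' \
    '<IdSuscriptor>084521</IdSuscriptor>' \
    '<ReferenciaConsulta>Antonio Callas 00000000</ReferenciaConsulta>' |
    extract_row
```

On the two sample lines this prints 084521,Antonio,Callas. Because it works line by line, `extract_row response_error.xml >> response_error.csv` should also cope with a file that contains several envelopes one after another.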

Conclusion

If you do not know how to do it with sed, don't try it. Learn a real programming language like Perl, JavaScript or Python instead. sed is a relic of bygone times.

ceving
0

If your data is in a file named 'd', try GNU sed:

sed -Ez 's/<[^>]*>//g;s/\n+|\s+/,/g;' d
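Two things are worth knowing about this one-liner. The `-z` option is specific to GNU sed; it makes sed treat the whole file as a single record, which is what allows a tag that spans two lines (such as the `s:Body` tag in the sample) to be stripped. Also, every text node in the file becomes a field, not only the three requested values. Where GNU sed is not available, the same idea might be sketched with `tr` joining the lines first. `flatten_xml` is a hypothetical name and the function reads from standard input.

```shell
# Hypothetical portable variant: flatten all text nodes into one CSV row.
flatten_xml() {
    # Join all lines first so that tags spanning several lines are stripped too,
    # then drop the tags, turn each run of spaces into one comma and trim
    # the commas left over at both ends.
    tr '\n\t' '  ' |
        sed -e 's/<[^>]*>//g' -e 's/  */,/g' -e 's/^,//' -e 's/,$//'
}

printf '%s\n' \
    '<s:Body a="1"' \
    'b="2">' \
    '<IdSuscriptor>084521</IdSuscriptor>' \
    '<ReferenciaConsulta>Antonio Callas 00000000</ReferenciaConsulta>' \
    '</s:Body>' |
    flatten_xml
```

For the real file the call would be `flatten_xml < d`. Note that the number at the end of ReferenciaConsulta ends up as a field of its own.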