How can I convert the following XML into a pipe-delimited text file using awk or sed? I tried the awk command shown in the comments below, but it didn't return the full text of the content tag. Any help would be great.

Input_file.dat

        <entry>
            <updated>2014-05-17T16:34:00-07:00</updated>
                <id>994568497</id>
                <title>No longer usable</title>
                <content type="text">I happen to like the new look, but it crashes with each attempt to use it to perform any real action. Fix it quickly please!.</content>
                <im:contentType term="Application" label="Application"/>
                <im:voteSum>0</im:voteSum>
                <im:voteCount>0</im:voteCount>
                <im:rating>1</im:rating>
                <im:version>4.2.0.165</im:version>
                <author><name>Arcdouble</name><uri>https://test.com/us/reviews/id199894255</uri></author>
        </entry>

Expected output_file.csv format

|2014-05-17T16:34:00-07:00|994568497|No longer usable|I happen to like the new look, but it crashes with each attempt to use it to perform any real action. Fix it quickly please!.|1|Arcdouble|https://test.com/us/reviews/id199894255|
glenn jackman
user3347931
  • 1
    You'd have better luck with something like XSLT or at least an XML parser such as the ElementTree module that comes with Python than with awk or sed. They were designed for working with records (organized fields of information) or lines respectively, not hierarchical structures such as those found in XML. –  May 18 '14 at 20:15
  • Yes, that's right, but I'm working in a bash script. I tried the following command; it returns the value, but sometimes it truncates the text message: `awk -F'[<>]' '{ORS = "|"};\ / "output_file.csv" };\ / "output_file.csv" };\ / "output_file.csv" };\ /<content d="" print="" type="text">> "output_file.csv" } ' Input_file.dat` – user3347931 May 18 '14 at 20:26
  • 2
    Please use a proper xml parser, there are many good ones available in any language of your choice. – gniourf_gniourf May 18 '14 at 20:42
  • [`xmlstarlet`](http://xmlstar.sourceforge.net/) would be able to transform this. I would provide an answer, but you're not showing the xml namespaces. – glenn jackman May 18 '14 at 21:11
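
For reference, here is a minimal sketch of the parser-based approach suggested in the comments above, using the ElementTree module that ships with Python. It rests on some assumptions: the question does not show the namespace declarations, so the `im` prefix is bound to a placeholder URI (`urn:example:im`), and the fragment is wrapped in a synthetic `<feed>` root so that it parses. In a real feed the entries would likely live in a default namespace too, and the paths in `FIELDS` would need the matching `{uri}` prefixes. The names `entries_to_rows`, `FIELDS`, and `IM_NS` are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the real namespace URI is not shown in the question, so the
# "im" prefix is bound to a placeholder purely for illustration.
IM_NS = "urn:example:im"

SAMPLE = """\
<entry>
    <updated>2014-05-17T16:34:00-07:00</updated>
    <id>994568497</id>
    <title>No longer usable</title>
    <content type="text">I happen to like the new look, but it crashes with each attempt to use it to perform any real action. Fix it quickly please!.</content>
    <im:contentType term="Application" label="Application"/>
    <im:voteSum>0</im:voteSum>
    <im:voteCount>0</im:voteCount>
    <im:rating>1</im:rating>
    <im:version>4.2.0.165</im:version>
    <author><name>Arcdouble</name><uri>https://test.com/us/reviews/id199894255</uri></author>
</entry>
"""

# Fields in the order of the expected output in the question.
FIELDS = [
    "updated",
    "id",
    "title",
    "content",
    "{%s}rating" % IM_NS,
    "author/name",
    "author/uri",
]


def entries_to_rows(fragment):
    """Return one pipe-delimited row per <entry> element in the fragment."""
    # Wrap the fragment in a synthetic root that declares the "im" prefix.
    # This would fail if the fragment began with an <?xml ...?> declaration.
    wrapped = '<feed xmlns:im="%s">%s</feed>' % (IM_NS, fragment)
    root = ET.fromstring(wrapped)
    rows = []
    for entry in root.iter("entry"):
        values = []
        for path in FIELDS:
            text = entry.findtext(path, default="")
            # Collapse newlines and repeated spaces so that multi-line
            # content stays on a single output row.
            values.append(" ".join(text.split()))
        rows.append("|" + "|".join(values) + "|")
    return rows


if __name__ == "__main__":
    for row in entries_to_rows(SAMPLE):
        print(row)
```

Because the parser reads the whole element rather than one line at a time, content that spans several lines is not truncated, and predefined entities such as `&apos;` are decoded automatically. A literal `|` inside a review would still need escaping, which this sketch does not attempt.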

1 Answer

The code below should work for you:

perl -ne '/<\/entry>/ && print "\n"; />(.*?)</ && !/<name>/  && print $1."|"; /<name>/ && /name>?(.*?)<\/.*?(uri>?)(.*)?<\/uri/ && print $1."|".$3'

Input:

tiago@dell:~$ cat file
        <entry>
            <updated>2014-05-17T16:34:00-07:00</updated>
                <id>994568497</id>
                <title>No longer usable</title>
                <content type="text">I happen to like the new look, but it crashes with each attempt to use it to perform any real action. Fix it quickly please!.</content>
                <im:contentType term="Application" label="Application"/>
                <im:voteSum>0</im:voteSum>
                <im:voteCount>0</im:voteCount>
                <im:rating>1</im:rating>
                <im:version>4.2.0.165</im:version>
                <author><name>Arcdouble</name><uri>https://test.com/us/reviews/id199894255</uri></author>
        </entry>
        <entry>
            <updated>2014-05-17T16:34:00-07:00</updated>
                <id>994568497</id>
                <title>No longer usable</title>
                <content type="text">I happen to like the new look, but it crashes with each attempt to use it to perform any real action. Fix it quickly please!.</content>
                <im:contentType term="Application" label="Application"/>
                <im:voteSum>0</im:voteSum>
                <im:voteCount>0</im:voteCount>
                <im:rating>1</im:rating>
                <im:version>4.2.0.165</im:version>
                <author><name>Arcdouble</name><uri>https://test.com/us/reviews/id199894255</uri></author>
        </entry>

Execution:

tiago@dell:~$ cat file | perl -ne '/<\/entry>/ && print "\n"; />(.*?)</ && !/<name>/  && print $1."|"; /<name>/ && /name>?(.*?)<\/.*?(uri>?)(.*)?<\/uri/ && print $1."|".$3' 
2014-05-17T16:34:00-07:00|994568497|No longer usable|I happen to like the new look, but it crashes with each attempt to use it to perform any real action. Fix it quickly please!.|0|0|1|4.2.0.165|Arcdouble|https://test.com/us/reviews/id199894255
2014-05-17T16:34:00-07:00|994568497|No longer usable|I happen to like the new look, but it crashes with each attempt to use it to perform any real action. Fix it quickly please!.|0|0|1|4.2.0.165|Arcdouble|https://test.com/us/reviews/id199894255
Tiago Lopo
  • 2
    Don't parse xml with regexps. Please. – gniourf_gniourf May 18 '14 at 20:43
  • Sometimes we just need to get the job done with one-liner, but thanks for the suggestion :) – Tiago Lopo May 18 '14 at 20:49
  • 3
    No, don't parse xml with regexps. Please. Just don't. Don't even argue that you need the job done, because this is highly broken from the start. Just don't parse xml with regexps. Trust me. And since you're using Perl, use a proper parser, e.g., LibXML. – gniourf_gniourf May 18 '14 at 21:09
  • I will take your advice, don't want you to have heart attack :) – Tiago Lopo May 18 '14 at 21:10
  • 1
    [Please instruct yourself about this horrible thing](http://blog.codinghorror.com/parsing-html-the-cthulhu-way/) [before you mention anything about my health](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – gniourf_gniourf May 18 '14 at 21:18
  • Thanks for the link and sorry for mentioning your health. I thought a bit of sense of humor would not hurt anyone. – Tiago Lopo May 18 '14 at 21:19
  • 2
    My answer is _actually_ in a humor mood `:)`. Make sure you follow both links (especially the 2nd one is excellent). – gniourf_gniourf May 18 '14 at 21:21
  • Replied without actually following the links, got your sense of humor now LOL. – Tiago Lopo May 18 '14 at 21:28
  • I ran the command and found that some text does not match: it returns Don't as Don't. Is there a way to fix this issue? – user3347931 May 18 '14 at 21:45
  • What is the meaning of `` and `` ? – Pooja Jadav Jul 10 '19 at 06:28
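
The mismatch reported in the comments above is not fully legible, but if the problem is that character entities (for example `&apos;` or `&#39;`) survive in the regex-based output, one possible fix is to decode them afterwards. The sketch below assumes that is the issue. It uses `html.unescape` from the Python standard library, and `decode_entities` is a hypothetical helper name.

```python
import html


def decode_entities(line):
    """Decode character entities left in a pipe-delimited output line."""
    # html.unescape handles named entities (&amp;, &apos;, ...) as well as
    # numeric ones (&#39;).
    return html.unescape(line)


if __name__ == "__main__":
    print(decode_entities("994568497|Don&apos;t update|Fix &amp; retry"))
```

A proper XML parser decodes these entities on its own, so this step is only needed when the text was extracted with regular expressions.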