1

I'm trying to create a bash script that parses an XML file and saves the result to a CSV file.

For example:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <List>
    <Job id="1" name="John/>
    <Job id="2" name="Zack"/>
    <Job id="3" name="Bob"/>
</List>

I would like the script to save information into a csv file as such:

John | 1
Zack | 2
Bob  | 3

The name and id should each go in a separate cell.

Is there any way I can do this?

user3259914
  • Might have just edited the old question (http://stackoverflow.com/q/21495533/3076724) rather than posting a new one, but you should definitely at least link to it when posting similar questions. – Reinstate Monica Please Feb 02 '14 at 06:43
  • Duplicate: https://stackoverflow.com/questions/14368347/convert-xml-file-to-csv-in-shell-script – Vanuan Oct 26 '17 at 00:26

4 Answers

5

You've posted a query similar to your previous one. I'd again suggest using an XML parser. You could say:

xmlstarlet sel -t -m //List/Job -v @name -o "|" -v @id -n file.xml

It would return

John|1
Zack|2
Bob|3

for your sample data.

Pipe the output to sed: sed "s/|/\t| /" if you want it to appear as in your example.
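
If the goal is a file that a spreadsheet opens with the name and id in separate cells, a comma delimiter is usually a safer choice than `|`. A minimal sketch in plain bash, assuming the pipe-delimited rows shown above and that names never contain a `|`; `pipe_to_csv` is a hypothetical helper name, not part of xmlstarlet:

```bash
#!/bin/bash
# Hypothetical helper: turn "name|id" lines on stdin into CSV rows.
# Assumes names contain no "|" characters.
pipe_to_csv() {
  local q='"' name id
  while IFS='|' read -r name id; do
    # Double any embedded quotes, then wrap the name in quotes.
    printf '%s,%s\n' "$q${name//$q/$q$q}$q" "$id"
  done
}

# Example: the rows the xmlstarlet command prints for the sample data.
printf 'John|1\nZack|2\nBob|3\n' | pipe_to_csv
```

Redirect the result to a file (for example `> jobs.csv`, a made-up name) to get something a spreadsheet can import.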

Community
devnull
2

Try something like this

#!/bin/bash
while read -r line; do
  [[ $line =~ "name=\""(.*)"\"" ]] && name="${BASH_REMATCH[1]}" && [[ $line =~ "Job id=\""([^\"]+) ]] &&  echo "$name | ${BASH_REMATCH[1]}"
done < file 

The line with John is malformed (the closing quote after the name is missing). With that fixed, example output:

John | 1
Zack | 2
Bob | 3
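
For anyone wondering how the loop works: `=~` matches a regular expression, and the text captured by each parenthesised group ends up in the `BASH_REMATCH` array. Here is a commented sketch of the same idea with the patterns stored in variables, assuming well-formed lines where every attribute value has its closing quote. The names `extract_jobs`, `name_re` and `id_re` are made up for this illustration:

```bash
#!/bin/bash
# Hypothetical function: reads XML lines on stdin, prints "name | id".
extract_jobs() {
  local line name id
  # Patterns kept in variables so the quotes need no escaping inside [[ ]].
  local name_re='name="([^"]*)"'
  local id_re='Job id="([^"]+)"'
  while read -r line; do
    # BASH_REMATCH[1] holds the text captured by the first (...) group.
    [[ $line =~ $name_re ]] || continue   # skip lines without a name
    name=${BASH_REMATCH[1]}
    [[ $line =~ $id_re ]] || continue     # skip lines without a Job id
    id=${BASH_REMATCH[1]}
    echo "$name | $id"
  done
}

extract_jobs <<'EOF'
<List>
<Job id="1" name="John"/>
<Job id="2" name="Zack"/>
</List>
EOF
```

Lines that match neither pattern, such as `<List>`, are silently skipped.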
Reinstate Monica Please
  • 1
    in this instance `name="John/>`, there is no double quote after John, so I recommend replacing `[[ $line =~ "name=\""(.*)"\"" ]]` with `[[ $line =~ "name=\""([^\"|/]*) ]]` – BMW Feb 03 '14 at 05:24
  • 2
    @BMW Thanks. I assumed it shouldn't be malformed xml, but if it is could do that or something like `([A-Za-z]*)` – Reinstate Monica Please Feb 03 '14 at 05:33
  • dude, can u elaborate on that short script? I am quite confused. :) nevertheless its looking crazy good. – Dominik May 02 '16 at 11:47
2

Extending the xmlstarlet approach:

Given this xml file as input:

<DATA>
  <RECORD>
    <NAME>John</NAME>
    <SURNAME>Smith</SURNAME>
    <CONTACTS>
      "Smith" LTD,
      London, Mtg Str, 12,
      UK
    </CONTACTS>
  </RECORD>
</DATA>

And this script:

xmlstarlet sel -e utf-8 -t \
  -o "NAME, SURNAME, CONTACTS" -n \
  -m //DATA/RECORD \
  -o "\"" \
  -v "str:replace(normalize-space(NAME), '\"', '\"\"')" -o "\",\"" \
  -v "str:replace(normalize-space(SURNAME), '\"', '\"\"')" -o "\",\"" \
  -v "str:replace(normalize-space(CONTACTS), '\"', '\"\"')" \
  -o "\"" \
  -n file.xml

You'll have the following output:

NAME, SURNAME, CONTACTS
"John","Smith","""Smith"" LTD, London, Mtg Str, 12, UK"
Vanuan
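
The `str:replace` calls in the script above apply the usual CSV quoting rule: every `"` inside a field is doubled, and the whole field is wrapped in quotes. The same rule can be sketched in plain bash; `csv_field` and `csv_row` are hypothetical helper names, and unlike `normalize-space` this sketch does not collapse whitespace:

```bash
#!/bin/bash
# Quote one value as a CSV field: double embedded quotes, wrap in quotes.
csv_field() {
  local q='"' value=$1
  printf '%s' "$q${value//$q/$q$q}$q"
}

# Join all arguments into one comma-separated CSV row.
csv_row() {
  local out='' field
  for field in "$@"; do
    out+="${out:+,}$(csv_field "$field")"
  done
  printf '%s\n' "$out"
}

csv_row 'John' 'Smith' '"Smith" LTD, London, Mtg Str, 12, UK'
```

Quoting every field, even those without commas or quotes, keeps the logic simple at the cost of slightly larger output.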
  • This is a good solution, and elegant. Just I got: compilation error: element with-param XSLT-with-param: Failed to compile select expression 'str:replace' because of unclosed parenthesis in normalize-space call; should read "str:replace(normalize-space(NAME) , '\"', '\"\"')" – Diego1974 Aug 29 '19 at 09:21
  • Thanks for this. Anyone else extracting URLs from XML may find the `&` isn't escaped. Fix this by adding `-T` after the `sel` command, e.g. `xmlstarlet sel -T -e utf-8......` (see https://stackoverflow.com/questions/46255304/unescape-the-ampersand-via-xmlstarlet-bugging-amp) – Neek Mar 11 '22 at 06:29
1

Using sed

sed -nr 's/.*id=\"([0-9]*)\"[^\"]*\"(\w*).*/\2 | \1/p' file

Additionally, based on BroSlow's script, I merged the two matches into a single regex.

#!/bin/bash

while read -r line; do
  [[ $line =~ id=\"([0-9]+).*name=\"([^\"|/]*) ]] && echo "${BASH_REMATCH[2]} | ${BASH_REMATCH[1]}"
done < file
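
Note that `-r` and `\w` in the sed command are GNU extensions. A sketch of a variant that should also work with BSD/macOS sed, using `-E` and a POSIX character class; it assumes `id` comes before `name` on each line, as in the sample, and `jobs_sed` is a made-up wrapper name:

```bash
#!/bin/bash
# Hypothetical wrapper: reads XML on stdin, prints "name | id" rows.
jobs_sed() {
  # -n with the p flag prints only the lines where the substitution matched.
  sed -nE 's/.*id="([0-9]+)"[^"]*"([[:alnum:]_]*).*/\2 | \1/p'
}

printf '<Job id="2" name="Zack"/>\n' | jobs_sed
```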
BMW