parse XML pairs with bash

Question

I've run into an interesting (at least for me) problem. Let's take an XML file:

<a>pair1a</a>
<b>pair1b</b>
<c>randomtext</c>
<a>pair2a</a>
<b>pair2b</b>
...

The <b> tag always comes after the <a> tag. What I want is to get the contents of <a> and <b> saved and associated with each other. How should I approach this problem in bash so that I can easily access and manage the data later? I thought about associative arrays, or putting everything in one array and separating the a contents from the b contents with some kind of delimiter (although this might be tricky). My approach was rather simple: grep everything out into two arrays and then use a single index for both (by the way, I've gotten used to Perl regex, and that's what grep is using here). Can this be done more simply?

a_Array=$(curl --silent -L $xml | grep -oP '(?<=<a>).*?(?=</a>)')
b_Array=$(curl --silent -L $xml | grep -oP '(?<=<b>).*?(?=</b>)')

I'm aware of such tool, although not sure whether it would change the code structure, would it? — psukys, May 16 '13 at 14:36
Nevermind, I've misunderstood your problem. That's a pretty odd xml btw. What's wrong with your code? — emi, May 16 '13 at 14:42
it's not much of a problem, more like a question of advice of how could the structure be rearranged. — psukys, May 16 '13 at 14:44

score 1 · Accepted Answer · edited May 23 '17 at 11:43

XML cannot be parsed properly with shell tools alone. There's a very nice text about this topic.

That said, there are exceptions to the rule. For one, if your input is not arbitrary XML but XML in a specific, known format, you might be able to parse it using grep and similar tools.

In your example I guess that the elements <a>...</a> and <b>...</b> never have attributes, are never abbreviated as <a/> when empty, each span exactly one line, and always follow each other. I also guess we can assume that no <![CDATA[...]]> section or similar will appear in your XML, inside which there might be something that looks like your elements. Finally, we assume that there is no whitespace ugliness in your input (something like < a >).

If all this is the case, you can simply grep for '^<a>' and '^<b>', yes. You might also find grep's options -A and -B useful, for instance in:

grep -A 1 '^<a>' my.xml

This will print all lines starting with <a> and each line following such a line. -B can be used to include lines before the ones matching the regexp.
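To address the "associated together" part of the question, here is a minimal sketch of how the pairs could be collected into an associative array. It only holds under the assumptions listed above (one element per line, no attributes, no CDATA) and needs bash 4 or later. The names xml_input, pairs, keys and current_a are made up for illustration, and the sample input is hard-coded; in practice it would come from something like curl --silent -L "$xml".

```bash
#!/usr/bin/env bash
# Sketch only: pairs each <a> line with the next <b> line that follows it.
# Assumes bash 4+ and the strict one-element-per-line format discussed above.

# Hypothetical sample input standing in for the downloaded file.
xml_input='<a>pair1a</a>
<b>pair1b</b>
<c>randomtext</c>
<a>pair2a</a>
<b>pair2b</b>'

declare -A pairs   # contents of <a> -> contents of the following <b>
keys=()            # <a> contents in document order (associative arrays are unordered)
current_a=''

# Keeping the patterns in variables avoids quoting trouble with < and > inside [[ ]].
re_a='^<a>(.*)</a>$'
re_b='^<b>(.*)</b>$'

while IFS= read -r line; do
    if [[ $line =~ $re_a ]]; then
        # Remember the <a> contents until the matching <b> shows up.
        current_a=${BASH_REMATCH[1]}
    elif [[ -n $current_a && $line =~ $re_b ]]; then
        pairs[$current_a]=${BASH_REMATCH[1]}
        keys+=("$current_a")
        current_a=''
    fi
done <<< "$xml_input"

# Print the pairs in the order they appeared.
for key in "${keys[@]}"; do
    printf '%s => %s\n' "$key" "${pairs[$key]}"
done
```

Because the loop reads from a here-string rather than from a pipe, it runs in the current shell and the arrays are still filled afterwards. Note that duplicate <a> contents would overwrite each other in pairs; if that can happen, two indexed arrays sharing one index (as in the question) may be the safer structure.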
