
I've run into an interesting (at least for me) problem. Let's take an XML file:

<a>pair1a</a>
<b>pair1b</b>
<c>randomtext</c>
<a>pair2a</a>
<b>pair2b</b>
...

The <b> tag always comes right after the <a> tag. What I want is to extract the contents of each <a> and <b> element and store them associated with each other. How should I approach this in bash so that I can easily access and manage the data later? I thought about associative arrays, or about putting everything into one array and separating the a contents from the b contents with some kind of delimiter (although this might be tricky). My approach was rather simple: grep everything out into two arrays and then access both with a single index (by the way, I'm used to Perl regex, which is what grep -P uses). Can this be done more simply?

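xml_data=$(curl --silent -L "$xml")   # fetch the document once
mapfile -t a_Array < <(grep -oP '(?<=<a>).*?(?=</a>)' <<< "$xml_data")   # one array element per match
mapfile -t b_Array < <(grep -oP '(?<=<b>).*?(?=</b>)' <<< "$xml_data")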
choroba
psukys

1 Answer


XML cannot be parsed properly with shell tools alone. There's a very nice text about this topic.

That said, there might be exceptions to the rule. For one, if your input is not arbitrary XML but XML of a specific, fixed format, you might be able to parse it using grep and similar tools.

In your example I guess that the elements <a>...</a> and <b>...</b> never have attributes, are never abbreviated as <a/> when empty, each span exactly one line, and always follow each other. I also guess we can assume that no <![CDATA[...]]> section or similar will appear in your XML, which could in turn contain something that looks like your elements. Finally, we assume that there is no ugly whitespace in your input (something like < a >).

If all this is the case, you can simply grep for '^<a>' and '^<b>', yes. You might also find grep's options -A and -B useful, for instance in:

grep -A 1 '^<a>' my.xml

This will print all lines starting with <a> and each line following such a line. -B can be used to include lines before the ones matching the regexp.
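Under the same assumptions (one element per line, no attributes, <b> directly after <a>), the pairing itself can be done in a single pass in pure bash, so the two arrays cannot drift out of step when an <a> has no matching <b>. The following is only a minimal sketch: the names sample, a_vals, b_vals, re_a and re_b are made up for illustration, and the here-string stands in for your real file or curl output.

```bash
#!/usr/bin/env bash
# Sketch: collect <a>/<b> pairs into two parallel arrays in one pass.
# Assumes the simple one-element-per-line format discussed above.

# Hypothetical sample input standing in for my.xml.
sample='<a>pair1a</a>
<b>pair1b</b>
<c>randomtext</c>
<a>pair2a</a>
<b>pair2b</b>'

# Regexes kept in variables so < and > need no escaping inside [[ ]].
re_a='^<a>(.*)</a>$'
re_b='^<b>(.*)</b>$'

a_vals=()   # contents of each paired <a>
b_vals=()   # contents of the <b> that directly follows it
last_a=''
have_a=0    # 1 only while the previous line was an <a> line

while IFS= read -r line; do
    if [[ $line =~ $re_a ]]; then
        last_a=${BASH_REMATCH[1]}
        have_a=1
    elif [[ $line =~ $re_b ]] && (( have_a )); then
        # Store both halves under the same index.
        a_vals+=("$last_a")
        b_vals+=("${BASH_REMATCH[1]}")
        have_a=0
    else
        # Any other line breaks the <a>/<b> adjacency.
        have_a=0
    fi
done <<< "$sample"

# Usage: walk both arrays with a single index.
for i in "${!a_vals[@]}"; do
    printf '%s -> %s\n' "${a_vals[$i]}" "${b_vals[$i]}"
done
```

Parallel indexed arrays are used here rather than an associative array because they also cope with empty or repeated <a> values; if your <a> contents are guaranteed to be unique and non-empty, an associative array (declare -A, bash 4 or later) keyed on the <a> value would work as well.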

Community
Alfe