In bash, how do I parse a string to remove everything except what is between two tags?

Question

The user who tagged this as a duplicate missed the forest for the trees, and their suggested duplicate does not answer this question sufficiently.

Here's a sample of what this string might be:

<mobile_device><general><id>15</id><device_name>iPad</device_name><name>Timmy</name><asset_tag/><id>16</id><device_name>iPhone</device_name><name>Spike</name><asset_tag/></general></mobile_device>

I want to parse this somehow to only end up with:

<id>15</id><id>16</id>

So, remove everything that's not contained between an opening id tag and a closing id tag. There could potentially be any number of tags (a more realistic upper limit would be 60,000), but there will always be at least one pair.

I've been playing around with sed for this, but variations of this syntax haven't worked at all:

sed 's/.*\(<id>*</id>\).*//'

Many thanks in advance for any guidance!

Greg Nisbet · Answer 1 · 2017-03-15T02:21:31.687

Assuming your data is in input.xml, here's a way using xmllint and a simple XPath query

$ cat input.xml | xmllint --xpath '//id' -
<id>15</id><id>16</id>

Here's something quick and dirty you can use to extract just the info between <id>...</id> if xmllint or a more appropriate tool isn't available.

$ cat input.xml | perl -pe 's/(<.?id.)/\n$1/g' | grep '^<id>' | sed -e 's/$/<\/id>/'

sed is fundamentally line-oriented, and it's hard to perform a substitution that includes a newline. tr on the other hand is fundamentally character-oriented. If we use perl to insert newlines in strategic places, then we can filter out just the lines that begin with <id> and add the matching </id> back again.
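As a rough sketch of the same split-then-filter idea, the pipeline below swaps perl for tr and splits on every < so that each tag starts its own line. The function name extract_ids and the sample variable are made up for illustration, and the sketch assumes the id values contain no < or newline characters.

```shell
#!/bin/sh
# Hypothetical helper (not from the answer): reads XML on stdin and
# prints every <id>...</id> element on a single line.
extract_ids() {
    tr '<' '\n' |                   # one tag per line; the "<" itself is consumed
    grep '^id>' |                   # keep only lines that were opening <id> tags
    sed 's/^/</; s/$/<\/id>/' |     # restore the "<" and append the closing tag
    tr -d '\n'                      # join the elements back onto one line
    echo                            # end the output with a single newline
}

sample='<mobile_device><general><id>15</id><device_name>iPad</device_name><name>Timmy</name><asset_tag/><id>16</id><device_name>iPhone</device_name><name>Spike</name><asset_tag/></general></mobile_device>'

printf '%s\n' "$sample" | extract_ids
```

On the sample string this prints <id>15</id><id>16</id>. Because tr works on characters rather than lines, it does not care how many ids there are or how long the line is.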

Using xmllint --format is also a good low-complexity way to convert XML into pretty-printed XML, which is easier to rip apart with line-oriented tools if you can't get the XPath query right.

$ cat input.xml | xmllint --format - | grep '^\s*<id>'

score 1 · Accepted Answer · answered Mar 15 '17 at 02:53


With sed it could look like this:

echo "$STRING" | sed 's/<\/id>.*<id>/<\/id><id>/;s/<mobile_device><general>//;s/<device_.*_device>//;'

Output will look like this ...

<id>15</id><id>16</id>

How it works:
First, everything between </id> and <id> is removed via sed 's/<\/id>.*<id>/<\/id><id>/'.

Then the <mobile_device> and <general> tags are removed via sed 's/<mobile_device><general>//'.

Last but not least, everything from <device_ to _device> (here, from <device_name> through </mobile_device>) is removed via sed 's/<device_.*_device>//'.
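To make the three substitutions easier to follow, here is a small sketch that applies them one at a time to the sample string from the question. The variable names step1, step2 and step3 are only illustrative.

```shell
#!/bin/sh
STRING='<mobile_device><general><id>15</id><device_name>iPad</device_name><name>Timmy</name><asset_tag/><id>16</id><device_name>iPhone</device_name><name>Spike</name><asset_tag/></general></mobile_device>'

# Step 1: collapse everything between the first </id> and the last <id>.
step1=$(printf '%s\n' "$STRING" | sed 's/<\/id>.*<id>/<\/id><id>/')

# Step 2: drop the leading wrapper tags.
step2=$(printf '%s\n' "$step1" | sed 's/<mobile_device><general>//')

# Step 3: drop everything from "<device_" through the final "_device>".
step3=$(printf '%s\n' "$step2" | sed 's/<device_.*_device>//')

# Print the intermediate results, one per line.
printf '%s\n' "$step1" "$step2" "$step3"
```

The last line printed is <id>15</id><id>16</id>. Note that .* is greedy, so step 1 spans from the first </id> to the last <id>; with three or more ids, the ones in the middle would be removed as well.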

Hope this helps.

answered Mar 15 '17 at 02:53

Mario


Hey @suleiman thanks for that answer, with some easy modification this is going to work great :) I didn't list the entire XML file I'm parsing through, but I'll just add all of the extra tags on the back end. Thanks! – iMatthewCM Mar 20 '17 at 16:58
You're welcome. Glad I could help. – Mario Mar 20 '17 at 17:18

score 0 · Answer 3 · answered Mar 15 '17 at 02:45

Your sed command is close to working; here are some adjustments:

sed 's=.*\(<id>.*</id>\).*=\1='

You need to pick a delimiting character that does not appear in the command expression. / is used in the closing </id>, so I used '=' instead.
Then, * modifies the immediately preceding regular expression to mean "0 or more". You had it following a >, which means "0 or more closing brackets". The . represents any single character and is what you really should use, so the parenthesized expression should now match an entire <id> field.
Finally, the \1 indicates where you want the result of the first parenthesized subexpression to be placed in the result string.

This has some limitations as a general solution: the leading .* is greedy, so only the last <id> field on each line is kept. If you know there is only one ID field per line, it should serve.
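A quick way to see both the working case and the limitation is to run the command on a line with one id and on a line with two. The function name extract_last_id and the two input lines are invented for this sketch.

```shell
#!/bin/sh
# Hypothetical wrapper around the sed command from this answer.
extract_last_id() {
    sed 's=.*\(<id>.*</id>\).*=\1='
}

# One id on the line: the whole <id> field is kept.
one=$(printf '%s\n' '<general><id>15</id><name>Timmy</name></general>' | extract_last_id)

# Two ids on the line: the greedy leading .* swallows the first one.
two=$(printf '%s\n' '<id>15</id><name>Timmy</name><id>16</id><name>Spike</name>' | extract_last_id)

printf '%s\n' "$one" "$two"
```

The first call yields <id>15</id>, while the second yields only <id>16</id>.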

score 0 · Answer 4 · answered Mar 15 '17 at 05:46

Another option, in awk. Define both RS and ORS as > and print the records between the markers <id and </id:

$ awk 'BEGIN{RS=ORS=">"} /<id/,/<\/id/' file
<id>15</id><id>16</id>$

As ORS is > you need to add the final newline manually with printf:

$ awk 'BEGIN{RS=ORS=">"} /<id/,/<\/id/; END{printf "\n"}' file
<id>15</id><id>16</id>
$
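To see why the range pattern works, it can help to look at the records awk produces once RS is set to >. The sketch below first prints the first few records, then wraps the command above in a function that reads stdin. The names ids_with_awk and sample are illustrative only.

```shell
#!/bin/sh
sample='<mobile_device><general><id>15</id><device_name>iPad</device_name><name>Timmy</name><asset_tag/><id>16</id><device_name>iPhone</device_name><name>Spike</name><asset_tag/></general></mobile_device>'

# Peek at the first four records: each one ends where a ">" used to be.
printf '%s\n' "$sample" | awk 'BEGIN { RS = ">" } NR <= 4 { print NR ": " $0 }'

# Hypothetical wrapper: prints the <id> elements found on stdin.
ids_with_awk() {
    awk '
        BEGIN { RS = ORS = ">" }   # split on ">" and put ">" back on output
        /<id/, /<\/id/             # print records from "<id" through "...</id"
        END { printf "\n" }        # ORS is ">", so add the final newline by hand
    '
}

printf '%s\n' "$sample" | ids_with_awk
```

The peek shows "<id" as record 3 and "15</id" as record 4, which are exactly the start and end of the range; the function then prints <id>15</id><id>16</id>.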

score 0 · Answer 5 · answered Mar 15 '17 at 06:14


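With gawk it can be a bit simpler. RS is set to a regular expression that matches a whole <id> element, and gawk stores the text that matched RS in RT, so printing RT for each record prints each element on its own line: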

awk '{print RT}' RS='<id>[^>]+>'

answered Mar 15 '17 at 06:14

grail


While this code snippet may solve the question, [including an explanation](http://meta.stackexchange.com/questions/114762/explaining-entirely-code-based-answers) really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. – DimaSan Mar 15 '17 at 11:16

score -1 · Answer 6 · answered Mar 15 '17 at 02:20


If you have gawk:

$ awk -v RS='</?id>' -v ORS='' '!(NR%2) {print pRT $0 RT} 
                                        {pRT=RT} 
                                 END    {printf "\n"}' file

Of course, you can hard-code the tags in the print statement and drop the RT variables.

answered Mar 15 '17 at 02:20

karakfa


which part of that isn't portable to other awks? – Greg Nisbet Mar 15 '17 at 02:26
multi-char `RS` is not supported by many `awk`s – karakfa Mar 15 '17 at 02:27
