0

The user who tagged this as a duplicate missed the forest for the trees, and their suggested duplicate does not answer this question sufficiently.

Here's a sample of what this string might be:

<mobile_device><general><id>15</id><device_name>iPad</device_name><name>Timmy</name><asset_tag/><id>16</id><device_name>iPhone</device_name><name>Spike</name><asset_tag/></general></mobile_device>

I want to parse this somehow to only end up with:

<id>15</id><id>16</id>

So, remove everything that's not contained between an opening id tag and a closing id tag. There could potentially be any number of tags (although a more realistic upper limit would be 60,000), but there will always be at least one pair.

I've been playing around with sed for this, but variations of this syntax haven't worked at all:

sed 's/.*\(<id>*</id>\).*//'

Many thanks in advance for any guidance!

iMatthewCM

6 Answers

1

Assuming your data is in input.xml, here's a way using xmllint and a simple XPath query:

$ cat input.xml | xmllint --xpath '//id' -
<id>15</id><id>16</id>

Here's something quick and dirty you can use to extract just the info between <id>...</id> if xmllint or a more appropriate tool isn't available.

$ cat input.xml | perl -pe 's/(<.?id.)/\n$1/g' | grep '^<id>' | sed -e 's/$/<\/id>/'

sed is fundamentally line-oriented, so it is hard to perform a substitution that includes a newline; tr, on the other hand, is fundamentally character-oriented. If we use perl to insert a newline before each <id> and </id> tag, we can then keep just the lines that begin with <id> and append the matching </id> again. Each <id> element ends up on its own line.

Using xmllint --format is also a good low-complexity way to convert XML into pretty-printed XML, which is easier to rip apart with line-oriented tools if you can't get the XPath query right.

$ cat input.xml | xmllint --format - | grep '^\s*<id>'
Greg Nisbet
1

With sed, it could look like this:

echo "$STRING" | sed 's/<\/id>.*<id>/<\/id><id>/;s/<mobile_device><general>//;s/<device_.*_device>//;'

The output will look like this:

<id>15</id><id>16</id>

How it works:
Everything between </id> and <id> gets removed via sed 's/<\/id>.*<id>/<\/id><id>/'.

Then <mobile_device> and <general> get removed via sed 's/<mobile_device><general>//'.

Last but not least, everything from <device_name to mobile_device> gets removed via sed 's/<device_.*_device>//'.
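The three substitutions can be tried out separately as below. This is a hedged sketch: the function name strip_non_ids and the three-element input (id 17, "Mac", "Rex") are invented for illustration. It also shows one thing to watch for: .* is greedy, so with more than two <id> elements the first substitution swallows the ones in the middle.

```shell
strip_non_ids() {
  # Same three substitutions as above, one per -e for readability.
  sed -e 's/<\/id>.*<id>/<\/id><id>/' \
      -e 's/<mobile_device><general>//' \
      -e 's/<device_.*_device>//'
}

# Two <id> elements, as in the question.
two='<mobile_device><general><id>15</id><device_name>iPad</device_name><name>Timmy</name><asset_tag/><id>16</id><device_name>iPhone</device_name><name>Spike</name><asset_tag/></general></mobile_device>'

# Hypothetical input with a third element, to show the greedy match.
three='<mobile_device><general><id>15</id><device_name>iPad</device_name><name>Timmy</name><asset_tag/><id>16</id><device_name>iPhone</device_name><name>Spike</name><asset_tag/><id>17</id><device_name>Mac</device_name><name>Rex</name><asset_tag/></general></mobile_device>'

printf '%s\n' "$two"   | strip_non_ids   # both ids survive
printf '%s\n' "$three" | strip_non_ids   # id 16 is lost to the greedy .*
```

For inputs with many <id> elements, one of the per-element approaches in the other answers is likely the safer choice.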

Hope this helps.

Mario
  • Hey @suleiman thanks for that answer, with some easy modification this is going to work great :) I didn't list the entire XML file I'm parsing through, but I'll just add all of the extra tags on the back end. Thanks! – iMatthewCM Mar 20 '17 at 16:58
  • You're welcome. Glad I could help. – Mario Mar 20 '17 at 17:18
0

Your sed string looks like it is close to working. Here are some adjustments:

sed 's=.*\(<id>.*</id>\).*=\1='
  • You need to pick a delimiting character that does not appear in the command expression. / is used in the closing </id>, so I used = instead.

  • The * modifies the immediately preceding regular expression to mean "0 or more". You had it following a >, which means "0 or more closing brackets". The . represents any single character and is what you really should use before the *, so the parenthesized expression now matches an entire <id> field.

  • Finally, the \1 indicates where you want the result of the first parenthesized subexpression to be placed in the result string.

This has some limitations for a general solution, but if you know there is only one ID field per line, it should serve.
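To illustrate the one-field-per-line limitation, here is a small sketch. The function name ids_per_line and both inputs are invented for illustration, and the -n option with the p flag is an addition to the command above so that lines without an <id> field are not echoed back.

```shell
ids_per_line() {
  # -n suppresses automatic printing; the p flag prints only the
  # lines on which the substitution actually happened.
  sed -n 's=.*\(<id>.*</id>\).*=\1=p'
}

# One <id> field per line: every field is extracted.
printf '%s\n' '<general>' '  <id>15</id>' '  <name>Timmy</name>' \
              '  <id>16</id>' '</general>' | ids_per_line

# Several fields on one line: the leading .* is greedy, so only
# the last field on the line is kept.
one_line='<general><id>15</id><name>Timmy</name><id>16</id></general>'
printf '%s\n' "$one_line" | ids_per_line
```

The first call should print both fields, one per line, while the second should print only <id>16</id>.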

Greg Tarsa
0

Another approach, in awk: set both RS and ORS to > and print the records between the markers <id and </id:

$ awk 'BEGIN{RS=ORS=">"} /<id/,/<\/id/' file
<id>15</id><id>16</id>$

As ORS is > you need to add the final newline manually with printf:

$ awk 'BEGIN{RS=ORS=">"} /<id/,/<\/id/; END{printf "\n"}' file
<id>15</id><id>16</id>
$
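To see why the range pattern works, it can help to look at the records awk produces once RS is >. The sketch below is illustrative only: the function names show_records and between_ids and the shortened input are assumptions, not part of the answer.

```shell
# Number each record awk sees when ">" is the record separator.
show_records() {
  awk 'BEGIN { RS = ">" } { print NR ": " $0 }'
}

# The range pattern from above: it starts at a record containing "<id"
# and stops at the next record containing "</id".
between_ids() {
  awk 'BEGIN { RS = ORS = ">" } /<id/,/<\/id/; END { printf "\n" }'
}

short='<general><id>15</id><name>Timmy</name></general>'
printf '%s' "$short" | show_records
printf '%s' "$short" | between_ids
```

Record 2 is "<id" and record 3 is "15</id", so the range covers exactly those two records, and ORS puts the > back after each of them.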
James Brown
0

gawk can be a bit simpler:

awk '{print RT}' RS='<id>[^>]+>'
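A hedged sketch of how this works: in gawk, RS may be a regular expression, and RT holds the text that matched it at the end of each record. With RS matching a whole <id>...</id> element, the elements themselves become the separators, so printing RT prints them. The function name ids_from_rt, the RT != "" guard (the last record has no separator after it) and the single-line output are additions for illustration.

```shell
sample='<mobile_device><general><id>15</id><device_name>iPad</device_name><name>Timmy</name><asset_tag/><id>16</id><device_name>iPhone</device_name><name>Spike</name><asset_tag/></general></mobile_device>'

ids_from_rt() {
  # RT is gawk-specific: the text that matched RS after the current record.
  # Skip the final record, whose RT is empty, and end with one newline.
  gawk -v RS='<id>[^>]+>' 'RT != "" { printf "%s", RT } END { print "" }'
}

if command -v gawk >/dev/null 2>&1; then
  printf '%s\n' "$sample" | ids_from_rt
else
  echo 'gawk not found; RT is not available in other awks' >&2
fi
```

With gawk installed, this should print <id>15</id><id>16</id> on one line.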
grail
  • While this code snippet may solve the question, [including an explanation](http://meta.stackexchange.com/questions/114762/explaining-entirely-code-based-answers) really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. – DimaSan Mar 15 '17 at 11:16
-1

If you have gawk:

$ awk -v RS='</?id>' -v ORS='' '!(NR%2) {print pRT $0 RT} 
                                        {pRT=RT} 
                                 END    {printf "\n"}' file

Of course, you can hard-code the tags in the print statements and remove the RTs.
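A minimal sketch of the hard-coded variant, with assumptions stated: the function name ids_hardcoded is invented, and a regular expression as RS is not required by POSIX, although gawk and mawk accept it. Splitting on either tag means the odd-numbered records are the text outside the id elements and the even-numbered records are the id values.

```shell
sample='<mobile_device><general><id>15</id><device_name>iPad</device_name><name>Timmy</name><asset_tag/><id>16</id><device_name>iPhone</device_name><name>Spike</name><asset_tag/></general></mobile_device>'

ids_hardcoded() {
  awk -v RS='</?id>' -v ORS='' '
    !(NR % 2) { print "<id>" $0 "</id>" }  # even records are the id values
    END       { printf "\n" }              # ORS is empty, so add the newline
  '
}

printf '%s\n' "$sample" | ids_hardcoded
```

Because RT is no longer used, this form does not depend on gawk-only variables, only on regex support in RS.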

karakfa