1

I've an xml file and am searching for a string in this file. Once (and if) the string is found I need to be able to search back to the position of another string and output the data.

ie:

<xml>
<packet>
 <proto>
 <field show="bob">
 </proto>
</packet>
<packet>
 <proto>
 <field show="rumpelstiltskin">
 </proto>
</packet>
<packet>
 <proto>
 <field show="peter">
 </proto>
</packet>

My input would be known:

show="rumpelstiltskin" 

and

<packet>

I need to get the following result (which is basically the second block);

<packet>
<proto>
<field show="rumpelstiltskin">
</proto>
</packet>

or

<packet>
<proto>
<field show="rumpelstiltskin">

There are thousands of these packets (it is a Wireshark PDML conversion), show="rumpelstiltskin" can occur anywhere in the file, and the section can be of any arbitrary size.

I've done this before and am pretty sure it's possible in an awk or sed oneliner.. any help appreciated!

The HCD

5 Answers

2

You could do this with grep

grep -B5 'show="rumpelstiltskin"' file | grep 'otherstring'

Adjust -B5 to the number of preceding lines you need in order to reach the other string you are looking for.

Geoffrey
  • yea, was thinking of this.. even tac using awk but it's not exact in some cases where the lines are highly variable. basically it's a converted wireshark trace, and I just need to get the framenumber of a particular frame where a string occurs. – The HCD Nov 16 '16 at 18:14
2

You need to treat your XML as XML and use an appropriate tool. For example, modifying your XML slightly to make it valid:

<xml>
  <packet>
    <proto>
      <field show="bob"/>
    </proto>
  </packet>
  <packet>
    <proto>
      <field show="rumpelstiltskin"/>
    </proto>
  </packet>
  <packet>
    <proto>
      <field show="peter"/>
    </proto>
  </packet>
</xml>

You could use xmllint like this:

xmllint --xpath '//packet[proto/field/@show="rumpelstiltskin"]' file.xml

This matches and prints every <packet> element that contains a <field show="rumpelstiltskin"> within a <proto> element.

If you don't want to specify the complete hierarchy, you can use something like this instead:

xmllint --xpath '//packet[descendant::field[@show="rumpelstiltskin"]]' file.xml
Tom Fenech
  • Nice. Another XML processing tool: `xmlstarlet sel -t -c '//packet[descendant::field[@show="'"$show"'"]]' file.xml` – glenn jackman Nov 16 '16 at 18:41
  • definitely the way I'd like to go; however, I got this error with xmllint, not sure if it likes large files.. # xmllint --xpath '//packet[descendant::field[@show="12192620160920141757196"]]' file.pdml Unknown option --xpath [code] # xmllint '//packet[descendant::field[@show="12192620160920141757196"]]' file.pdml warning: failed to load external entity "/packet[descendant::field[@show="12192620160920141757196"]]" Killed [/code] – The HCD Nov 16 '16 at 18:42
  • Hmm, that error looks more like the `--xpath` option isn't supported, which sounds strange. Not sure if it's related but it looks like you have `""` in your command. – Tom Fenech Nov 16 '16 at 18:48
  • yea, I guess it's an old version. Will test it later, thanks! `# xmllint --version xmllint: using libxml version 20706 compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib # cat /etc/redhat-release Red Hat Enterprise Linux Server release 6.6 (Santiago)` yea, corrected the "" also... still same prob. – The HCD Nov 16 '16 at 18:53
  • @TheHCD looks like [the issue is covered here](http://stackoverflow.com/q/11975862/2088135) – Tom Fenech Nov 16 '16 at 19:09
1

So ... you COULD hack something together that would do basic parsing of your file as a text file...

awk -v txt="rumpel" '$0=="<packet>"{s=$0; found=0; next} $0~txt{found=1} {s=s RS $0} $0=="</packet>" && found {print s}' inp.xml

Broken out into pieces for easier explanation, this does the following:

  • -v txt="rumpel" - sets a variable for use within the script. Note that this will be evaluated as a regex in this example, but you could use index() if you prefer to search for it as a string.
  • $0=="<packet>"{s=$0; found=0; next} - If we find the start of a packet, reset our storage variable (s) and flag (found).
  • $0~txt{found=1} - If we find the text we're looking for, set a flag.
  • {s=s RS $0} - Append the current line to a variable, and
  • $0=="</packet>" && found {print s} - if we're at the end of a packet and the string was found inside it, print the stored packet.
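The index() alternative mentioned in the first bullet can be sketched as follows. This is only an illustration under the same assumption as the one-liner (each <packet> and </packet> tag sits alone on its line with no indentation); the function name packets_containing and the temporary sample file are made up for the example.

```shell
#!/bin/sh
# Hypothetical helper (not part of the answer): print every <packet> block
# in file $2 that contains the literal string $1.
packets_containing() {
    awk -v txt="$1" '
        # Start of a packet: reset the buffer and the flag.
        $0 == "<packet>" { s = $0; found = 0; next }
        # Literal substring test; regex metacharacters in txt have no special meaning.
        index($0, txt) { found = 1 }
        # Append the current line to the buffer.
        { s = s RS $0 }
        # End of a packet that contained the string: print the whole block.
        $0 == "</packet>" && found { print s }
    ' "$2"
}

# Small sample in the same shape as the question's input.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<packet>
 <proto>
 <field show="bob">
 </proto>
</packet>
<packet>
 <proto>
 <field show="rumpelstiltskin">
 </proto>
</packet>
EOF

packets_containing 'show="rumpelstiltskin"' "$sample"
rm -f "$sample"
```

With index(), a search string such as show="1.2" only matches that exact text, whereas the regex form would also accept show="1x2" because the dot matches any character.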

A better approach would likely be to interpret the XML using something that understands XML natively, but that isn't possible with just sed and awk.

ghoti
  • sweet! it works. However, wondering how to stop it after finding the first occurrence of the search txt string rather than beat on through a 200M file.. And, yea, I've tried with xmllint, but I think the wireshark pdml output format is not compatible with it... (well at least I didn't get it working) – The HCD Nov 16 '16 at 18:36
  • Yes, well, the example you posted was indeed invalid XML. The `<field>` was not closed. Should be `<field/>`. Also, no closing `</xml>`. An easy way to stop after the first found record would be to add `;exit` after the `print s`. Or, if you're processing multiple files on one command line and your awk supports it, `;nextfile`. – ghoti Nov 17 '16 at 04:04
1

If your input really is that simple, all you need is:

$ awk '/<packet>/{buf=""} {buf=buf $0 RS} /rumpelstiltskin/{printf "%s",buf}' file
<packet>
 <proto>
 <field show="rumpelstiltskin">

or if you prefer:

$ awk '/<packet>/{buf="";f=0} {buf=buf $0 RS} /rumpelstiltskin/{f=1} f&&/<\/packet>/{printf "%s",buf}' file
<packet>
 <proto>
 <field show="rumpelstiltskin">
 </proto>
</packet>

and if you want to stop reading the input file after the first print then just add ;exit after it so printf "%s",buf becomes printf "%s",buf; exit.
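As a rough sketch, the second command with the early exit added could be wrapped like this. The function name first_packet_matching and passing the pattern in through -v re are assumptions made for the example; the answer itself hard-codes /rumpelstiltskin/.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of the answer): print the first <packet> block
# in file $2 that has a line matching the regex $1, then stop reading.
first_packet_matching() {
    awk -v re="$1" '
        # A new packet starts: clear the buffer and the "seen" flag.
        /<packet>/ { buf = ""; f = 0 }
        # Collect every line of the current packet.
        { buf = buf $0 RS }
        # Remember that the pattern occurred in this packet.
        $0 ~ re { f = 1 }
        # At the closing tag of a matching packet: print it and stop immediately.
        f && /<\/packet>/ { printf "%s", buf; exit }
    ' "$2"
}
```

Called as first_packet_matching 'rumpelstiltskin' file, awk stops at the first matching </packet> instead of scanning the rest of a large capture.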

Ed Morton
0

This might work for you (GNU sed):

sed '/<packet>/h;//!H;/rumpelstiltskin/!d;x;q' file

This stores the required strings in the hold space, prints them out and quits.
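For readers less used to sed, here is the same one-liner spread over several lines with a comment per command. It is meant as an annotated sketch, not a different solution; the function name first_block_sed is made up for the example.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of the answer): the one-liner above, one command per line.
first_block_sed() {
    sed '
        # A line containing <packet> starts a new block: overwrite the hold space with it.
        /<packet>/h
        # The empty regex reuses the last regex (<packet>); every other line is appended to the hold space.
        //!H
        # No search string on this line: delete it and start the next cycle.
        /rumpelstiltskin/!d
        # Search string found: swap the collected block into the pattern space.
        x
        # Quit; sed prints the pattern space before exiting.
        q
    ' "$1"
}
```

Because q ends the run at the first hit, the output runs from the most recent <packet> line down to the matching line, and the rest of the file is never read.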

However, to be sure that both strings exist and that the first one precedes the second within the collected block:

sed '/<packet>/h;//!H;/rumpelstiltskin/!d;x;/<packet>.*rumpelstiltskin/!d;q' file
potong