
I typically work with large XML files, and generally do word counts via grep to confirm certain statistics.

For example, to make sure I have at least five instances of widget in a single XML file, I run:

cat test.xml | grep -ic widget

Additionally, I like to log the lines that widget appears on, i.e.:

cat test.xml | grep -i widget > ~/log.txt
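
One caveat on the count: grep -c reports the number of matching lines, not the number of matches, so a line containing widget twice is counted once. A minimal sketch of counting every occurrence instead, assuming a grep that supports -o (GNU and BSD grep both do); the function name count_widgets is made up for illustration:

```shell
# Sketch only: count every occurrence of "widget", ignoring case.
# grep -o prints each match on its own line, so wc -l counts matches.
count_widgets() {
    grep -io 'widget' "$1" | wc -l | tr -d ' '   # tr strips padding some wc versions add
}
```

Usage would be count_widgets test.xml.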

However, the key information I really need is the block of XML code that widget appears in. An example file may look like:

<test> blah blah
  blah blah blah
  widget
  blah blah blah
</test>

<formula>
  blah
  <details> 
    widget
  </details>
</formula>

I am trying to get the following output from the sample text above:

<test>widget</test>

<formula>widget</formula>

Effectively, for each occurrence of the arbitrary string widget, I'm trying to get a single line consisting of the highest-level markup tags that enclose it.

Does anyone have any suggestions for implementing this as a command-line one-liner?

Thank you.

Cloud
    Take a look at [this post](http://stackoverflow.com/questions/2222150/extraction-of-data-from-a-simple-xml-file). Perhaps you get some idea. – mtk Jul 20 '12 at 23:19

4 Answers


A non-elegant way using both sed and awk:

sed -ne '/[Ww][Ii][Dd][Gg][Ee][Tt]/,/^<\// {//p}' file.txt | awk 'NR%2==1 { sub(/^[ \t]+/, ""); search = $0 } NR%2==0 { end = $0; sub(/^<\//, "<"); printf "%s%s%s\n", $0, search, end }'

Results:

<test>widget</test>
<formula>widget</formula>

Explanation:

## The sed pipe:

sed -ne '/[Ww][Ii][Dd][Gg][Ee][Tt]/,/^<\// {//p}'
## This finds the widget pattern, ignoring case, then finds the last, 
## highest level markup tag (these must match the start of the line)
## Ultimately, this prints two lines for each pattern match

## Now the awk pipe:

NR%2==1 { sub(/^[ \t]+/, ""); search = $0 }
## This takes the first line (the widget pattern) and removes leading
## whitespace, saving the pattern in 'search'

NR%2==0 { end = $0; sub(/^<\//, "<"); printf "%s%s%s\n", $0, search, end }
## This finds the next line (which is even), and stores the markup tag in 'end'
## We then remove the slash from this tag and print it, the widget pattern, and
## the saved markup tag

HTH

Steve
 sed -nr '/^(<[^>]*>).*/{s//\1/;h};/widget/{g;p}' test.xml

prints

<test>
<formula>

A sed-only one-liner would be more complex if it had to print the exact format you want.

EDIT:
In GNU sed you can use /widget/I instead of /widget/ to match widget case-insensitively; otherwise use [Ww] and so on for every letter, as in the other answer.

nshy

This might work for you (GNU sed):

sed '/^<[^/]/!d;:a;/^<\([^>]*>\).*<\/\1/!{$!N;ba};/^<\([^>]*>\).*\(widget\).*<\/\1/s//<\1\2<\/\1/p;d' file
potong

This needs gawk, because it uses a regular expression as RS:

BEGIN {
    # make a stream of words
    RS="(\n| )"
}

# match </tag>
/<\// {
    s--
    next
}

# match <tag>
/</ {
    if (!s) {
        tag=substr($0, 2)
    }
    s++
}

$0=="widget" {
    print "<" tag $0 "</" tag
}
slitvinov