I need to search a directory that has hundreds or thousands of files, each containing XML with one or more instances of a specific string (begin/end tag with data). I can get all the instances of the string by doing

grep -ho '<mytagname>..............<\/mytagname>' /home/xyzzy/mydata/*.XML > /home/mydata/tagvalues.txt

then a few sed commands to strip off the tags, so I wind up with a file just containing a list of values:

  value001
  value002
  value003

(etc)

Ideally, though, I'd like each line of the file to also include the filename, so I can import it into a database for analysis.

So my result would be something like this

fileAAA value001
fileAAA value002
fileAAA value003
fileBBB value004

Exact formatting of the above is flexible - could have spaces or other separator, it could even still include the begin/end tags.

The closest I've been able to get is with grep -o

fileAAA:value001
value002
value003
fileBBB:value004

A Perl one-liner would seem ideal, but I'm new enough to Perl that I have no clue how to begin.

tshepang
JOATMON

2 Answers

4

Could be done using a one-liner like so:

perl -lne 'print "$ARGV $1" if /<mytagname>(.*?)<\/mytagname>/' *.xml

However, I'd strongly recommend that you use an actual XML parser like XML::Twig or XML::LibXML:

use strict;
use warnings;

use XML::LibXML;

for my $file (</home/xyzzy/mydata/*.XML>) {
    my $doc = XML::LibXML->load_xml(location => $file);
    for my $node ($doc->findnodes("//mytagname")) {
        print "$file " . $node->textContent() . "\n";
    }
}
Miller
0

What about awk?

awk -F'</?mytagname>' '$2 {print FILENAME,$2}' /home/xyzzy/mydata/*.XML

Explanation:

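  • -F'</?mytagname>' - sets the field separator to a regex that matches either the opening or the closing tag; it is quoted so the shell passes it to awk unchanged
  • $2 - the condition: true when the second field (the text between the tags) is non-empty
  • {print FILENAME,$2} - prints the filename, a space, and the value of the second field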
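
Note that $2 is only the first value on a line. As a rough sketch, assuming the tags are never nested and a value never spans a line break, looping over the even-numbered fields picks up every occurrence. The scratch directory and the sample files fileAAA.XML and fileBBB.XML below are made up for illustration:

```shell
# Hypothetical sample data in a scratch directory (not from the question).
tmpdir=$(mktemp -d)
printf '<r><mytagname>value001</mytagname><mytagname>value002</mytagname></r>\n' > "$tmpdir/fileAAA.XML"
printf '<r>\n<mytagname>value003</mytagname>\n</r>\n' > "$tmpdir/fileBBB.XML"

# With both tags acting as separators, the values land in fields 2, 4, 6, ...
# so step through the even-numbered fields instead of printing only $2.
awk_result=$(cd "$tmpdir" && awk -F'</?mytagname>' '{ for (i = 2; i <= NF; i += 2) print FILENAME, $i }' *.XML)
printf '%s\n' "$awk_result"

rm -rf "$tmpdir"
```

Running awk from inside the directory keeps FILENAME short; with a full path in the glob, the full path is printed instead.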
anttix
  • Thanks - both of those (the perl and the awk) work to some extent: they appear to only take the first occurrence of the string in the file. When I use my original grep, I'm getting many thousands of hits (even after I sort and take unique values). When I use either of the commands as is, I get about 7500 hits, which is the number of files in the directory. – JOATMON Apr 01 '14 at 21:03
  • Aha - did a little digging and found the answer in another posting [here](http://stackoverflow.com/questions/19031552/perl-one-liner-to-match-all-occurrences-of-regex) - so I changed the perl command to use while (/<mytagname>(.*?)<\/mytagname>/g) instead of the if - and that's giving me a more believable number. – JOATMON Apr 01 '14 at 21:26