
I am trying to use AWK or Perl to count the records in an XML file that consists of a single very long line, sometimes several megabytes in size.

Most, if not all, examples I've seen assume nicely structured XML like

      <?xml version="1.0" encoding="UTF-8"?>
      <spendownrequest xmlns="http://www.foo.com/Adv/HR/SSt">
            <spenddowndata>
                            <employeeId>0002</employeeId>
                            <transactionId>103</transactionId>
                            <transactionType>T</transactionType>                            
            </spenddowndata>
            <spenddowndata>
                            <employeeId>0003</employeeId>
                            <transactionId>104</transactionId>
                            <transactionType>T</transactionType>
            </spenddowndata>
            <spenddowndata>
                            <employeeId>0004</employeeId>
                            <transactionId>105</transactionId>
                            <transactionType>T</transactionType>
            </spenddowndata>
      </spendownrequest>

with a newline after each element. My files look like this instead:

<?xml version="1.0" encoding="UTF-8"?><spendownrequest xmlns="http://www.foo.com/Adv/HR/SSt"> <spenddowndata><employeeId>0002</employeeId><transactionId>103</transactionId> <transactionType>T</transactionType></spenddowndata><spenddowndata><employeeId>0003</employeeId> <transactionId>104</transactionId><transactionType>T</transactionType></spenddowndata><spenddowndata> <employeeId>0005</employeeId><transactionId>105</transactionId><transactionType>T</transactionType> </spenddowndata></spendownrequest>

One long line with a single newline at the end.

I tried:

awk -F'[<|>]' '/spenddowndata/ {i++} { print i }' file.xml

but that prints 1.

How would I get the count for all 3 that are in this file?

Mike Laren
  • I would strongly urge you to use a programming language that has a real XML parser available. You could do this easily in Python or Ruby and the white space wouldn't matter. If you are unlucky and you have an object whose tags can appear inside the object, you won't be able to parse it with regular expressions. – steveha Dec 20 '14 at 00:19
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – beny23 Dec 20 '14 at 00:28
  • ... or use a tool that uses a real XML parser. There are command-line tools which use XPath or XQuery to retrieve data from XML files, for example. Yes, XML was designed so it can be manipulated as text if absolutely necessary -- one of the design principles was the concept of the DPH, or Desperate Perl Hacker, who didn't have anything better available and was forced to approach it that way -- but XML really is MUCH easier to handle if you use the mechanisms designed for the purpose. Especially if the file uses some of the less-obvious features of this syntax. Don't reinvent wheels! – keshlam Dec 20 '14 at 00:28
  • Since Perl is an option for you you can try using `XML::Simple` or `XML::Parser` in Perl. These modules may be already installed on your machine. – nwk Dec 21 '14 at 16:28
  • Yes, this works, thank you. Very useful for verifying the total record count. Now I want to parse out a specific value at each matched pattern's demarcation point, i.e. `12345`, where 12345 is the goal. The next piece is to change 12345 to a user-supplied value, say 56789. I think AWK will be the way to go. I'm trying to limit the memory I use to parse items out of the file: the files are 300-400MB in size, and I want to avoid loading the whole file into memory (Java XML parsing) and then walking through each element to find 12345. – Derik Jarne Jan 17 '15 at 00:02
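
On the follow-up about pulling one value out of each record without loading the whole file, here is a minimal sketch. It assumes the value sits in a plain element with no attributes, such as `employeeId` from the question's sample (the element name in the comment above was lost). The function name `extract_values` is hypothetical. Setting `RS` to `<` makes awk read one tag at a time, so memory use should stay small no matter how long the line is.

```shell
# Print the text content of every <TAG> element, one value per line.
# Usage: extract_values FILE TAG   (hypothetical helper, not from the thread)
# RS="<" splits the input at each tag start; FS=">" separates the tag
# name ($1) from the text that follows it ($2).
# Assumption: the element has no attributes and contains no entities or CDATA.
extract_values() {
    awk -v tag="$2" 'BEGIN { RS = "<"; FS = ">" } $1 == tag { print $2 }' "$1"
}
```

For example, `extract_values file.xml employeeId` would list the employee IDs in document order. This is still text matching, not XML parsing, so the caveats raised in the comments above apply.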

4 Answers

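awk 'BEGIN { RS = "<" } /^spenddowndata[ >]/ { count++ } END { print count + 0 }' file.xml

This should work: setting RS to "<" makes every tag start a new record, so the count no longer depends on where the newlines are.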

AlpineCoder

With grep:

grep -o '</spenddowndata>' f | wc -l

With awk (gawk specifically, since a multi-character RS is a gawk extension; thank you @EdMorton):

gawk -v RS='</spenddowndata>' 'END{print NR-1}' f   

With perl:

perl -n0E 's!</spenddowndata>!$i++!ge; say $i+0' f
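
The `-o` flag is what makes the grep version work. grep operates line by line, so `grep -c` would report the number of matching lines, which is 1 for a single-line file, just like the awk attempt in the question. A small sketch contrasting the two, using a throwaway sample file created only for illustration:

```shell
# Build a one-line sample with three records (hypothetical test data).
sample=$(mktemp)
printf '%s\n' '<r><spenddowndata>a</spenddowndata><spenddowndata>b</spenddowndata><spenddowndata>c</spenddowndata></r>' > "$sample"

# -c counts matching lines: the whole file is a single line.
lines=$(grep -c '</spenddowndata>' "$sample")

# -o prints each match on its own line, so wc -l counts occurrences.
matches=$(grep -o '</spenddowndata>' "$sample" | wc -l)

echo "matching lines: $lines, occurrences: $matches"
rm -f "$sample"
```

Note that `-o` is not required by POSIX, although both GNU and BSD grep provide it.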
JJoao
  • you should state that's gawk-specific for the multi-char RS and you need to make the END `print (NR?NR-1:0)` or you'll output `-1` for empty files. – Ed Morton Dec 20 '14 at 03:31
  • > **gawk-specific** : thank you; edited ; > **empty files**: thank you but XML can not be an empty file; I prefer my current version. – JJoao Dec 20 '14 at 11:53
  • I see files all the time that end in `.xml` and are empty, usually generated from some tool. Not sure why you'd say they cannot be empty; YMMV with that idea. – Ed Morton Dec 20 '14 at 16:07
  • This is clearly a specification problem: we don't have the full details of the corners of the problem, so we have to choose a specification. XML cannot be empty. IMHO, if we have tools that generate Maybe-XML (empty or XML), I prefer to distinguish XML with zero spenddowndata items (returning 0) from an empty file (returning -1, for situations where the tool wants to report errors). I obviously understand and accept your problem specification. – JJoao Dec 20 '14 at 16:54
  • shell tools do not print `-1` when errors occur, they exit with status non-zero. I could definitely see tweaking a tool to exit with status `-1` if the file was empty and in your domain it's invalid to have an empty file, but you would not get any standard UNIX tool printing `-1` for an empty or otherwise invalid file, it's just wrong to do that. – Ed Morton Dec 20 '14 at 16:58
  • Thank you. I completely agree that exit status -1 is the best in this situation. I confess that my view of the Maybe-monad and unix tools is more liberal. – JJoao Dec 20 '14 at 17:30
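
The exit-status convention discussed in these comments could be sketched as a small wrapper: print the count on stdout, and signal an empty file through a non-zero status rather than a printed `-1`. The function name `count_records` and the choice of status 1 are assumptions made for illustration. The sketch uses the grep approach so that it does not depend on gawk.

```shell
# Print the number of </spenddowndata> closing tags in FILE.
# Returns non-zero, with a message on stderr, if FILE is missing or empty.
# Usage: count_records FILE   (hypothetical helper, not from the thread)
count_records() {
    # -s is true only if the file exists and has a size greater than zero.
    if [ ! -s "$1" ]; then
        echo "count_records: $1 is missing or empty" >&2
        return 1
    fi
    # grep -o emits one line per match; wc -l counts those lines.
    # tr strips the padding that some wc implementations add.
    grep -o '</spenddowndata>' "$1" | wc -l | tr -d ' '
}
```

A caller can then tell "valid file with zero records" (prints 0, status 0) apart from "empty file" (prints nothing, non-zero status), which is the distinction both commenters wanted.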

You can also store the program in a file, say pat.awk:

BEGIN{
    FPAT = "(<spenddowndata>)"
}

{
    print NF
}

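To display the count, run it with gawk (FPAT is a gawk extension). It prints one count per input line, so a single-line file yields a single number:

gawk -f pat.awk file.xml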
awk -F'</spenddowndata>' 'END{print (NF?NF-1:0)}' file

The ternary condition testing for NF is to avoid printing -1 for an empty file.
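
The one-liner looks at `NF` in the END block, so it reports on the last line read, which is fine for the single-line files in the question. A variant that sums over every line should also cope with pretty-printed input. This is a sketch; the function name `count_closing_tags` is hypothetical.

```shell
# Count </spenddowndata> closing tags across all lines of FILE.
# Usage: count_closing_tags FILE   (hypothetical helper, not from the thread)
# With the closing tag as field separator, a line holding k tags splits
# into k+1 fields. The NF guard skips empty lines, where NF is 0 and
# NF - 1 would wrongly subtract one.
count_closing_tags() {
    awk -F'</spenddowndata>' 'NF { n += NF - 1 } END { print n + 0 }' "$1"
}
```

Because the total accumulates in `n`, an empty file prints 0 without the ternary, and the script does not rely on `$0` still being set inside END.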

Ed Morton