Command line combine files at change in part of name and part of file

Question

I am on AIX with bash, and we cannot install additional software at this time, so I am limited to command-line batch processing and maybe custom Java programs. I have a ton of XML files in different directories. Here is what a subset may look like.

root_dir
   Pages
      PAGES_1.XML
   Queries
      QUERIES_1.XML
      QUERIES_2.XML
      QUERIES_3.XML

I have put together a script that gets me almost everything I want, but I don't know how to do the last piece of the puzzle, if it is possible at all in a batch script. I create a new directory under root, copy all of the XML files into it, and then rename them to remove any spaces in the name and zero-pad the integer so they sort correctly in alphabetical / numerical order. The new output looks like this:

copy_dir
    PAGES_001.XML
    QUERIES_001.XML
    QUERIES_002.XML
    QUERIES_003.XML

I am almost there. The last piece is that these separate XML files need to be combined into one XML file for each type, so HISTORY_001.XML to HISTORY_099.XML need to be combined, then QUERIES_001.XML to QUERIES_099.XML need to be combined, but for every file after the first only the content after a specific point in the file should be kept. I have regexes that select the parts I want; now I just need to figure out how to loop through each file subset. Maybe I jumped the gun and should do it before moving them, but assuming they are all in one directory, how can I go about this?

Here is an example of the data. All of the XML files carry these same types of information.

Pages

<?xml version="1.0"?>
<project name="">
  <rundate></rundate>
  <object_type code="false" firstitem="1" id="5" items="65" name="Pages">
    <primary_key>Page Name</primary_key>
    <secondary_key>Language Code</secondary_key>
    <secondary_key>Page Field ID</secondary_key>
    <secondary_key>Field Type</secondary_key>
    <secondary_key>Record (Table) Name</secondary_key>
    <secondary_key>Field Name</secondary_key>
    <item id="ACCTG_TEMPLATE_AP">
      ...
    </item>
    <item id="ACCTG_TEMPLATE_AR">
      ...
    </item>
  </object_type>
</project>

Queries

<?xml version="1.0"?>
<project name="">
  <rundate></rundate>
  <object_type code="false" firstitem="1" id="10" items="46" name="Queries">
    <primary_key>Query Name</primary_key>
    <primary_key>User ID</primary_key>
    <item id="1099G_ALL_SHORT. ">
      ...
    </item>
    <item id="1099G_ALL_VOUCHERS. ">
      ...
    </item>
  </object_type>
</project>

Regex to pull out header

(?:(?!(^\s*<item)).)*

Regex to pull out detail

^(\s*<item id=).*(</item>)

Regex to pull out footer

^(\s*</object_type).*

So I am assuming that what I want to do is keep a counter and loop through each object type's XML subset. On the first file of a group, pull the header and detail and write them to a new summary file; for all other files, append only the detail; and on the last file, or when the object type changes, write the footer as well. Do you think this is possible using a bash script?

What tool are you using for 'Regex to pull out (header|detail|footer)'? Do you know about bash's ability to list and process files in a for loop? `for file in HISTORY_*.xml; do ./myXMLprocess $file ; done > oneBigXMLFile` for example? +1 for a good problem description AND the nutty-squirrel! Good luck. — shellter, Sep 24 '13 at 18:31
Oh, dear. You appear to be trying to use regular expressions to parse XML. That's a path that leads to [fairly unpleasant places](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). Do you really not have any scripting languages with real XML libraries available? — Brian Campbell, Sep 24 '13 at 20:19
You can't install "stuff", but you can run arbitrary Java code? That's ... interesting. — tripleee, Sep 24 '13 at 20:46
Is the size of the header and footer (number of lines) the same in each file type? If so could use `tail` piped together to get just the middle lines. — beroe, Sep 24 '13 at 21:01
No, I cannot install new software to the server, but yes I can create and run arbitrary Java code. It's a policy issue of definitions that gets in my way a lot. I wanted them to install curl but that was a no-go at the administration level. Yes, the size of the header and footer is the same in each object type XML group, though different between groups. I had also thought of doing a regex search to find the tag positions and then working it out that way. — onnonnononnon, Sep 25 '13 at 13:07

jthill · Answer 1 · 2013-09-24T23:34:24.687

This will spit out commands that do the sorting and classification; just provide functions/scripts/whatever named first, middle, last, and only that do the right thing for files in those positions within a group. The first and middle commands have to handle empty argument lists: middle for two-element groups, and first for groups without a 1-sequenced file.

Edit: I broke the seds out to one command per line to handle seds that don't like semicolons

Run this as e.g. sh this.sh *_*.*

#!/bin/sh
#
# spit commands to sort, group, and classify argument filenames 
# sorting by the number between `_` and `.` in their names and 
# grouping by the text before the _.
{
# Everything through the sort would just be `ls -v` on GNU/anything...
for f; do
    pfx=${f%%_*}
    tail=${f#*_}
    sortable=`printf %s_%03d.%s $pfx ${tail%.*} ${tail##*.}`
    [ $f != $sortable ] \
      && echo  mv $f $sortable >&2
    echo $sortable
done \
| sort \
| sed '
    /_0*1\./! H
    // {
       x
       1! {
          y/\n/ /
          p
       }
    }
    $!d
    x
    y/\n/ /
' \
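| sed '
    # a backslash-newline is the portable way to emit a newline in the
    # replacement; \n there is not understood by every sed
    s/\([^ ]*\)\(.*\) \(.*\)/first \1\
middle\2\
last \3/
    t
    s/^/only /
'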
} 2>&1

The first of the above seds accumulates groups of one-per-line words, each group identified by its first line. The second classifies the groups and substitutes in the right commands. They're separate because the first sed involves a double-pump to handle a widow group, plus they're hairy enough as it is.
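As a rough illustration of the helpers the generated commands expect, here is a minimal sketch. Everything in it is an assumption rather than part of the script above: the function names `target`, `header`, `detail` and `footer` are hypothetical, the output name is simply the group prefix plus `.XML`, and each `<item ...>`, `</item>` and `</object_type>` tag is assumed to start its own line, as in the sample data. It uses only plain `sh` and `awk`.

```shell
# Hypothetical helpers for the generated first/middle/last/only commands.
# Assumes <item ...>, </item> and </object_type> each start their own line.

# Output file for a group member, e.g. QUERIES_001.XML -> QUERIES.XML
target() { printf '%s.XML\n' "${1%%_*}"; }

# Everything before the first <item> line
header() { awk '/^[ \t]*<item[ >]/ { exit } { print }' "$1"; }

# Every <item> ... </item> block
detail() {
    awk '/^[ \t]*<item[ >]/ { on = 1 }
         on                 { print }
         /^[ \t]*<\/item>/  { on = 0 }' "$1"
}

# Everything from </object_type> to the end of the file
footer() { awk '/^[ \t]*<\/object_type>/ { on = 1 } on { print }' "$1"; }

first() {
    [ $# -gt 0 ] || return 0            # group without a 1-sequenced file
    { header "$1"; detail "$1"; } > "$(target "$1")"
}

middle() {
    for f; do                           # loops zero times for an empty list
        detail "$f" >> "$(target "$f")"
    done
}

last() { { detail "$1"; footer "$1"; } >> "$(target "$1")"; }

only() { { header "$1"; detail "$1"; footer "$1"; } > "$(target "$1")"; }
```

With these definitions in a file, the command list printed by the script could be run by sourcing the definitions and then evaluating the generated lines in the same shell. Note that a group whose `first` list is empty ends up without a header in this sketch.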

score 0 · Answer 2 · answered Oct 29 '13 at 11:00

combine()
{
    # pull the header from 1st file: print lines until the first <item
    # (read -r keeps any backslashes in the data intact)
    while IFS= read -r && word=($REPLY) && [ "$word" != "<item" ]
    do  echo "$REPLY"
    done <"$1"

    # concat the detail from all files
    for file
    do  cmd=:
        while IFS= read -r && word=($REPLY)
        do  case $word in \<item) cmd=echo;; esac
            $cmd "$REPLY"
            case $word in \</item\>) cmd=:;; esac
        done <"$file"
    done

    # output the footer from the last file
    cmd=:
    while IFS= read -r && word=($REPLY)
    do  case $word in \</object_type\>) cmd=echo;; esac
        $cmd "$REPLY"
    done <"$file"
}

combine PAGES_???.XML >PAGES.XML
combine QUERIES_???.XML >QUERIES.XML
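
If there are many object types, the prefixes need not be hard-coded one call at a time. The following is a minimal sketch of that idea, not a tested replacement for the function above: `combine_awk` and `combine_all` are hypothetical names, the merge is done in a single `awk` pass instead of shell `read` loops, and it assumes zero-padded names of the form PREFIX_NNN.XML without spaces, with `<item ...>`, `</item>` and `</object_type>` each starting their own line.

```shell
# combine_awk FILE...: header of the first file, the <item> blocks of every
# file, and the footer of the last file, written to standard output.
combine_awk() {
    awk '
        FNR == 1                 { head = (NR == 1); on = 0; infoot = 0; foot = "" }
        /^[ \t]*<item[ >]/       { head = 0; on = 1 }
        /^[ \t]*<\/object_type>/ { head = 0; infoot = 1 }
        head || on               { print }
        infoot                   { foot = foot $0 "\n" }   # kept only for the last file
        /^[ \t]*<\/item>/        { on = 0 }
        END                      { printf "%s", foot }
    ' "$@"
}

# combine_all DIR: write one PREFIX.XML per group of PREFIX_NNN.XML in DIR.
combine_all() {
    (
        cd "$1" || exit 1
        for f in *_[0-9]*.XML; do
            [ -f "$f" ] && printf '%s\n' "${f%_*}"      # strip _NNN.XML
        done | sort -u |
        while IFS= read -r prefix; do
            # the glob expands in sorted order, which zero-padding makes numeric
            combine_awk "$prefix"_[0-9]*.XML > "$prefix.XML"
        done
    )
}
```

Calling `combine_all copy_dir` would then produce PAGES.XML, QUERIES.XML and so on inside copy_dir. The combined files contain no underscore followed by a digit, so a second run does not pick them up as input.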
