
I have a file that has lots of XML nodes:

<output>
<file name="user.java">
</file>

<file name="random.java">
<error line="52" column="3" severity="warning" message="User is not found." source="randomSource"/>
</file>
<output/>

Now I need to replace the source attribute in each error node with the name attribute of its enclosing file node, and write the result to a file. The output file should contain only the error rows:

<error line="52" column="3" severity="warning" message="User is not found." name="random.java"/>

Preferably, the name should be the first attribute:

<error name="random.java" line="52" column="3" severity="warning" message="User is not found." />

So the new file should contain only the error nodes, and I can only use default tools such as sed, awk, cut, etc.

I have only got as far as printing the error line but can't figure out how to do the above:

awk -vtag=file -vp=0 '{
if($0~("^<"tag)){p=1;next}
if($0~("^</"tag)){p=0;printf("\n");next}
if(p==1){$1=$1;printf("%s",$0)}
}' infile 
Stacky
  • [Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest using an XML/HTML parser (xmlstarlet, xmllint ...). A surgeon does not use a chainsaw to operate either. – Cyrus Aug 21 '21 at 07:53
  • Get permission, then. – Shawn Aug 21 '21 at 09:41
  • Please [edit] your question to show the expected output given your posted input, as different people answering are guessing at different output you might want. – Ed Morton Aug 22 '21 at 11:42

3 Answers

Try this simple awk program:

level == 0 && $0 ~ "<" tag ".*>" {
    print
    level++
    # get "name" attribute
    gsub(/^.*name="/, "")
    gsub(/".*$/, "")
    name = $0
    next
}
level == 1 && /<error.*>/ {
    # remove "source" attribute
    gsub(/ source="[^"]*"/, "")
    # put "name" attribute at the beginning of "error" tag
    gsub(/<error /, "<error name=\"" name "\" ")
    print
    next
}
level == 1 && $0 ~ "</" tag ">" {
    print
    level--
    next
}
{
    print
}

Called like this:

$ cat xmlerr.xml | awk -v tag="file" -f xmlerr.awk 
<output>
    <file name="user.java">
    </file>
    
    <file name="random.java">
    <error name="random.java" line="52" column="3" severity="warning" message="User is not found."/>
    </file>
</output>

To output only the error lines, remove the print commands you do not need, i.e. every print except the one in the error block.
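
As a rough sketch of what that trimmed version might look like, here is the same logic with only the error block printing. The sample input is inlined through a here-document and the `result` variable is just an illustration device, neither is part of the original script. It also swaps `gsub()` for `sub()` and works on a copy of the line, and it still assumes one tag per line as in the question.

```shell
# Sketch: print only the rewritten <error> lines.
result=$(awk -v tag="file" '
# opening <file ...> tag: remember its name attribute, print nothing
level == 0 && $0 ~ "<" tag ".*>" {
    level++
    name = $0
    sub(/^.*name="/, "", name)
    sub(/".*$/, "", name)
    next
}
# <error ...> inside a file node: drop source, put name first, print
level == 1 && /<error.*>/ {
    sub(/ source="[^"]*"/, "")
    sub(/<error /, "<error name=\"" name "\" ")
    print
    next
}
# closing </file> tag: leave the file node, print nothing
level == 1 && $0 ~ "</" tag ">" {
    level--
    next
}
' <<'EOF'
<output>
<file name="user.java">
</file>

<file name="random.java">
<error line="52" column="3" severity="warning" message="User is not found." source="randomSource"/>
</file>
</output>
EOF
)
printf '%s\n' "$result"
```

Redirect the final `printf` to a file if the error rows should end up on disk.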

ALTERNATIVE

If you want to suppress the "name" attribute in the opening "file" tag, the first block becomes:

level == 0 && $0 ~ "<" tag ".*>" {
    name = $0
    level++
    n = gsub(/^.*name="/, "", name)
    gsub(/".*$/, "", name)
    # if substitution done, remove "name" attribute in the original line before printing
    if (n > 0) {
        gsub(/ name="[^"]*"/, "")
    }
    print
    next
}

and the output:

<output>
    <file>
    </file>
    
    <file>
    <error name="random.java" line="52" column="3" severity="warning" message="User is not found."/>
    </file>
</output>
Arnaud Valmary
  • Thank you so much for your reply. It is almost there. Is it possible to update the code to print just the error node? Currently it prints the file node, and within it there are 2 error nodes: the old one plus the transformed one. – Stacky Aug 21 '21 at 08:49
  • That would fail given various values of `name`, e.g. `name="=my-name="` or `name="black&white"` or if the `tag` being passed in was a substring of some other field, e.g. `file` and `filetype` both existed, or if other tags that start with `error` existed, e.g. `errorHandling`, or if attributes like `groupname=` existed in a `file` tag. You're also using `gsub()` everywhere when in most cases you really want `sub()`. – Ed Morton Aug 21 '21 at 12:45
  • For a task like this you've got to make sure you have boundaries of some kind around the strings you're matching on, and use string rather than regexp operators with any input string which, among other things, means you can't read a string from the input and then use it in any regexp or regexp-replacement context, e.g. as the first or 2nd argument of `*sub()`, unless you sanitize all metachars first (e.g. see https://stackoverflow.com/q/29613304/1745001). – Ed Morton Aug 21 '21 at 13:19
  • Hello Stacky, what exactly do you want? To suppress the `<file>` + `</file>` node, or just to suppress the `name` attribute in the `file` tag? In the first case, remove the "print" commands. In the second case, look at the new ALTERNATIVE in my answer. – Arnaud Valmary Aug 21 '21 at 17:18
  • Thank you for explaining this. I marked the last one as solution as it worked as expected but upvoting all your comments and answer. – Stacky Aug 22 '21 at 23:22

Try this Perl solution:

$ cat stacky.txt
<output>
<file name="user.java">
</file>

<file name="random.java">
<error line="52" column="3" severity="warning" message="User is not found." source="randomSource"/>
</file>
<output/>
   
$ perl -ne  ' /<file (name=\S+)>/ and $x=$1; if(/<error/) { s/(\<error)(.*)(\bsource="[^"]+")(.+)/$1 $x $2 $4/g  ; print }  ' stacky.txt
<error name="random.java"  line="52" column="3" severity="warning" message="User is not found."  />
stack0114106

Assuming your input really is structured as you show in your example (i.e. no newlines within <...>s, only 1 set of <...>s per line, and all white space in each line is blank chars), this will work using any awk in any shell on every Unix box. It uses literal string operations with blanks as boundaries, so it'll work even if any regexp or backreference metachars exist in the text or if any of the target strings are substrings of other strings:

$ cat tst.awk
{ tag=$0; gsub(/^ *< *| .*$/,"",tag) }

(tag == "file") && match($0,/ name="[^"]+"/) {
    name = substr($0,RSTART+1,RLENGTH-1)
}

(tag == "error") && match($0,/ source="[^"]+"/) {
    $0 = substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH)
    match($0,/ *< *[^ ]+ /)
    $0 = substr($0,1,RLENGTH) name substr($0,RSTART+RLENGTH-1)
}

{ print }

$ awk -f tst.awk file
<output>
<file name="user.java">
</file>

<file name="random.java">
<error name="random.java" line="52" column="3" severity="warning" message="User is not found."/>
</file>
<output/>

or if you prefer to just replace the source= with name= in-situ:

$ cat tst.awk
{ tag=$0; gsub(/^ *< *| .*$/,"",tag) }

(tag == "file") && match($0,/ name="[^"]+"/) {
    name = substr($0,RSTART+1,RLENGTH-1)
}

(tag == "error") && match($0,/ source="[^"]+"/) {
    $0 = substr($0,1,RSTART) name substr($0,RSTART+RLENGTH)
}

{ print }

$ awk -f tst.awk file
<output>
<file name="user.java">
</file>

<file name="random.java">
<error line="52" column="3" severity="warning" message="User is not found." name="random.java"/>
</file>
<output/>

If you ONLY want the "error" line printed then in the above just change:

}

{ print }

to:

    print
}

so the print only happens inside the tag == "error" block.
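
Put together, the error-only form of the first script would look roughly like the sketch below. The sample input is inlined through a here-document and captured in a `result` variable purely so the snippet can be run as-is; both are illustration choices, and the same assumptions about the input layout apply.

```shell
# Sketch: first tst.awk with the print moved inside the error block.
result=$(awk '
# tag = first word of the line with leading blanks and "<" removed
{ tag=$0; gsub(/^ *< *| .*$/,"",tag) }

# remember the whole name="..." attribute of the current file node
(tag == "file") && match($0,/ name="[^"]+"/) {
    name = substr($0,RSTART+1,RLENGTH-1)
}

(tag == "error") && match($0,/ source="[^"]+"/) {
    # cut the source="..." attribute out of the line
    $0 = substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH)
    # insert the saved name attribute right after "<error "
    match($0,/ *< *[^ ]+ /)
    $0 = substr($0,1,RLENGTH) name substr($0,RSTART+RLENGTH-1)
    print
}
' <<'EOF'
<output>
<file name="user.java">
</file>

<file name="random.java">
<error line="52" column="3" severity="warning" message="User is not found." source="randomSource"/>
</file>
<output/>
EOF
)
printf '%s\n' "$result"
```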

Ed Morton