Print specific "columns" of text file that has a somewhat inconsistent format

Question

I'm trying to extract three columns from a text file that looks like this:

<Record type="HKQuantityTypeIdentifierHeartRate" sourceName="Michael’s Apple Watch" sourceVersion="6.2.5" device="&lt;&lt;HKDevice: 0x2877dc870&gt;, name:WHOOP 3A020013, manufacturer:WHOOP Inc., localIdentifier:80A56B86-0DEC-A6C3-7B22-077BD4BE4C8D&gt;" unit="count/min" creationDate="2020-05-30 07:26:39 -0400" startDate="2020-05-30 07:26:39 -0400" endDate="2020-05-30 07:26:39 -0400" value="72">
<Record type="HKQuantityTypeIdentifierHeartRate" sourceName="Wahoo" sourceVersion="3135" unit="count/min" creationDate="2020-05-30 07:37:05 -0400" startDate="2020-05-30 07:35:46 -0400" endDate="2020-05-30 07:37:01 -0400" value="83"/>

This is the information I'd like to extract:

sourceName, creationDate, value
"Michael’s Apple Watch", "2020-05-30 07:26:39", "72"
"Wahoo", "2020-05-30 07:37:05", "83"

So I basically need the source name, full creationDate and value in a comma-separated format.

The issue I'm having is that sourceName itself has multiple nested "fields" and creationDate has spaces. So my previous attempts using grep and awk all failed :)

Any help would be greatly appreciated.

[Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest to use an XML/HTML parser (xmlstarlet, xmllint ...). — Cyrus, May 30 '20 at 18:27

Ed Morton · Accepted Answer · 2020-05-31T03:19:01.247

Whenever you have "tag=value" data, it's best to first create an array indexed by tags (names); then you can test or print whatever you want in whatever order you want. Assuming your input is as regular as the sample you posted and you can't use an XML parser, here is a solution using GNU awk for the 3rd arg to match():

$ cat tst.awk
BEGIN {
    OFS = ", "
    numTags = split("sourceName creationDate value",tags)
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        tag = tags[tagNr]
        printf "%s%s", tag, (tagNr<numTags ? OFS : ORS)
    }
}
{
    delete tag2val
    while ( match($0,/([^=[:space:]]+)=("[^"]+")/,a) ) {
        tag = a[1]
        val = a[2]
        tag2val[tag] = val
        $0 = substr($0,RSTART+RLENGTH)
    }
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        tag = tags[tagNr]
        val = tag2val[tag]
        printf "%s%s", val, (tagNr<numTags ? OFS : ORS)
    }
}


$ awk -f tst.awk file
sourceName, creationDate, value
"Michael’s Apple Watch", "2020-05-30 07:26:39 -0400", "72"
"Wahoo", "2020-05-30 07:37:05 -0400", "83"

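The script above relies on GNU awk's third argument to match(). Where only a POSIX awk is available, the same tag-to-value array can be built from RSTART and RLENGTH instead. The following is a minimal sketch under that assumption, not a drop-in replacement: the file names sample.txt and portable_out.txt are made up for the demonstration, and the input is assumed to be as regular as the sample in the question.

```shell
# Sample input: the two records from the question, saved under a made-up name.
cat > sample.txt <<'EOF'
<Record type="HKQuantityTypeIdentifierHeartRate" sourceName="Michael’s Apple Watch" sourceVersion="6.2.5" device="&lt;&lt;HKDevice: 0x2877dc870&gt;, name:WHOOP 3A020013, manufacturer:WHOOP Inc., localIdentifier:80A56B86-0DEC-A6C3-7B22-077BD4BE4C8D&gt;" unit="count/min" creationDate="2020-05-30 07:26:39 -0400" startDate="2020-05-30 07:26:39 -0400" endDate="2020-05-30 07:26:39 -0400" value="72">
<Record type="HKQuantityTypeIdentifierHeartRate" sourceName="Wahoo" sourceVersion="3135" unit="count/min" creationDate="2020-05-30 07:37:05 -0400" startDate="2020-05-30 07:35:46 -0400" endDate="2020-05-30 07:37:01 -0400" value="83"/>
EOF

awk '
BEGIN {
    OFS = ", "
    numTags = split("sourceName creationDate value", tags, " ")
    # Header line: the wanted tag names in the wanted order.
    for (tagNr = 1; tagNr <= numTags; tagNr++)
        printf "%s%s", tags[tagNr], (tagNr < numTags ? OFS : ORS)
}
{
    split("", tag2val)      # portable way to empty the array for each record
    rest = $0
    # Locate each name="value" pair with two-argument match(); RSTART and
    # RLENGTH give its position. Unlike the original, [^"]* also accepts
    # empty values.
    while (match(rest, /[^= \t]+="[^"]*"/)) {
        pair = substr(rest, RSTART, RLENGTH)
        eq = index(pair, "=")
        tag2val[substr(pair, 1, eq - 1)] = substr(pair, eq + 1)
        rest = substr(rest, RSTART + RLENGTH)
    }
    for (tagNr = 1; tagNr <= numTags; tagNr++)
        printf "%s%s", tag2val[tags[tagNr]], (tagNr < numTags ? OFS : ORS)
}' sample.txt > portable_out.txt

cat portable_out.txt
```

The bracket expression uses a literal space and \t rather than [:space:] because some older awk builds do not support POSIX character classes.
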
score 1 · Answer 2 · answered May 30 '20 at 19:13

With this valid file.xml:

<root>
  <Record type="HKQuantityTypeIdentifierHeartRate" sourceName="Michael&#x2019;s Apple Watch" sourceVersion="6.2.5" device="&lt;&lt;HKDevice: 0x2877dc870&gt;, name:WHOOP 3A020013, manufacturer:WHOOP Inc., localIdentifier:80A56B86-0DEC-A6C3-7B22-077BD4BE4C8D&gt;" unit="count/min" creationDate="2020-05-30 07:26:39 -0400" startDate="2020-05-30 07:26:39 -0400" endDate="2020-05-30 07:26:39 -0400" value="72"/>
  <Record type="HKQuantityTypeIdentifierHeartRate" sourceName="Wahoo" sourceVersion="3135" unit="count/min" creationDate="2020-05-30 07:37:05 -0400" startDate="2020-05-30 07:35:46 -0400" endDate="2020-05-30 07:37:01 -0400" value="83"/>
</root>

Command:

xmlstarlet select --text --template --match '//Record' --value-of \
  "concat('\"',@sourceName,'\", \"', @creationDate,'\", \"',@value,'\"')" -n file.xml

Output:

"Michael’s Apple Watch", "2020-05-30 07:26:39 -0400", "72"
"Wahoo", "2020-05-30 07:37:05 -0400", "83"

score 1 · Answer 3 · answered May 30 '20 at 20:26

It's much better to use an XML parser, as in Cyrus's answer.

But if all of your data is consistent with the small sample you provided, this might work for you:

BEGIN { q="\""; FS="="; OFS=","; print "sourceName,creationDate,value" }
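{
    # Reset per record so a missing attribute doesn't inherit the previous line's value.
    sourceName = creationDate = value = ""
    for (i=1; i<NF; ++i) {
        # With FS="=", field i ends with an attribute name and field i+1 starts with its quoted value.
        v = $(i+1)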
        split(v, a, q)
        if ($i ~ / sourceName$/) sourceName = q a[2] q
        else if ($i ~ / creationDate$/) creationDate = q a[2] q
        else if ($i ~ / value$/) value = q a[2] q
    }
    print sourceName, creationDate, value
}


$ awk -f a.awk file
sourceName,creationDate,value
"Michael’s Apple Watch","2020-05-30 07:26:39 -0400","72"
"Wahoo","2020-05-30 07:37:05 -0400","83"

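The output requested in the question shows creationDate without the UTC offset, whereas the scripts above print the attribute verbatim. One possible post-processing step is sketched below. It assumes the offset always has the form " +hhmm" or " -hhmm" directly before a closing quote, and the file names with_offset.csv and no_offset.csv are made up for the demonstration.

```shell
# Sample lines in the format printed by the script above.
cat > with_offset.csv <<'EOF'
sourceName,creationDate,value
"Michael’s Apple Watch","2020-05-30 07:26:39 -0400","72"
"Wahoo","2020-05-30 07:37:05 -0400","83"
EOF

# Drop a trailing " +hhmm" or " -hhmm" that sits just before a closing quote.
# Caveat: a sourceName that happens to end the same way would be trimmed too.
awk '{ gsub(/ [+-][0-9][0-9][0-9][0-9]"/, "\""); print }' with_offset.csv > no_offset.csv

cat no_offset.csv
```

The four digits are written out instead of using {4} because not every awk supports interval expressions.
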