
I'm trying to extract three columns from a text file that looks like this:

<Record type="HKQuantityTypeIdentifierHeartRate" sourceName="Michael’s Apple Watch" sourceVersion="6.2.5" device="&lt;&lt;HKDevice: 0x2877dc870&gt;, name:WHOOP 3A020013, manufacturer:WHOOP Inc., localIdentifier:80A56B86-0DEC-A6C3-7B22-077BD4BE4C8D&gt;" unit="count/min" creationDate="2020-05-30 07:26:39 -0400" startDate="2020-05-30 07:26:39 -0400" endDate="2020-05-30 07:26:39 -0400" value="72">
<Record type="HKQuantityTypeIdentifierHeartRate" sourceName="Wahoo" sourceVersion="3135" unit="count/min" creationDate="2020-05-30 07:37:05 -0400" startDate="2020-05-30 07:35:46 -0400" endDate="2020-05-30 07:37:01 -0400" value="83"/>

This is the information I'd like to extract:

sourceName, creationDate, value
"Michael’s Apple Watch", "2020-05-30 07:26:39", "72"
"Wahoo", "2020-05-30 07:37:05", "83"

So I basically need the source name, full creationDate and value in a comma-separated format.

The issue I'm having is that sourceName itself has multiple nested "fields" and creationDate has spaces. So my previous attempts using grep and awk all failed :)

Any help would be greatly appreciated.

  • Please post valid HTML/XHTML/XML. – Cyrus May 30 '20 at 18:26
  • [Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest using an XML/HTML parser (xmlstarlet, xmllint ...). – Cyrus May 30 '20 at 18:27

3 Answers

Whenever you have "tag=value" data, it's best to first create an array indexed by the tags (names); then you can test or print whatever you want in whatever order you want. Assuming your input is as regular as the sample you posted and you can't use an XML parser, then using GNU awk for the 3rd arg to match():

$ cat tst.awk
BEGIN {
    OFS = ", "
    numTags = split("sourceName creationDate value",tags)
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        tag = tags[tagNr]
        printf "%s%s", tag, (tagNr<numTags ? OFS : ORS)
    }
}
{
    delete tag2val
    while ( match($0,/([^=[:space:]]+)=("[^"]+")/,a) ) {
        tag = a[1]
        val = a[2]
        tag2val[tag] = val
        $0 = substr($0,RSTART+RLENGTH)
    }
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        tag = tags[tagNr]
        val = tag2val[tag]
        printf "%s%s", val, (tagNr<numTags ? OFS : ORS)
    }
}

$ awk -f tst.awk file
sourceName, creationDate, value
"Michael’s Apple Watch", "2020-05-30 07:26:39 -0400", "72"
"Wahoo", "2020-05-30 07:37:05 -0400", "83"
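
If GNU awk isn't available, the same tag-to-value idea can probably be sketched with the two-argument POSIX match() plus substr() instead of the 3rd arg to match(). This is only a sketch, under the same assumption that the input is as regular as the sample. The wrapper name `extract_records` is hypothetical, and the sub() that drops the UTC offset from creationDate is an assumption based on the expected output shown in the question:

```shell
# Hypothetical helper (not from the original answer): extract_records FILE
# Prints sourceName, creationDate and value for each line, using POSIX awk only.
extract_records() {
    awk '
    BEGIN {
        OFS = ", "
        numTags = split("sourceName creationDate value", tags, " ")
        # Header line built from the same tag list used for the data rows.
        for (i = 1; i <= numTags; i++)
            printf "%s%s", tags[i], (i < numTags ? OFS : ORS)
    }
    {
        split("", tag2val)                 # portable way to empty the array
        rest = $0
        # Each match is one whole name="value" pair, located via RSTART/RLENGTH.
        while (match(rest, /[^= \t]+="[^"]*"/)) {
            pair = substr(rest, RSTART, RLENGTH)
            eq = index(pair, "=")          # first "=" separates name from value
            tag2val[substr(pair, 1, eq - 1)] = substr(pair, eq + 1)
            rest = substr(rest, RSTART + RLENGTH)
        }
        # Assumption: the offset (e.g. " -0400") is unwanted, as in the question.
        sub(/ [-+][0-9][0-9][0-9][0-9]"$/, "\"", tag2val["creationDate"])
        for (i = 1; i <= numTags; i++)
            printf "%s%s", tag2val[tags[i]], (i < numTags ? OFS : ORS)
    }' "$1"
}
```

Called as `extract_records file`, it should print the header followed by one line per record; remove the sub() line to keep the offset as in the output above.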
Ed Morton

With this valid file.xml:

<root>
  <Record type="HKQuantityTypeIdentifierHeartRate" sourceName="Michael&#x2019;s Apple Watch" sourceVersion="6.2.5" device="&lt;&lt;HKDevice: 0x2877dc870&gt;, name:WHOOP 3A020013, manufacturer:WHOOP Inc., localIdentifier:80A56B86-0DEC-A6C3-7B22-077BD4BE4C8D&gt;" unit="count/min" creationDate="2020-05-30 07:26:39 -0400" startDate="2020-05-30 07:26:39 -0400" endDate="2020-05-30 07:26:39 -0400" value="72"/>
  <Record type="HKQuantityTypeIdentifierHeartRate" sourceName="Wahoo" sourceVersion="3135" unit="count/min" creationDate="2020-05-30 07:37:05 -0400" startDate="2020-05-30 07:35:46 -0400" endDate="2020-05-30 07:37:01 -0400" value="83"/>
</root>

Command:

xmlstarlet select --text --template --match '//Record' --value-of \
  "concat('\"',@sourceName,'\", \"', @creationDate,'\", \"',@value,'\"')" -n file.xml

Output:

"Michael’s Apple Watch", "2020-05-30 07:26:39 -0400", "72"
"Wahoo", "2020-05-30 07:37:05 -0400", "83"
Cyrus

It's much better to use an XML parser, as in Cyrus's answer.

But if all of your data is consistent with the small sample you provided, this might work for you:

BEGIN { q="\""; FS="="; OFS=","; print "sourceName,creationDate,value" }
{
    for (i=1; i<NF; ++i) {
        v = $(i+1)
        split(v, a, q)
        if ($i ~ / sourceName$/) sourceName = q a[2] q
        else if ($i ~ / creationDate$/) creationDate = q a[2] q
        else if ($i ~ / value$/) value = q a[2] q
    }
    print sourceName, creationDate, value
}

$ awk -f a.awk file
sourceName,creationDate,value
"Michael’s Apple Watch","2020-05-30 07:26:39 -0400","72"
"Wahoo","2020-05-30 07:37:05 -0400","83"
jas