Is there a better XPATH syntax than ////[12] when //.../@NODE does not work?

Question

While I am successful in getting (US) PMMS 30Y Fixed Mortgage Rates via XPATH, I am not as successful in getting the (US) Federal Curve Yields. What's different about these XML data sets that makes one work and not the other?

I have tried many variations and the only variation that works with this data set is using some series of */*/*/* with an array qualifier [12].

The total count(*//) of elements in this data set is 180.

I am using Google Sheets IMPORTXML(url,xpath) to perform this action.

THIS WORKS

=IMPORTXML("https://www.treasury.gov/resource-center/data-chart-center/interest-rates/pages/XmlView.aspx?data=yield","*/*/*/*/*[12]")

THIS DOES NOT

=IMPORTXML("https://www.treasury.gov/resource-center/data-chart-center/interest-rates/pages/XmlView.aspx?data=yield","*//d:BC_10YEAR")

NOR ANY OF THESE

*//*@d:BC_10YEAR
*//@d:BC_10YEAR
*//d:BC_10YEAR[1]
*//m:properties[@d:BC_10YEAR]
*//*m:properties[@d:BC_10YEAR]

And many other variations -- I think I have tried them all, but OBVIOUSLY something is eluding me.

The ideal result is to specifically get the latest BC_10YEAR which is as of this posting and according to the data set (2019-04-09T20:36:32Z) is 2.49.

<d:BC_10YEAR m:type="Edm.Double">2.49</d:BC_10YEAR>

DATA SETS ENCODED URLS

US TREASURY: https://data.treasury.gov/feed.svc/DailyTreasuryYieldCurveRateData?$filter=month(NEW_DATE)%20eq%204%20and%20year(NEW_DATE)%20eq%202019

FREDDIE MAC http://www.freddiemac.com/pmms/pmmsthin.html

With XPATH Query of:

http://www.freddiemac.com/pmms/pmmsthin.html        IMPORTXML(url,xpath)
==========================================================================  
XPATH   //th[contains(text(),"30YR")]//..//td[1]    4.08    <== 30YR FMR
XPATH   //th[contains(text(),"30YR")]//..//td[2]    0.5     <== 30YR Fees & Points

Maybe a name space problem. XPath does not know name spaces. See here: https://stackoverflow.com/q/4440451/402322 — ceving, Apr 10 '19 at 07:17
Thank you -- good info and insight, reaffirmed by Michael's response as well. I will try both the 1.0 and 2.0 variants to see if either or both work. This may also tell us if Google Sheets IMPORTXML function is 1.0 or 2.0 as well. — Derek A, Apr 10 '19 at 20:00

score 0 · Accepted Answer · answered Apr 10 '19 at 09:30

0

When you say "better", what are your criteria? Are you after performance, readability, or are you looking for an expression that will still work if there are minor changes to the page content?

Probably the main reason your other attempts are failing is because of namespaces. If there are names in the XPath expression then both the local name and namespace URI need to match with what is in the source document. The way prefixes are bound to namespaces for executing XPath depend on the XPath API you are using, and some people find it more convenient to ignore namespaces by using constructs of the form *[local-name()='xyz']/*[local-name()='abc']. In XPath 2.0 you can write /*:xyz/*:abc, but I guess if you were using XPath 2.0 you would probably have said so.

answered Apr 10 '19 at 09:30

Michael Kay

156,231
11
92
164

Michael -- thank you. By BETTER I meant, a more intelligent and dynamic construct. The construct I am using is bound to the underlying framework (style) of the layout of the document. If the provider changes that format , my construct will not return the desired result. As for XPath 1.0 or 2.0 -- I honestly don't know. I am using a Google Sheets IMPORTXML function and I am not able to locate any information about whether it uses 1.0 or 2.0 -- and I personally have not developed the knowledge to discern how to tell the difference. I will try your suggestion and post the results. – Derek A Apr 10 '19 at 19:57
I don't think "intelligent" here means anything, but I understand what you mean by "dynamic". Of course any XPath expression is liable to become invalid when the page changes; it's just that some paths are more fragile than others. Screen-scraping is the low life of software engineering, I'm afraid. – Michael Kay Apr 10 '19 at 23:30
The best documentation I could find for IMPORT_XML was at https://support.google.com/docs/answer/3093342 -- which points to w3schools of all places for more information on XPath! That doesn't inspire confidence. You can certainly assume with this kind of non-specification that if it supported XPath 2.0, they would say so. The IMPORTXML() api appears to provide no way to bind namespace prefixes, which is a pretty devastating limitation. – Michael Kay Apr 10 '19 at 23:38
While the **local-name()** certainly makes the query better (less dependent on format), the query `//*[local-name()='BC_10YEAR']` does return the list of rates (each matching node) as expected. Per `count(//*[local-name()='BC_10YEAR'])` there are **8** items, but I am unable to narrow the query down to a specific item. For example: `//*[local-name()='NEW_DATE'='2019-04-10T00:00:00'] and //*[local-name()='BC_10YEAR']` returns TRUE, because `NEW_DATE` is true, but does not return the specific item, or even a list. ... – Derek A Apr 11 '19 at 07:48
Both: `//*[local-name()='BC_10YEAR'][8]` or `//*[local-name()='BC_10YEAR'][last()]` returns the list of each of the **BC_10YEAR** values because that answers the query. I have tried several variants to no avail. I am confident this is resolvable, but at this moment it escapes me. I will keep at it and will post the answer when I discover it -- unless some one else beats me to it. – Derek A Apr 11 '19 at 07:48
`//*[local-name()='NEW_DATE'][contains(text(),'2019-04-10')]` returns just the specified date for the node as desired, so I am assuming I am only working with this node. Using this as the first-level filter, I appended `//*[local-name()='BC_10YEAR']` and get **#N/A**. Yet `//*[local-name()='BC_10YEAR']` will return the list. – Derek A Apr 11 '19 at 08:04
**SOLVED** `//*[local-name()='NEW_DATE'][contains(text(),'2019-04-10')]//..//*[local-name()='BC_10YEAR']` Thank you @Michael for heading me in SOME direction. – Derek A Apr 11 '19 at 08:16
**GOOGLE FORMULA** To keep the IMPORTXML() formula clean, I place it in its own cell, with references to a cell with just the URL and another with the XPATH query. The XPATH query formula is: `="//*[local-name()='NEW_DATE'][contains(text(),'"&text(today()-1,"yyyy-mm-dd")&"')]//..//*[local-name()='BC_10YEAR']"` Since I only need the previous day's update, I get today and subtract 1 (day) from it. Then reformat the date string to match the XML data. – Derek A Apr 11 '19 at 08:33

score 0 · Answer 2 · answered Feb 10 '21 at 10:51

I found a simple solution to get a full table that uses uses IMPORTFEED and SPLIT.

The final product is a single IMPORTFEED request at the "A2" cell, that returns something like this:

Date	1M	2M	3M	6M	1Y	2Y	3Y	5Y	7Y	10Y	20Y	30Y
2021-01-06T00:00:00	0.09	0.09	0.09	0.09	0.11	0.14	0.2	0.43	0.74	1.04	1.6	1.81
2021-01-11T00:00:00	0.09	0.08	0.08	0.1	0.1	0.14	0.22	0.5	0.84	1.15	1.68	1.88
2021-01-14T00:00:00	0.09	0.09	0.09	0.09	0.1	0.16	0.23	0.49	0.82	1.15	1.69	1.88
2021-01-20T00:00:00	0.08	0.08	0.08	0.1	0.1	0.13	0.19	0.45	0.78	1.1	1.65	1.84
2021-01-25T00:00:00	0.07	0.08	0.09	0.09	0.1	0.13	0.17	0.42	0.73	1.05	1.61	1.8

TL;DR

To get the most recent 10-year Treasury rate (BC_10YEAR):

=QUERY(ARRAYFORMULA(SPLIT(IMPORTFEED("https://data.treasury.gov/feed.svc/DailyTreasuryYieldCurveRateData?$orderby=NEW_DATE desc&$top=1")," ")),"Select Col12")

Explanation

1. Use XQuery on the URL

Assume I have a Named Range called UrlFilterCell, and it has something like this:

="NEW_DATE eq datetime'" & TEXT(TODAY()-1,"yyyy-MM-dd") & "'"

You can construct this string to be yesterday, or something exotic, like year(NEW_DATE) gt 2020 and Id mod 3 eq 0 (which returns every 3rd record where the year is after 2020).

Then the IMPORTFEED portion would be constructed with the $filter= querystring:

IMPORTFEED("https://data.treasury.gov/feed.svc/DailyTreasuryYieldCurveRateData?$filter="&UrlFilterCell)

2. Exploit the space-delimited `content`

The returned XML is formatted as an Atom feed, with the troublesome fields inside the <content>:

  <entry>
    ...
    <content type="application/xml">
      <m:properties>
        <d:Id m:type="Edm.Int32">7700</d:Id>
        <d:NEW_DATE m:type="Edm.DateTime">2020-10-06T00:00:00</d:NEW_DATE>
        <d:BC_1MONTH m:type="Edm.Double">0.08</d:BC_1MONTH>
        ...

Nastiness. Except, when using IMPORTFEED, everything inside <content> is automatically written out as a space-delimited string:

" 7784 2021-02-09T00:00:00 0.04 0.04 0.04 0.06 0.07 0.11 0.19 0.48 0.83 1.18 
1.78 1.95 1.95 "

Great! Simply use SPLIT to convert each delimited-string into separate columns.

Then, use QUERY to select only the values we want. If we were interested in the date (Col2) and all 12 durations (BC_1MONTH, ..., BC_30YEAR), then the formula would be:

=QUERY(ARRAYFORMULA(SPLIT(IMPORTFEED(UrlCell)," ")),"Select Col2,Col3,Col4,Col5,Col6,Col7,Col8,Col9,Col10,Col11,Col12,Col13,Col14")

Modify the columns to your needs, and don't forget to construct UrlCell to your needs. For example, "SELECT Col12" would return just the BC_10YEAR value.

Is there a better XPATH syntax than */*/*/*/*[12] when *//.../@NODE does not work?

2 Answers2

TL;DR

Explanation

1. Use XQuery on the URL

2. Exploit the space-delimited content

Is there a better XPATH syntax than ////[12] when //.../@NODE does not work?

2. Exploit the space-delimited `content`