Attempt 1: Optimizing the query
Some observations:
- Instead of
p:P31/ps:P31
, you could use wdt:P31
which is faster by avoiding the two-property hop, but finds only the truthy statements
- The expensive part is the call to the label service at the end, as can be seen by commenting that line out by placing
#
at the start of the line
- The query retrieves every claim on every city (many!), gets the properties of the claims (few!), and only removes the duplicates in the end (with
DISTINCT
)
- As a result, the label service is called many times for the same property, once per claim! This is the big problem with the query
- This can be avoided by moving the retrieval of properties with the
DISTINCT
into a subquery, and calling the label service only at the end on the few properties
- After that change it should be fast, but is still slow because the query optimiser seems to evaluate the query in the wrong order. Following hints from this page, we can turn the query optimiser off.
This works for me:
SELECT ?p ?pLabel ?pAltLabel {
hint:Query hint:optimizer "None" .
{
SELECT DISTINCT ?p_ {
VALUES ?class { wd:Q515 }
?x wdt:P31 ?class.
?x ?p_ [].
}
}
?p wikibase:claim ?p_.
?p wikibase:propertyType ?pType.
FILTER (?pType = wikibase:Quantity)
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
Attempt 2: Splitting the task into multiple queries
Having learned that the approach above doesn't work for some of the biggest categories (like wd:Q5
, “human”), I tried a different approach. This will get all results, but not in a single query. It requires sending ~25 individual queries and combining the results afterwards:
- We start by listing the quantity properties. There are, as of today, 503 of them.
- We want to keep only those properties that are actually used on an item of type “human”.
- Because that check is so slow (it needs to look at millions of items), we start by only checking the first 20 properties from our list.
- In the second query, we're going to check the next 20, and so on.
This is the query that tests the first 20 properties:
SELECT DISTINCT ?p ?pLabel ?pAltLabel {
hint:Query hint:optimizer "None" .
{
SELECT ?p ?pLabel ?pAltLabel {
?p wikibase:propertyType wikibase:Quantity.
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
OFFSET 0 LIMIT 20
}
?p wikibase:claim ?p_.
?x ?p_ [].
?x wdt:P31 wd:Q5.
}
Increase the OFFSET
to 20, 40, 60 and so on, up to 500, to test all properties.