As SPARQL is a pattern-matching language, the trick, when your query result is too broad, is to write a more specific pattern. In this case, your intent is not to get back every resource that appears as a dbo:birthPlace value, but only those resources that represent U.S. states.
So we need to figure out how U.S. states are distinguished from other locations in DBPedia.
Let's take Kentucky as an example. The resource representing Kentucky is http://dbpedia.org/resource/Kentucky . If we scroll down the page outlining the properties of that resource, we find multiple entries for the rdf:type
relation, but the one that jumps out at me as most suitable is yago:WikicatStatesOfTheUnitedStates
(http://dbpedia.org/class/yago/WikicatStatesOfTheUnitedStates).
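If you'd rather not scroll through the resource page, the same rdf:type entries can be listed with a query. This is only a minimal sketch: the variable name ?type is my own choice, and what comes back depends on the current state of the DBPedia data.

```sparql
# List every class that the Kentucky resource is declared to be an instance of.
SELECT DISTINCT ?type
WHERE {
  <http://dbpedia.org/resource/Kentucky> a ?type .
}
ORDER BY ?type
```

Any class in that list that looks specific to U.S. states is a candidate for the extra restriction.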
If we modify your query to add that class as an extra restriction on the birthplace, and drop the weird regular expression, like so:
SELECT DISTINCT ?person ?birthPlace ?presidentStart ?presidentEnd
WHERE {
  ?person dct:subject dbc:Presidents_of_the_United_States .
  ?person dbo:birthPlace ?birthPlace .
  # Extra restriction: the birthplace itself must be typed as a U.S. state.
  ?birthPlace a yago:WikicatStatesOfTheUnitedStates .
  OPTIONAL { ?person dbp:presidentEnd ?presidentEnd } .
  OPTIONAL { ?person dbp:presidentStart ?presidentStart } .
}
# No GROUP BY: standard SPARQL 1.1 forbids selecting ungrouped variables,
# and DISTINCT already removes duplicate rows.
ORDER BY ?presidentStart ?person
LIMIT 100
You should get what you need.
Unfortunately, if you try, you find that you don't. This is because DBPedia data is messy. The above query only returns three results, and worse, one result is clearly incorrect:
person                  birthPlace         presidentStart  presidentEnd
dbr:Barack_Obama        dbr:Hawaii
dbr:George_Washington   dbr:Virginia
dbr:Theodore_Roosevelt  dbr:New_York_City
There are two things going on here. First, New York City is incorrectly classified as a state in DBPedia. Second, most presidents do not have their state explicitly marked as their birthplace, only more specific places such as their home town.
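To see how far the misclassification goes, it can help to list everything that carries the state class. The following is a rough diagnostic sketch; the variable name ?state is just a label I picked, and I'm not making any claim about what the endpoint will return today.

```sparql
PREFIX yago: <http://dbpedia.org/class/yago/>

# List every resource typed as a U.S. state, to spot misclassified entries.
SELECT DISTINCT ?state
WHERE {
  ?state a yago:WikicatStatesOfTheUnitedStates .
}
ORDER BY ?state
```

Anything in that list that is not one of the fifty states is a candidate for filtering out, either in the query or afterwards.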
Fortunately, we can amend the query slightly. DBPedia knows that Hodgenville, Kentucky (Abraham Lincoln's birthplace) is located in Kentucky. How does it know? Well, have a look at the resource page for Hodgenville: http://dbpedia.org/resource/Hodgenville,_Kentucky . You'll see that it has a dbo:isPartOf relation with the resource representing the state of Kentucky.
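The same check can be done with a small query instead of the resource page. This is a sketch under the assumption that the dbo:isPartOf statements are still present in the data; ?parent is a variable name of my own choosing. I use the full IRI here because the comma in the resource name would need escaping in a prefixed name.

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>

# Show everything that Hodgenville is declared to be part of.
SELECT DISTINCT ?parent
WHERE {
  <http://dbpedia.org/resource/Hodgenville,_Kentucky> dbo:isPartOf ?parent .
}
```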
So, we need to rephrase our query again: we want the state for each president where their birthplace is part of that state. In SPARQL:
SELECT DISTINCT ?person ?birthState ?presidentStart ?presidentEnd
WHERE {
  ?person dct:subject dbc:Presidents_of_the_United_States .
  ?person dbo:birthPlace ?birthPlace .
  # Step from the birthplace (e.g. a town) up to the state it is part of.
  ?birthPlace dbo:isPartOf ?birthState .
  ?birthState a yago:WikicatStatesOfTheUnitedStates .
  OPTIONAL { ?person dbp:presidentEnd ?presidentEnd } .
  OPTIONAL { ?person dbp:presidentStart ?presidentStart } .
}
# No GROUP BY: standard SPARQL 1.1 forbids selecting ungrouped variables,
# and DISTINCT already removes duplicate rows.
ORDER BY ?presidentStart ?person
LIMIT 100
This should get you almost exactly the result you need.
Update: as you noted, Donald Trump is missing from the list. This looks to be because DBPedia is behind the times, and he's still classified as a "presidential candidate" rather than a president.
As for Grover Cleveland appearing four times, this is an interesting anomaly. Cleveland served two non-consecutive terms as president, from 1885 to 1889, and again from 1893 to 1897. So there are two start dates and two end dates. Because DBPedia does not explicitly model which start date belongs to which end date, you simply get a result for each combination of start and end dates, four in total. There may be a way to query around this (one option would be to group start and end dates together using a group_concat aggregate), but it's such an edge case that it might be simpler to just handle it in post-processing.
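For what it's worth, the group_concat option could look roughly like the sketch below. Treat it as an untested illustration rather than a definitive fix: the result names ?presidentStarts and ?presidentEnds and the ", " separator are my own choices, and the order of values inside each concatenated string is not guaranteed.

```sparql
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX dbc:  <http://dbpedia.org/resource/Category:>
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX dbp:  <http://dbpedia.org/property/>
PREFIX yago: <http://dbpedia.org/class/yago/>

# One row per president and birth state; all start dates and all end dates
# are collapsed into one string each instead of producing a row per combination.
SELECT ?person ?birthState
       (GROUP_CONCAT(DISTINCT STR(?presidentStart); SEPARATOR=", ") AS ?presidentStarts)
       (GROUP_CONCAT(DISTINCT STR(?presidentEnd); SEPARATOR=", ") AS ?presidentEnds)
WHERE {
  ?person dct:subject dbc:Presidents_of_the_United_States .
  ?person dbo:birthPlace ?birthPlace .
  ?birthPlace dbo:isPartOf ?birthState .
  ?birthState a yago:WikicatStatesOfTheUnitedStates .
  OPTIONAL { ?person dbp:presidentStart ?presidentStart }
  OPTIONAL { ?person dbp:presidentEnd ?presidentEnd }
}
GROUP BY ?person ?birthState
ORDER BY ?presidentStarts ?person
LIMIT 100
```

This still doesn't tell you which start date pairs with which end date; it only stops the combinations from multiplying the rows. Note also that ordering by the concatenated string is a plain string sort, so post-processing may remain the cleaner route.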