I've written a query without using group_concat and get 9 rows returned, one row for each occupation that Einstein held.

When I add group_concat on the occupation column, that column is null. I don't understand what I'm doing wrong here. What I expect to see is 1 row with all 9 occupations in the occupations column.

Here is the simple query.

SELECT ?item ?itemLabel ?genderLabel (GROUP_CONCAT(?occupationLabel) AS ?occupations)
WHERE {
  ?item wdt:P31 wd:Q5.
  ?item ?label "Albert Einstein"@en.
  ?item wdt:P21 ?gender .
  ?item wdt:P106 ?occupation .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?item ?itemLabel ?genderLabel

EDIT:

Here is the code that is producing the duplicate values.

SELECT ?item ?itemLabel ?genderLabel (GROUP_CONCAT(?occupationLabel) AS ?occupations)
WHERE {
  ?item wdt:P31 wd:Q5.
  ?item ?label "Albert Einstein"@en.
  ?item wdt:P21 ?gender .
  OPTIONAL {
    ?item wdt:P106 ?occupation .
    ?occupation rdfs:label ?occupationLabel
    FILTER(LANGMATCHES(LANG(?occupationLabel), 'en'))
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?item ?itemLabel ?genderLabel

Running this query gives me the following:

professor professor physicist physicist inventor educationalist educationist university teacher academic science writer non-fiction writer philosopher of science theoretical physicist

Both `professor` and `physicist` appear twice.


2nd EDIT:

Also worth noting: when I modify the query to not use rdfs:label, I get the right concatenated result in the occupations column (I've added the parentheses and the labels to the URLs myself):

http://www.wikidata.org/entity/Q121594 (professor) 
http://www.wikidata.org/entity/Q169470 (physicist) 
http://www.wikidata.org/entity/Q205375 (inventor) 
http://www.wikidata.org/entity/Q1231865 (educationalist)
http://www.wikidata.org/entity/Q1622272 (university teacher)
http://www.wikidata.org/entity/Q3745071 (science writer)
http://www.wikidata.org/entity/Q15980158 (non-fiction writer)
http://www.wikidata.org/entity/Q16389557 (philosopher of science)
http://www.wikidata.org/entity/Q19350898 (theoretical physicist)

So, I guess my question now is, is it possible to get one label per ID?

NaN
  • It doesn't work with the Wikidata label service. You have to use "good old" SPARQL, i.e. `?occupation rdfs:label ?occupationLabel . FILTER(langmatches(lang(?occupationLabel), 'en'))` – UninformedUser Feb 18 '18 at 20:06
  • Or like this: `SERVICE wikibase:label { bd:serviceParam wikibase:language "en". ?item rdfs:label ?itemLabel . ?gender rdfs:label ?genderLabel . ?occupation rdfs:label ?occupationLabel }` – UninformedUser Feb 18 '18 at 20:10
  • @AKSW, Thanks for the help. I've actually managed to find that same syntax on a different post just moments ago, and yes, that does help. I am now having another issue with the results though. Even though I use `distinct`, I am still getting dupes in the result set. Would you know why this is? – NaN Feb 18 '18 at 20:35
  • You use `distinct` where? Inside the `group_concat`? And where do you get the duplicate? – UninformedUser Feb 19 '18 at 02:28
  • I'm using distinct inside group_concat. I ran that query without `distinct` and can see many more duplicates, so it is in fact taking *most* of the dupes out of the result, but there are still 2 or 3 duplicated values for some of the rows in the result set. The duplicate values appear in the `occupations` column after the occupations were `group_concat`'ed. – NaN Feb 19 '18 at 04:35
  • @AKSW, I am posting the SPARQL up top in an edit. If you run it, you can see the duplicated values. Again, thanks for the help. – NaN Feb 19 '18 at 04:48
  • Please use `SERVICE wikibase:label { bd:serviceParam wikibase:language "en". ?item rdfs:label ?itemLabel . ?gender rdfs:label ?genderLabel . ?occupation rdfs:label ?occupationLabel }` – UninformedUser Feb 19 '18 at 08:31
  • @AKSW, that works, thank you! If you want to post your last comment as the answer, I'd be happy to credit you for it. I sure am glad there are people around like you that know rdf! Just curious, but can you tell me what the essence of your code does that fixes this? – NaN Feb 19 '18 at 08:41
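
A likely reason for the leftover duplicates in the query from the first edit: `LANGMATCHES(LANG(?occupationLabel), 'en')` is a range match, so it also accepts regional tags such as `en-gb` or `en-ca`. Literals that differ only in their language tag are still different RDF terms, so `DISTINCT` keeps both of them. The sketch below is one way to check this. It assumes the prefixes predefined by the Wikidata Query Service, and the `?lang` variable is introduced here only for illustration.

```sparql
# List each occupation label that matches the "en" range, with its language tag.
# If the same text shows up under several tags (for example en and en-gb),
# that explains the repeated values after GROUP_CONCAT(DISTINCT ...).
SELECT DISTINCT ?occupation ?occupationLabel (LANG(?occupationLabel) AS ?lang)
WHERE {
  ?item wdt:P31 wd:Q5 .
  ?item ?label "Albert Einstein"@en .
  ?item wdt:P106 ?occupation .
  ?occupation rdfs:label ?occupationLabel .
  FILTER(LANGMATCHES(LANG(?occupationLabel), 'en'))
}
ORDER BY ?occupation ?lang
```

If the tags do differ, replacing the range match with the exact comparison `FILTER(LANG(?occupationLabel) = "en")` should leave at most one label per occupation ID.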

1 Answer

The rough idea is to list explicit `rdfs:label` triple patterns inside the label service block instead of relying on the automatic `?xLabel` variables of the "label service":

SELECT ?item ?itemLabel ?genderLabel (GROUP_CONCAT(?occupationLabel) AS ?occupations)
WHERE {
  ?item wdt:P31 wd:Q5.
  ?item ?label "Albert Einstein"@en.
  ?item wdt:P21 ?gender .
  OPTIONAL {
    ?item wdt:P106 ?occupation .
  }
  SERVICE wikibase:label { 
    bd:serviceParam wikibase:language "en". 
    ?item rdfs:label ?itemLabel . 
    ?gender rdfs:label ?genderLabel . 
    ?occupation rdfs:label ?occupationLabel 
  }
}
GROUP BY ?item ?itemLabel ?genderLabel
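
As far as I understand it, the automatic mode of the label service only fills in `?xLabel` variables that appear directly in the SELECT clause, so a label that is used only inside `GROUP_CONCAT` stays unbound. Listing the patterns explicitly avoids that. A possible variant of the query above, offered as a sketch rather than a tested result, adds `DISTINCT` and a separator. Both are standard SPARQL 1.1 aggregate options; the choice of `", "` as the separator is only an assumption about the desired output.

```sparql
SELECT ?item ?itemLabel ?genderLabel
       # DISTINCT drops labels repeated by row multiplication, e.g. when the
       # unbound ?label predicate matches the name through several properties.
       (GROUP_CONCAT(DISTINCT ?occupationLabel; SEPARATOR=", ") AS ?occupations)
WHERE {
  ?item wdt:P31 wd:Q5 .
  ?item ?label "Albert Einstein"@en .
  ?item wdt:P21 ?gender .
  OPTIONAL {
    ?item wdt:P106 ?occupation .
  }
  # Explicit patterns: the service binds exactly these label variables.
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
    ?item rdfs:label ?itemLabel .
    ?gender rdfs:label ?genderLabel .
    ?occupation rdfs:label ?occupationLabel
  }
}
GROUP BY ?item ?itemLabel ?genderLabel
```

Without `SEPARATOR`, `GROUP_CONCAT` joins the values with a single space, which makes multi-word labels such as "university teacher" hard to tell apart in the result.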
UninformedUser
  • Thanks for the explanation. Can you tell me if it is possible to do a case insensitive search with this query? – NaN Feb 20 '18 at 18:10
  • With standard SPARQL this will be pretty expensive, as you have to scan all literal values in the dataset and do string comparison: `?item ?label ?labelValue . FILTER(lcase(str(?labelValue)) = lcase("Albert Einstein"))` I'm not sure whether this will scale. Or `FILTER(contains(lcase(str(?labelValue)), "albert"))`. Alternatively, you can use the `regex` function with the `i` flag, but it's also expensive. The best case would be using a fulltext index; not sure if something exists for Wikidata – UninformedUser Feb 20 '18 at 19:08
  • Ok, fulltext search is **not** enabled for Wikidata, but would be possible via the Blazegraph endpoint. I guess you have to live with the `FILTER`-based solutions. – UninformedUser Feb 20 '18 at 19:13
  • OK, thanks, AKSW. I was afraid that you were going to say that. I've already tried your suggestions and the connection times out on the string comparison as well as regex. Thanks for pointing me to Blazegraph. I'll look into that today. Thanks again, man! – NaN Feb 20 '18 at 19:55