Query separate lists of distinct variables in SPARQL?

Question

Say I have a query like this:

WHERE {
    <http://purl.uniprot.org/uniprot/Q8NAT1> up:classifiedWith ?annotation .
    ?protein up:classifiedWith ?annotation .
    <http://purl.uniprot.org/uniprot/Q8NAT1> up:annotation ?O3OET .
    ?O3OET a up:Topological_Domain_Annotation ;
           rdfs:comment ?topology ;
           up:range ?Q02UJ .
    ?protein a up:Protein .
    ?protein up:annotation ?otherTop .
    ?otherTop a up:Topological_Domain_Annotation ;
              rdfs:comment ?topology ;
              up:range ?OTHERRANGE .
    <http://purl.uniprot.org/uniprot/Q8NAT1> up:annotation ?S7IK0 .
    ?S7IK0 a up:Pathway_Annotation ;
           rdfs:seeAlso ?pathway .
    ?protein a up:Protein .
    ?protein up:annotation ?VAR2 .
    ?VAR2 a up:Pathway_Annotation ;
          rdfs:seeAlso ?pathway .
    <http://purl.uniprot.org/uniprot/Q8NAT1> up:citation ?citation .
    ?protein up:citation ?citation .
}
GROUP BY ?protein

I'm trying to retrieve the unique values of each variable, without the full Cartesian product that SPARQL typically returns. What I want is a list of all distinct matches for each queried variable.

I.e., if there are 10 distinct proteins and 2 distinct annotations, how do I get just those results? Do I have to make separate queries?

Does this answer your question? [Aggregating results from SPARQL query](https://stackoverflow.com/questions/18212697/aggregating-results-from-sparql-query) — Jeen Broekstra, Mar 05 '20 at 04:27
In UniProt each annotation has its own IRI. @JeenBroekstra's answer is right, but feel free to write to help@uniprot.org for questions about the biology and the UniProt data model. — Jerven, Mar 06 '20 at 15:23
@JeenBroekstra this is actually exactly what I needed. Thanks so much! — Kenny Workman, Mar 11 '20 at 14:58

Jeen Broekstra · Accepted Answer · 2020-03-05T08:16:12.390

There are several possible approaches to this.

Use a `CONSTRUCT` query

When selecting loads of different variables, you get a "Cartesian" result because you're representing multiple pattern matches as a tabular structure: each slightly different match gets its own 'row' in the result. A CONSTRUCT query does not return a tabular structure, but returns the subgraph that matches your data. Assuming you are using a library that has some decent support for RDF graph traversal, this might actually be easier and more natural to process than a complex SELECT query.
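As a rough sketch of what that could look like, here is a cut-down `CONSTRUCT` version covering only the `?annotation` and `?citation` parts of your query. The `up:` prefix declaration is an assumption (your question does not show its prefixes), and the choice of which triples to put in the template is mine, purely for illustration:

```sparql
# Assumed prefix: the question does not show its PREFIX declarations.
PREFIX up: <http://purl.uniprot.org/core/>

# The template describes the triples to return for every match.
CONSTRUCT {
  ?protein up:classifiedWith ?annotation ;
           up:citation ?citation .
}
WHERE {
  <http://purl.uniprot.org/uniprot/Q8NAT1> up:classifiedWith ?annotation ;
                                           up:citation ?citation .
  ?protein up:classifiedWith ?annotation ;
           up:citation ?citation .
}
```

Because the result is an RDF graph, which is a set of triples, each `?protein up:classifiedWith ?annotation` triple appears only once no matter how many citations it was combined with during matching.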

Use `GROUP_CONCAT`

You can use the GROUP_CONCAT aggregate operator to produce a result where multiple values for a variable are concatenated into a single string. For example, if you previously had this:

  SELECT ?protein ?annotation
   ....

and you got back something like this:

protein1 annotation1
protein1 annotation2
protein2 annotation3
protein2 annotation4
...

you can use this instead:

SELECT ?protein (GROUP_CONCAT(?annotation) as ?annotations)

and your result will look like this:

protein1 "annotation1 annotation2"
protein2 "annotation3 annotation4"
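A fuller sketch of that, again limited to two of your variables and assuming the same `up:` prefix as UniProt's core vocabulary, might look like the query below. `DISTINCT` inside the aggregate removes the repeats caused by the cross product, `STR()` turns IRIs into plain strings before concatenation, and the `", "` separator is just an arbitrary choice (the default is a single space):

```sparql
# Assumed prefix: the question does not show its PREFIX declarations.
PREFIX up: <http://purl.uniprot.org/core/>

SELECT ?protein
       (GROUP_CONCAT(DISTINCT STR(?annotation); SEPARATOR=", ") AS ?annotations)
       (GROUP_CONCAT(DISTINCT STR(?citation); SEPARATOR=", ") AS ?citations)
WHERE {
  <http://purl.uniprot.org/uniprot/Q8NAT1> up:classifiedWith ?annotation ;
                                           up:citation ?citation .
  ?protein up:classifiedWith ?annotation ;
           up:citation ?citation .
}
# One row per protein; every non-grouped variable must be aggregated.
GROUP BY ?protein
```

Each protein then comes back as a single row, with one string per variable that you can split on the separator in your client code.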

Use multiple queries

Another option is to use multiple queries: the first query just retrieves the resource identifiers (the proteins, in your case). Then you iterate over the result and for each resource identifier, do a followup query that gets the additional attributes of interest for that particular resource.
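Sketched out for the `?annotation` part of your query only, that could be a pair of queries like the following. The `up:` prefix is assumed, and the protein IRI in the second query is only a placeholder that your client code would replace with each result of the first query:

```sparql
# Query 1: only the distinct proteins that share a classification with Q8NAT1.
PREFIX up: <http://purl.uniprot.org/core/>

SELECT DISTINCT ?protein
WHERE {
  <http://purl.uniprot.org/uniprot/Q8NAT1> up:classifiedWith ?annotation .
  ?protein up:classifiedWith ?annotation .
}
```

```sparql
# Query 2: run once per protein returned by query 1.
PREFIX up: <http://purl.uniprot.org/core/>

SELECT DISTINCT ?annotation
WHERE {
  # Placeholder IRI: substitute the protein from the current iteration.
  VALUES ?protein { <http://purl.uniprot.org/uniprot/P12345> }
  <http://purl.uniprot.org/uniprot/Q8NAT1> up:classifiedWith ?annotation .
  ?protein up:classifiedWith ?annotation .
}
```

The trade-off is more round trips to the endpoint in exchange for small, flat result sets that never multiply against each other.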

Thanks! Any downsides to CONSTRUCT queries either in implementation or conceptually as a query model other than those mentioned? (Do they see real world implementation?) — Kenny Workman, Mar 11 '20 at 14:55
The only real downside I can think of is that, as a developer, it takes some getting used to thinking of a query result as a graph rather than a table. But implementation-wise it should not make much of a difference. I've certainly used them in "real world" applications. — Jeen Broekstra, Mar 11 '20 at 20:33
