
I'm trying to extract all the parents of a given GO ID (a node) using the EBI RDF SPARQL endpoint. I based my query on two similar questions. Here are two examples that illustrate the problem:

Example 1 (Link to the structure):

biological_process (GO:0008150)
           |__ metabolic process (GO:0008152)
                           |__ methylation (GO:0032259)

In this example, using the following query:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dbpedia2: <http://dbpedia.org/property/>
PREFIX dbpedia: <http://dbpedia.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

PREFIX obo: <http://purl.obolibrary.org/obo/>

SELECT (count(?mid) as ?depth)
       (group_concat(distinct ?midId ; separator = " / ") AS ?treePath) 
FROM <http://rdf.ebi.ac.uk/dataset/go> 
WHERE {
    obo:GO_0032259 rdfs:subClassOf* ?mid .
    ?mid rdfs:subClassOf* ?class .
    ?mid <http://www.geneontology.org/formats/oboInOwl#id> ?midId.
}
GROUP BY ?treePath
ORDER BY ?depth

I got the desired results without problems:

c |              treePath
--|-------------------------------------
6 | GO:0008150 / GO:0008152 / GO:0032259

But when the term exists in multiple branches (e.g. GO:0007267), as in the case below, the previous approach doesn't work:

Example 2 (Link to the structure)

biological_process (GO:0008150)
           |__ cellular_process (GO:0009987)
           |           |__ cell communication (GO:0007154)
           |                       |__ cell-cell signaling (GO:0007267)
           |
           |__ signaling (GO:0023052)
                      |__ cell-cell signaling (GO:0007267)

The result:

c |                            treePath
--|---------------------------------------------------------------
15| GO:0007154 / GO:0007267 / GO:0008150 / GO:0009987 / GO:0023052

What I wanted to get is the following:

GO:0008150 / GO:0009987 / GO:0007154 / GO:0007267
GO:0008150 / GO:0023052 / GO:0007267

As I understand it, under the hood I'm calculating the depth of each level and using it to construct the path. This works fine when an element belongs to only one branch:

SELECT (count(?mid) as ?depth) ?midId
FROM <http://rdf.ebi.ac.uk/dataset/go> 
WHERE {
    obo:GO_0032259 rdfs:subClassOf* ?mid .
    ?mid rdfs:subClassOf* ?class .
    ?mid <http://www.geneontology.org/formats/oboInOwl#id> ?midId.
}
GROUP BY ?midId
ORDER BY ?depth

The result:

depth |   midId
------|------------
1     | GO:0008150
2     | GO:0008152
3     | GO:0032259

In the second example, things get messed up and I don't understand why. I'm fairly sure part of the problem is terms that share the same depth/level, but I don't know how to solve this.

depth |   midId
------|------------
2     | GO:0008150
2     | GO:0009987
2     | GO:0023052
3     | GO:0007154
6     | GO:0007267
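One reading that reproduces the numbers above, offered as an assumption rather than confirmed endpoint behaviour: if each `rdfs:subClassOf*` match yields one row per path (not one row per distinct node), then the count for a term is the number of paths that reach it multiplied by the number of upward paths leaving it. The sketch below uses a hand-copied parent map from Example 2; `PARENTS`, `walks_up`, and `depth_counts` are illustrative names, not part of any library.

```python
from collections import Counter

# Direct parents copied by hand from Example 2 (child -> parents).
PARENTS = {
    "GO:0007267": ["GO:0007154", "GO:0023052"],
    "GO:0007154": ["GO:0009987"],
    "GO:0009987": ["GO:0008150"],
    "GO:0023052": ["GO:0008150"],
    "GO:0008150": [],
}

def walks_up(node):
    """Yield the end node of every upward path from node (including the empty path)."""
    yield node
    for parent in PARENTS[node]:
        yield from walks_up(parent)

def depth_counts(start):
    """Count one row per (path to ?mid, path from ?mid to ?class) pair."""
    counts = Counter()
    for mid in walks_up(start):          # start subClassOf* ?mid, one row per path
        for _cls in walks_up(mid):       # ?mid subClassOf* ?class, one row per path
            counts[mid] += 1
    return counts

for term, depth in sorted(depth_counts("GO:0007267").items(), key=lambda kv: kv[1]):
    print(depth, term)
```

Under that assumption, GO:0008150 is reached by two paths and therefore gets 2 instead of 1, and GO:0007267 gets 6 because the root is counted once per branch. The counts no longer correspond to depth once a term has more than one parent.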
Bilal
  • Impossible with SPARQL. It doesn't care about branches in a tree, it's just about pattern matching. You can't distinguish in the query between the branches, thus you can't traverse each . Note, that's not the purpose of SPARQL, it is not a graph traversal language. GraphQL, Gremlin, etc. would be much better languages for this use case. – UninformedUser Feb 20 '19 at 12:02
  • Note, I'm just referring to a solution using a single SPARQL query. Indeed, you could on the client side by using multiple queries executed iteratively to traverse the paths in the tree. – UninformedUser Feb 20 '19 at 12:05
  • @AKSW Thank you for your quick answer, I did create (using Ontology Lookup Service API) a recursive function that goes through the tree and get all the parents but it took a lot of time to execute and I'm afraid that I'll have the same problem using multiple queries as you suggested (I'm iterating through a file with +50k GO ID); I'll check out GraphQL and Gremlin. – Bilal Feb 20 '19 at 12:25
  • As your question is specific to the [Virtuoso-powered EMBL-EBI endpoint](https://www.ebi.ac.uk/rdf/services/sparql), I would suggest you bring this to the [OpenLink Community Forum](https://community.openlinksw.com) where developers of Virtuoso can assist more quickly. Virtuoso does not support GraphQL nor Gremlin, but there are likely other ways to achieve your goals. – TallTed Feb 20 '19 at 14:24
  • Also worth noting, EMBL-EBI are still running with a 7.2.4.2 build (`07.20.3217`) from August 2016, and should be encouraged to upgrade to 7.2.5.1 (`07.20.3229` from August 2018) or later. – TallTed Feb 20 '19 at 14:27
  • @Bilal yes, I understand - it's most likely that multiple queries can become slower, but at least from what I think resp. know - and clearly I might be wrong - you cannot distinguish the paths from where the ancestor nodes have been found. Do you need those paths for visualization? Maybe it's also possible to load the datasets into some other graph store? Not sure what requirements you have in your project. – UninformedUser Feb 20 '19 at 15:08
  • Thanks @TallTed, next time I'll post in the OpenLink Community Forum (sorry I forgot about it) – Bilal Feb 20 '19 at 17:15
  • `you cannot distinguish the paths from where the ancestor nodes have been found` @AKSW I totally agree with what you said. Your suggestion of using GraphQL was interesting (please check out my answer below). – Bilal Feb 20 '19 at 17:15
  • @AKSW you should explain this more completely somewhere; many folks (including me) do not fully understand the distinction: `Impossible with SPARQL. It doesn't care about branches in a tree, it's just about pattern matching. You can't distinguish in the query between the branches, thus you can't traverse each . Note, that's not the purpose of SPARQL, it is not a graph traversal language. GraphQL, Gremlin, etc. would be much better languages for this use case.` – Jay Gray Feb 20 '19 at 21:46
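The client-side traversal suggested in the comments can be sketched as follows. This is a minimal illustration, assuming the direct `rdfs:subClassOf` edges have already been fetched into a dictionary (here copied by hand from Example 2); `PARENTS` and `paths_to_root` are hypothetical names, and no endpoint is contacted.

```python
# Direct parents copied by hand from Example 2 (child -> parents).
# In practice this map would be filled from query results; that step is not shown.
PARENTS = {
    "GO:0007267": ["GO:0007154", "GO:0023052"],
    "GO:0007154": ["GO:0009987"],
    "GO:0009987": ["GO:0008150"],
    "GO:0023052": ["GO:0008150"],
    "GO:0008150": [],
}

def paths_to_root(node, parents):
    """Return every root-first path ending at node, one list per branch."""
    direct = parents.get(node, [])
    if not direct:
        # No parents: this node is a root, so the path is just the node itself.
        return [[node]]
    return [
        path + [node]
        for parent in direct
        for path in paths_to_root(parent, parents)
    ]

for path in paths_to_root("GO:0007267", PARENTS):
    print(" / ".join(path))
# GO:0008150 / GO:0009987 / GO:0007154 / GO:0007267
# GO:0008150 / GO:0023052 / GO:0007267
```

Because the recursion only touches the in-memory map, the cost per term depends on how many branches it has, not on the number of round trips to the endpoint. The sketch assumes the hierarchy has no cycles.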

1 Answer

Thanks to @AKSW I found a decent solution using HyperGraphQL (a GraphQL interface for querying and serving linked data on the Web).

I'll leave the detailed answer here, it may help someone.

  1. I downloaded and set up HyperGraphQL (see its download page).
  2. I linked it to the EBI SPARQL endpoint as described in this tutorial.

    The config.json file I used:

    {
        "name": "ebi-hgql",
        "schema": "ebischema.graphql",
        "server": {
            "port": 8081,
            "graphql": "/graphql",
            "graphiql": "/graphiql"
        },
        "services": [
            {
                "id": "ebi-sparql",
                "type": "SPARQLEndpointService",
                "url": "http://www.ebi.ac.uk/rdf/services/sparql",
                "graph": "http://rdf.ebi.ac.uk/dataset/go",
                "user": "",
                "password": ""
            }
        ]
    }
    

    Here's what my ebischema.graphql file looks like (I only needed Class, id, label and subClassOf):

    type __Context {
        Class:          _@href(iri: "http://www.w3.org/2002/07/owl#Class")
        id:             _@href(iri: "http://www.geneontology.org/formats/oboInOwl#id")
        label:          _@href(iri: "http://www.w3.org/2000/01/rdf-schema#label")
        subClassOf:     _@href(iri: "http://www.w3.org/2000/01/rdf-schema#subClassOf")
    }
    
    type Class @service(id:"ebi-sparql") {
        id: [String] @service(id:"ebi-sparql")
        label: [String] @service(id:"ebi-sparql")
        subClassOf: [Class] @service(id:"ebi-sparql")
    }
    
  3. I started testing some simple queries but kept getting an empty response; the answer to this issue solved my problem.

  4. Finally I constructed the query to get the tree

    Using this query:

    {
      Class_GET_BY_ID(uris:[
        "http://purl.obolibrary.org/obo/GO_0032259",
        "http://purl.obolibrary.org/obo/GO_0007267"]) {
        id
        label
        subClassOf {
          id
          label
          subClassOf {
            id
            label
          }
        }
      }
    }
    

    I got some interesting results:

    {
      "extensions": {},
      "data": {
        "@context": {
          "_type": "@type",
          "_id": "@id",
          "id": "http://www.geneontology.org/formats/oboInOwl#id",
          "label": "http://www.w3.org/2000/01/rdf-schema#label",
          "Class_GET_BY_ID": "http://hypergraphql.org/query/Class_GET_BY_ID",
          "subClassOf": "http://www.w3.org/2000/01/rdf-schema#subClassOf"
        },
        "Class_GET_BY_ID": [
          {
            "id": [
              "GO:0032259"
            ],
            "label": [
              "methylation"
            ],
            "subClassOf": [
              {
                "id": [
                  "GO:0008152"
                ],
                "label": [
                  "metabolic process"
                ],
                "subClassOf": [
                  {
                    "id": [
                      "GO:0008150"
                    ],
                    "label": [
                      "biological_process"
                    ]
                  }
                ]
              }
            ]
          },
          {
            "id": [
              "GO:0007267"
            ],
            "label": [
              "cell-cell signaling"
            ],
            "subClassOf": [
              {
                "id": [
                  "GO:0007154"
                ],
                "label": [
                  "cell communication"
                ],
                "subClassOf": [
                  {
                    "id": [
                      "GO:0009987"
                    ],
                    "label": [
                      "cellular process"
                    ]
                  }
                ]
              },
              {
                "id": [
                  "GO:0023052"
                ],
                "label": [
                  "signaling"
                ],
                "subClassOf": [
                  {
                    "id": [
                      "GO:0008150"
                    ],
                    "label": [
                      "biological_process"
                    ]
                  }
                ]
              }
            ]
          }
        ]
      },
      "errors": []
    }
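To turn a nested response like the one above into the `A / B / C` paths asked for in the question, the `subClassOf` arrays can be walked recursively. This is a sketch only: `paths_from_node` is a hypothetical helper, and `sample` is a trimmed, hand-copied version of the GO:0007267 entry (labels omitted).

```python
def paths_from_node(node):
    """Return root-first ID paths for one entry of the nested response."""
    node_id = node["id"][0]                  # each "id" field is a one-element list
    parents = node.get("subClassOf", [])     # absent at the deepest queried level
    if not parents:
        return [[node_id]]
    return [
        path + [node_id]
        for parent in parents
        for path in paths_from_node(parent)
    ]

# Trimmed copy of the GO:0007267 entry from the response above.
sample = {
    "id": ["GO:0007267"],
    "subClassOf": [
        {"id": ["GO:0007154"], "subClassOf": [{"id": ["GO:0009987"]}]},
        {"id": ["GO:0023052"], "subClassOf": [{"id": ["GO:0008150"]}]},
    ],
}

for path in paths_from_node(sample):
    print(" / ".join(path))
# GO:0009987 / GO:0007154 / GO:0007267
# GO:0008150 / GO:0023052 / GO:0007267
```

Note that a path stops wherever the query's nesting stops, so the first branch ends at GO:0009987 and does not reach GO:0008150.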
    

EDIT

This was exactly what I wanted, but I noticed that I can't add another sublevel like this:

{
  Class_GET_BY_ID(uris:[
    "http://purl.obolibrary.org/obo/GO_0032259",
    "http://purl.obolibrary.org/obo/GO_0007267"]) {
    id
    label
    subClassOf {
      id
      label
      subClassOf {
        id
        label
        subClassOf {  # <--- 4th sublevel
          id
          label
        }
      }
    }
  }
}
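Since the nesting has to be written out level by level, the query text can be generated instead of typed by hand. The helper below is a hypothetical sketch (`build_query` is not part of HyperGraphQL); it only produces the query string and does not address the failure described above.

```python
def build_query(uris, levels):
    """Build a Class_GET_BY_ID query with `levels` nested Class levels."""

    def selection(level, pad):
        # Fields requested at every level.
        lines = [f"{pad}id", f"{pad}label"]
        if level > 1:
            # Recurse into the parent classes for the remaining levels.
            lines.append(f"{pad}subClassOf {{")
            lines.extend(selection(level - 1, pad + "  "))
            lines.append(f"{pad}}}")
        return lines

    uri_list = ",\n    ".join(f'"{u}"' for u in uris)
    body = "\n".join(selection(levels, "    "))
    return f"{{\n  Class_GET_BY_ID(uris:[\n    {uri_list}]) {{\n{body}\n  }}\n}}"

print(build_query(["http://purl.obolibrary.org/obo/GO_0007267"], levels=3))
```

With `levels=3` this yields the same shape as the working query; `levels=4` yields the shape of the failing one.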

I created a new question: "Endpoint returned Content-Type: text/html which is not recognized for SELECT queries"

TallTed
Bilal