
I have set up a Virtuoso server hosting Freebase data (version 07.20.3217, built Jan 5 2017; I would really appreciate it if you could give it a try).

Consider this scenario: find the largest location by area (probably a county, denoted by ?var1) in the state of Wisconsin (fb:m.0824r), where ?var1 contains at least one location (denoted by ?var2) of type fb:location.place_with_neighborhoods.

I wrote the SPARQL query as follows:

PREFIX fb: <http://rdf.freebase.com/ns/> 
SELECT DISTINCT ?var1 ?var2 ?v2_name WHERE {
             fb:m.0824r  fb:location.location.contains  ?var1 . 
             ?var1       fb:location.location.contains  ?var2 . 
             ?var2       fb:type.object.type            fb:location.place_with_neighborhoods . 
             ?var1       fb:location.location.area      ?area .
  OPTIONAL { ?var2       fb:type.object.name            ?v2_name } .
} ORDER BY DESC(?area) 
LIMIT 1

Unfortunately, the Virtuoso engine failed to return a result even after more than an hour.

I tried some simpler queries, which produce results in about one second or less:

PREFIX fb: <http://rdf.freebase.com/ns/> 
SELECT DISTINCT ?var1 ?var2 ?v2_name WHERE {
             fb:m.0824r  fb:location.location.contains  ?var1 . 
             ?var1       fb:location.location.contains  ?var2 . 
             ?var2       fb:type.object.type            fb:location.place_with_neighborhoods . 
  OPTIONAL { ?var2       fb:type.object.name            ?v2_name } .    
}
# Remove the area-related information with ?var1
# Returns ONLY ONE result in 0.05s.

and,

PREFIX fb: <http://rdf.freebase.com/ns/> 
SELECT DISTINCT ?var1 ?var2 ?v2_name ?area WHERE {
             fb:m.0824r  fb:location.location.contains  ?var1 . 
             ?var1       fb:location.location.contains  ?var2 . 
             ?var1       fb:location.location.area      ?area .
  OPTIONAL { ?var2       fb:type.object.name            ?v2_name } .
}
# Remove the type limitation of ?var2
# Returns ~7000 results in ~1s.

Given the results of those simpler queries, I'm really confused about which step causes the performance problem. Can anybody give me some advice? Thank you so much!

TallTed
Kangqi Luo
  • The result will be empty as there is no relation `fb:location.location.area` for the single result returned when removing the triple pattern with this property. The only property for `http://rdf.freebase.com/ns/m.05_jhl` is the `fb:location.location.contains` property. – UninformedUser Feb 11 '18 at 09:14
  • Remove `OPTIONAL`, enable `&explain=on`. – Stanislav Kralin Feb 11 '18 at 10:04
  • @AKSW Yes, you are right, I had not noticed this! So the problem becomes more interesting: how could Virtuoso hang for so long when the query should apparently return an empty result? – Kangqi Luo Feb 11 '18 at 10:57
  • @StanislavKralin Awesome! The query works after the removal, and I'm really curious why. And where can I set "&explain=on" ? – Kangqi Luo Feb 11 '18 at 11:01
  • @KangqiLuo, in general, `OPTIONAL` is slow. As to `explain`, perhaps this option is DBpedia specific, but one can call the `explain()` function somewhere in console. – Stanislav Kralin Feb 11 '18 at 12:51
  • Yes, OPTIONAL is basically a left-join. In general, query optimization can be a hard task, sometimes it helps to reorder the triple patterns (only the ones that are commutative). Some triple stores also do have some `hint` features, not sure if Virtuoso has something similar. But, you should ask on the mailing list or open a Github issue. The dataset seems to be small, especially the intermediate results. Maybe they could use your example to improve the query optimizer. – UninformedUser Feb 11 '18 at 14:16
  • You can see how to analyze the query [here](http://vos.openlinksw.com/owiki/wiki/VOS/VirtRDFPerformanceTuning#Erroneous%20Cost%20Estimates%20and%20Explicit%20Join%20Order). To be honest, though, it's pretty hard to understand for inexperienced users or developers. You could still attach the output to your mailing list thread or GitHub issue. – UninformedUser Feb 11 '18 at 14:22
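The comments above mention reordering triple patterns and optimizer hints. The Virtuoso tuning page linked in the last comment describes a `DEFINE sql:select-option "order"` pragma that asks the engine to join patterns in the order they are written. The following is only a sketch, assuming that pragma is available in this build; whether it avoids the slowdown reported here has not been tested.

```sparql
# Sketch only: ask Virtuoso to keep the written join order (Virtuoso-specific pragma).
DEFINE sql:select-option "order"
PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT DISTINCT ?var1 ?var2 ?v2_name WHERE {
  # Patterns are listed in the same order as in the question's original query.
  fb:m.0824r  fb:location.location.contains  ?var1 .
  ?var1       fb:location.location.contains  ?var2 .
  ?var2       fb:type.object.type            fb:location.place_with_neighborhoods .
  ?var1       fb:location.location.area      ?area .
  OPTIONAL { ?var2  fb:type.object.name  ?v2_name }
}
ORDER BY DESC(?area)
LIMIT 1
```

The pragma is not standard SPARQL, so other triple stores will typically reject or ignore it.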

1 Answer


As noted on the issue you raised on the project --

There appears to be a query plan issue with OPTIONAL when the rest of the query produces no solution, as removing only that clause from your initial query brings results near instantly.

Removing the ?var1 fb:location.location.area ?area pattern (and therefore, the ORDER BY DESC(?area)), which is what reduces the solution set to zero, likewise brings near instant results.
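Until the plan issue is resolved, one possible workaround is to keep the required patterns in a subquery and apply `OPTIONAL` only to its result. This is a sketch under that assumption, not a confirmed fix: it uses standard SPARQL 1.1 subqueries, and whether Virtuoso plans it differently from the original query has not been verified here.

```sparql
PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT ?var1 ?var2 ?v2_name WHERE {
  {
    # Inner query: required patterns only, ordered and limited as in the original.
    SELECT DISTINCT ?var1 ?var2 ?area WHERE {
      fb:m.0824r  fb:location.location.contains  ?var1 .
      ?var1       fb:location.location.contains  ?var2 .
      ?var2       fb:type.object.type            fb:location.place_with_neighborhoods .
      ?var1       fb:location.location.area      ?area .
    }
    ORDER BY DESC(?area)
    LIMIT 1
  }
  # The optional name lookup runs only against the inner result.
  OPTIONAL { ?var2  fb:type.object.name  ?v2_name }
}
```

Unlike the original, this form can return more than one row if the selected ?var2 has several names; add an outer `LIMIT 1` if a single row is required.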

This issue will be raised with Development for their analysis.

TallTed