
I'm working on a project that uses data from Wikidata, and I noticed a lot of duplicate elements in my database. The reason is that I'm receiving population numbers for different points in time.

I've read that Wikidata ranks statements when a property has multiple values, and that for the population property the preferred value seems to be the most recent one, which is true for about 99.9% of the entries. What I don't understand is why it doesn't work for the other 0.1%.

One example would be: Wikidata query

The same happens for example with the elements

and I have no idea why.

I've already tried the solution from this topic but it didn't change the result.

Any ideas?


Edit based on the filter option from the thread: wikidata query 2

Edit 2: Full query

Stanislav Kralin
Wajora
  • Related: https://stackoverflow.com/q/49066390/7879193 – Stanislav Kralin Aug 17 '18 at 15:51
  • I tried that already and that didn't work – Wajora Aug 17 '18 at 16:43
  • @Wajora "didn't work" is not a good description for what you tried exactly, what you got and what you expected...please provide details to reproduce it – UninformedUser Aug 17 '18 at 17:51
  • I'm sorry, I edited the post and added what I tried based on your suggestion. If I understood your suggestion correctly, this is what should've fixed it, but it still selects duplicates – Wajora Aug 17 '18 at 18:10
  • Your revised query includes `filter (?date > ?population)` which seems nonsensical. I suggest you also rework your indentation to make the query structure clearer. – TallTed Aug 17 '18 at 18:19

1 Answer


Some Wikidata properties are processed by PreferentialBot (source code).

In short, the bot makes the most recent statements preferred, hence making them truthy.

However, the bot does not process every item. In particular, it skips an item if any of its statements for the property lacks the relevant qualifier (for population, P1082, that is the point in time qualifier, P585).

In your particular case:

SELECT DISTINCT ?city ?cityLabel ?population ?date ?rank WHERE {
  VALUES (?settlement) {(wd:Q515) (wd:Q15284)}
  VALUES (?city) {(wd:Q1658752)}
  ?city wdt:P31/wdt:P279* ?settlement . 
  ?city p:P1082 ?statement .
  ?statement ps:P1082 ?population .
  ?statement wikibase:rank ?rank .
  OPTIONAL { ?statement pq:P585 ?date }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
} ORDER BY ?date

Try it

Results:

+-------------+-----------+------------+----------------------+---------------------+
|    city     | cityLabel | population |        date          |         rank        |
+-------------+-----------+------------+----------------------+---------------------+
| wd:Q1658752 | Kagan     |      86745 |                      | wikibase:NormalRank |
| wd:Q1658752 | Kagan     |      17656 | 1939-01-01T00:00:00Z | wikibase:NormalRank |
| wd:Q1658752 | Kagan     |      21103 | 1959-01-01T00:00:00Z | wikibase:NormalRank |
| wd:Q1658752 | Kagan     |      34117 | 1970-01-01T00:00:00Z | wikibase:NormalRank |
| wd:Q1658752 | Kagan     |      41565 | 1979-01-01T00:00:00Z | wikibase:NormalRank |
| wd:Q1658752 | Kagan     |      48054 | 1989-01-01T00:00:00Z | wikibase:NormalRank |
+-------------+-----------+------------+----------------------+---------------------+

Would you prefer the most recent statement or the "eternal" one (the one with no point in time qualifier)?

This is how you can find the most recent population:

SELECT DISTINCT ?city ?cityLabel ?population WHERE {
  VALUES (?settlement) {(wd:Q515) (wd:Q15284)}
  VALUES (?city) {(wd:Q1658752)}
  ?city wdt:P31/wdt:P279* ?settlement . 
  ?city p:P1082 [ ps:P1082 ?population; pq:P585 ?date1 ] .
  FILTER NOT EXISTS {
    ?city p:P1082 [ pq:P585 ?date2 ] .
    FILTER (?date2 > ?date1)
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
}

Try it

This is how you can find the "eternal" one:

SELECT DISTINCT ?city ?cityLabel ?population WHERE {
  VALUES (?settlement) {(wd:Q515) (wd:Q15284)}
  VALUES (?city) {(wd:Q1658752)}
  ?city wdt:P31/wdt:P279* ?settlement . 
  ?city p:P1082 ?statement .
  ?statement ps:P1082 ?population .
  FILTER NOT EXISTS {?statement pq:P585 []}
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }   
}

Try it
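
If you would rather keep the first query and deduplicate on your side before writing to your database, both selections can be sketched in a few lines of Python. This is only an illustration: the function names and the row layout (dicts with "population" and "date" keys, with `None` for a missing P585 qualifier) are assumptions, and the data is the Kagan result table from above.

```python
def most_recent_population(rows):
    """Return the row with the latest date, or None if no row is dated."""
    dated = [row for row in rows if row["date"] is not None]
    if not dated:
        return None
    # ISO 8601 timestamps in the same format sort correctly as plain strings.
    return max(dated, key=lambda row: row["date"])


def undated_populations(rows):
    """Return the values of the "eternal" statements (no P585 qualifier)."""
    return [row["population"] for row in rows if row["date"] is None]


# Rows for wd:Q1658752 (Kagan), copied from the result table above.
kagan_rows = [
    {"population": 86745, "date": None},
    {"population": 17656, "date": "1939-01-01T00:00:00Z"},
    {"population": 21103, "date": "1959-01-01T00:00:00Z"},
    {"population": 34117, "date": "1970-01-01T00:00:00Z"},
    {"population": 41565, "date": "1979-01-01T00:00:00Z"},
    {"population": 48054, "date": "1989-01-01T00:00:00Z"},
]

print(most_recent_population(kagan_rows)["population"])  # 48054
print(undated_populations(kagan_rows))                   # [86745]
```

Unlike the FILTER NOT EXISTS query, `max` returns a single row even when two statements share the latest date, so ties would need separate handling.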


In fact, almost 70% (not 0.1%) of entries with the P1082 property have no preferred statement for this property. What you probably mean is entries with the P1082 property that have more than one truthy statement for it. Recall that:

Truthy statements represent statements that have the best non-deprecated rank for given property. Namely, if there is a preferred statement for property P2, then only preferred statements for P2 will be considered truthy. Otherwise, all normal-rank statements for P2 are considered truthy.
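
The rule quoted above can be modelled as a small function. This is a sketch for illustration, not Wikidata's actual implementation; the `Statement` tuple and its field names are assumptions, while the rank names follow the `wikibase:` ranks shown in the result table.

```python
from collections import namedtuple

# Hypothetical minimal model of a statement; the field names are assumptions.
Statement = namedtuple("Statement", ["value", "rank", "date"])


def truthy(statements):
    """Return the statements with the best non-deprecated rank."""
    preferred = [s for s in statements if s.rank == "PreferredRank"]
    if preferred:
        # Any preferred statement hides all normal-rank ones.
        return preferred
    return [s for s in statements if s.rank == "NormalRank"]


# Kagan (Q1658752): all six population statements have normal rank.
kagan = [
    Statement(86745, "NormalRank", None),
    Statement(17656, "NormalRank", "1939-01-01"),
    Statement(21103, "NormalRank", "1959-01-01"),
    Statement(34117, "NormalRank", "1970-01-01"),
    Statement(41565, "NormalRank", "1979-01-01"),
    Statement(48054, "NormalRank", "1989-01-01"),
]

print(len(truthy(kagan)))  # 6
```

Because no statement is preferred, all six are truthy, which is why a `wdt:P1082` query returns six rows for this item.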

And yes, about 0.5% of the entries that have P1082 statements have two or more truthy P1082 statements.
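
To see which items in your own data are affected, you could group the rows of a plain `wdt:P1082` query by item and keep those with more than one distinct value. A minimal sketch, assuming the rows arrive as (item, population) pairs; the function name is hypothetical and the second item below is a made-up placeholder, not real Wikidata data.

```python
from collections import Counter


def items_with_multiple_truthy(rows):
    """Return the items that occur with more than one distinct value.

    rows: iterable of (item, population) pairs from a wdt:P1082 query.
    """
    # set() drops exact duplicate rows before counting values per item.
    counts = Counter(item for item, _ in set(rows))
    return sorted(item for item, n in counts.items() if n > 1)


rows = [
    ("Q1658752", 86745),    # Kagan, from the result table above
    ("Q1658752", 48054),    # Kagan, from the result table above
    ("Q_PLACEHOLDER", 100), # made-up item with a single truthy value
]

print(items_with_multiple_truthy(rows))  # ['Q1658752']
```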

Stanislav Kralin
  • I realized that I still didn't quite understand why PreferentialBot doesn't process the date in this case. You said it doesn't process items that have statements without respective qualifiers, but isn't the point in time date a qualifier in this case? – Wajora Aug 22 '18 at 14:37
  • Statement with value 86,745 has no P585 qualifier: https://www.wikidata.org/wiki/Q1658752#Q1658752$BADE429F-21B8-44EE-98BC-F5A2ABEE4988 – Stanislav Kralin Aug 22 '18 at 14:41
  • @Wajora, hence, the bot doesn't process all statements with P1082 property. You could edit my answer in order to make that more clear, my English is poor... – Stanislav Kralin Aug 22 '18 at 15:16
  • For comparison, replace `wd:Q1658752` with `wd:Q887` in my first query. – Stanislav Kralin Aug 22 '18 at 15:24