Identify entity of a Wikipedia page

Question

My question is related to a similar question/comment which unfortunately never received an answer.

Given a list of multiple Wikipedia pages, e.g.:

how can I find out what type of entity each of these articles refers to? Ideally, I would like a higher-level category, e.g. person, movie, or animal.

My best guess so far was to query Wikidata with SPARQL and walk up the "instance of" (P31) or "subclass of" (P279) tree. However, this did not lead to meaningful results.

SELECT ?lemma ?item ?itemLabel ?itemDescription ?instance ?instanceLabel ?subclassLabel WHERE {
  VALUES ?lemma {
    "Donald Trump"@en
    "The Matrix"@en
    "Tiger"@en
  }
  ?sitelink schema:about ?item;
    schema:isPartOf <https://en.wikipedia.org/>;
    schema:name ?lemma.
  ?item wdt:P31* ?instance.
  ?item wdt:P279* ?subclass.
  SERVICE wikibase:label { 
    bd:serviceParam wikibase:language "en,da,sv".}
}

The result can be seen here: https://w.wiki/ZmQ

One option would of course be to look at the itemDescription, but I'm afraid it is too granular to build meaningful groups from larger lists and count frequencies later on. Does anyone have a hint on how to get more general entity categories, perhaps also from the MediaWiki API?

Any input would be highly appreciated!

Matthias Winkelmann · Accepted Answer · 2020-08-18T16:52:29.673

Here are three possibilities, side-by-side:

SELECT ?lemma ?item
  (GROUP_CONCAT(DISTINCT ?instanceLabel; SEPARATOR = " ") AS ?a)
  (GROUP_CONCAT(DISTINCT ?subclassLabel; SEPARATOR = " ") AS ?b)
  (GROUP_CONCAT(DISTINCT ?isaLabel; SEPARATOR = " ") AS ?c)
WHERE {
  VALUES ?lemma {
    "Donald Trump"@en
    "The Matrix"@en
    "Tiger"@en
  }
  ?sitelink schema:about ?item;
    schema:isPartOf <https://en.wikipedia.org/>;
    schema:name ?lemma.
  OPTIONAL { ?item (wdt:P31/(wdt:P279*)) ?instance. }
  OPTIONAL { ?item wdt:P279 ?subclass. }
  OPTIONAL { ?item wdt:P31 ?isa. }
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en,da,sv".
    ?instance rdfs:label ?instanceLabel.
    ?subclass rdfs:label ?subclassLabel.
    ?isa rdfs:label ?isaLabel.
  }
  # Here, you could add: FILTER(?instanceLabel in ("mammal"@en, "movie"@en, "musical"@en (and so on...)))
}
GROUP BY ?lemma ?item

Live here.
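For larger lists you will probably want to run the query from a script rather than the web interface. Below is a minimal sketch using only the Python standard library, assuming the public endpoint at https://query.wikidata.org/sparql and its JSON result format. The function names and the User-Agent string are made up for illustration; adjust them to your setup.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Public Wikidata SPARQL endpoint.
ENDPOINT = "https://query.wikidata.org/sparql"


def build_query_url(sparql: str) -> str:
    """Return a GET URL that asks the endpoint for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": sparql, "format": "json"})


def run_query(sparql: str, user_agent: str = "entity-type-example/0.1"):
    """Send the query and flatten each result binding to {variable: value}.

    The default user_agent is a placeholder; use one that identifies you.
    """
    request = Request(build_query_url(sparql), headers={"User-Agent": user_agent})
    with urlopen(request) as response:
        data = json.load(response)
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in data["results"]["bindings"]
    ]
```

Calling run_query with the query text above should give you one dictionary per result row, with keys such as "lemma", "item", "a", "b" and "c".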

If you are only interested in a limited set of labels such as "film" and "mammal" (a couple dozen at most), you could list them explicitly in order of preference and use the first one that occurs for each item.
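The "order of preference" idea could look roughly like this on the client side. This is only a sketch: the preference list is a made-up example, and it assumes you already have the labels for an item as a list (for instance by changing the GROUP_CONCAT separator to a character such as "|" that does not occur in labels, and splitting on it).

```python
from typing import Iterable, Optional, Sequence

# Hypothetical preference list, most desired category first.
PREFERRED_LABELS = ["human", "film", "mammal", "animal", "taxon"]


def pick_category(
    labels: Iterable[str],
    preferred: Sequence[str] = PREFERRED_LABELS,
) -> Optional[str]:
    """Return the first preferred label found among `labels`, else None."""
    # Normalise case and whitespace so "Film" matches "film".
    present = {label.strip().lower() for label in labels}
    for candidate in preferred:
        if candidate in present:
            return candidate
    return None
```

For example, pick_category(["taxon", "mammal"]) picks "mammal" because it comes before "taxon" in the preference list, and an item with none of the listed labels yields None so you can count it separately.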

Note that you may be running into this bug: https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial#wikibase:Label_and_aggregations_bug

edited Aug 18 '20 at 16:52

answered Aug 18 '20 at 02:18

Matthias Winkelmann


Wow, much more helpful than what I could come up with, thanks! This looks like a very advanced query ... could you explain what the part between SELECT and VALUES (i.e. GROUP_CONCAT ...) does? – meier_flo Aug 18 '20 at 07:14
Also, could the list of preferred labels be done with a filter? – meier_flo Aug 18 '20 at 07:21
It's an aggregate function, similar to those in SQL. You create groups of bindings (rows, in SQL terms) and then apply an aggregate, which always returns exactly one binding (row) per group. That's it. But I don't see how this helps you here. I thought you wanted to get rid of those meta-classes etc. The aggregation could also have been done in the client/application code, couldn't it? – UninformedUser Aug 18 '20 at 09:57
Yes, it's pretty close to your original version. The OPTIONAL helps in some cases, and I believe that bug was the major problem. I had something with FILTER as well, will add it to the answer. – Matthias Winkelmann Aug 18 '20 at 16:50
Thank you both for more clarification on the matter! – meier_flo Aug 19 '20 at 10:45
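To illustrate the point from the comments that the aggregation could also be done in client code: the sketch below mimics GROUP BY ?lemma ?item with GROUP_CONCAT(DISTINCT ...) on plain, ungrouped result rows. The row dictionaries and the item identifiers in them are made-up placeholders, not real query output.

```python
from collections import defaultdict


def group_labels(rows, label_key="instanceLabel", separator=" "):
    """Group rows by (lemma, item) and join the distinct labels per group."""
    grouped = defaultdict(list)
    for row in rows:
        # Touch the group first so items without any label still appear.
        labels = grouped[(row["lemma"], row["item"])]
        label = row.get(label_key)
        if label and label not in labels:
            labels.append(label)
    return {key: separator.join(labels) for key, labels in grouped.items()}


# Placeholder rows, as an ungrouped query might return them.
rows = [
    {"lemma": "Tiger", "item": "item-tiger", "instanceLabel": "taxon"},
    {"lemma": "Tiger", "item": "item-tiger", "instanceLabel": "organism"},
    {"lemma": "Tiger", "item": "item-tiger", "instanceLabel": "taxon"},
    {"lemma": "The Matrix", "item": "item-matrix", "instanceLabel": "film"},
    {"lemma": "Unlabelled", "item": "item-x"},
]
result = group_labels(rows)
```

Each (lemma, item) pair ends up with exactly one entry, which is the same "one row per group" behaviour the aggregate gives you inside the query.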
