I am trying to get Wikipedia infobox data from Wikidata's API for a number of companies. For example, Deliveroo:

https://www.wikidata.org/w/api.php?action=wbgetentities&format=jsonfm&sites=enwiki&titles=Deliveroo&props=info%7Clabels%7Cdescriptions%7Cclaims&languages=en

The JSON the API returns (actually JSON embedded in HTML in this case, because of format=jsonfm; use format=json for pure JSON) is missing some data from the Wikipedia page, like "Industry: Online food ordering, Food delivery". Is there any way to find this data with Wikidata? Also, the data that is returned uses codes in place of attribute names. For example, for the "Founded" attribute in the Wikipedia infobox, Wikidata has:

"mainsnak": {
                            "snaktype": "value",
                            "property": "P571",
                            "hash": "7f617d23c9e1f8b6ce23c06baf4d3bdad9b4fbb9",
                            "datavalue": {
                                "value": {
                                    "time": "+2013-00-00T00:00:00Z",
                                    "timezone": 0,
                                    "before": 0,
                                    "after": 0,
                                    "precision": 9,
                                    "calendarmodel": "http://www.wikidata.org/entity/Q1985727"
                                },
                                "type": "time"
                            },
                            "datatype": "time"
                        },

I am guessing that "property": "P571" refers to the founded attribute, but I am not sure how to map these codes to the actual text names. Any help would be greatly appreciated.

Max888
  • I'm not sure, but I think an API solution will be painful (you will probably have to query the API again to translate the coded properties into a human language). For these tasks it's much easier to build a query with [SPARQL](https://query.wikidata.org/). For example, visit [this query](https://w.wiki/Uif) and open "edit SPARQL" on the right side. It simply gets all companies in the food industry and prints their location and items of operation, if available. You can get the results in JSON and other formats. –  Jun 20 '20 at 23:26
  • Thanks @PetrKajzar. I don't suppose you would be able to provide a SPARQL example of getting nicely formatted data of an infobox? I've come across SPARQL before and it is much more advanced than my skills. – Max888 Jun 20 '20 at 23:35
  • Well, maybe... what companies do you need to get? Is that only a list of some companies or do you want all companies with certain characteristics (e.g. food industry in GB)? –  Jun 21 '20 at 06:09
  • It's not necessarily the same data. See [this answer](https://stackoverflow.com/questions/33862336/how-to-extract-information-from-a-wikipedia-infobox/33862337#33862337) for more details / other options. – Tgr Jun 21 '20 at 09:44
  • @PetrKajzar I need to get all tech companies in the UK. Each one might have different fields in their infobox but I would like to collect all the data. Is it not possible to just get whatever data is available? Or do you need to know what fields there are up front and specify them in the SPARQL query? – Max888 Jun 21 '20 at 12:41
  • @Tgr Thanks, that question is actually what I used to get myself to the point of asking this question. It seems to only get you half way (for beginners like myself) - it points you to use the structured data from Wikidata or DBpedia, but doesn't provide any examples of getting the data simply formatted in JSON. I think a lot of users, like myself, will be hoping to be able to simply get all the data from the infobox in simply formatted JSON, without any RDF information included. But maybe this is naive and it is not possible to simplify RDF like this? – Max888 Jun 21 '20 at 12:47
  • I don't know of any solution that would extract all the fields from the infoboxes. Some infoboxes in Wikipedia are populated from Wikidata, but some are maintained manually and have a different structure (this is what @Tgr says). To get insight into what is available, you can check the [Deliveroo Wikidata item](https://www.wikidata.org/wiki/Q22000919). As you can see, there is no information about UK or "tech". Generally speaking, you can filter items and get values of some properties. But it is unfortunately not possible (or very difficult) to get all the values for all the companies. –  Jun 21 '20 at 13:43
  • @PetrKajzar Thanks. The best solution I have found so far is to query DBpedia with `http://dbpedia.org/data/Deliveroo.json`, and then extract only the information that I want. The more I think about it, it makes sense that DBpedia and Wikidata won't know exactly which data is Wikipedia infobox data, and will simply give you all the data they have. – Max888 Jun 21 '20 at 14:04
  • @Max888 DBpedia data is extracted from Wikipedia infoboxes (alongside some other sources). Wikidata generally isn't, although sometimes people do copy information from Wikipedia infoboxes to Wikidata, and some Wikipedia infoboxes do pull information from Wikidata. So if you want specifically the data that's in the infoboxes, you should use DBpedia. Wikidata has somewhat different information; whether that is better or worse depends on your use case. – Tgr Jun 21 '20 at 15:58
  • As for getting Wikidata information, I'm not sure there's an easier way. There are all sorts of client libraries, though (e.g. [wptools](https://github.com/siznax/wptools/wiki/Wikidata)); also, [Wikidata View](https://tools.wmflabs.org/hay/wdview/) has a pretty user-friendly API, and while it is an experimental service and the API is not advertised anymore for use by others, in theory you could set it up locally, I think. – Tgr Jun 21 '20 at 16:07

1 Answer

Wikidata is not guaranteed to contain all the data that Wikipedia infoboxes do. Many Wikipedia communities have decided to consume Wikidata in their infoboxes, but not all of them (notably, the English Wikipedia is known for not using Wikidata data). Even Wikipedias that do use data from Wikidata don't need to use all of it, and they can still decide to fill in some of the data manually.
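
On the codes in the question: identifiers such as P571 are Wikidata property IDs, and their human-readable names can be requested from the same `wbgetentities` endpoint by passing the IDs with `props=labels`. Below is a minimal sketch of that lookup, not a definitive implementation. The helper names (`label_request_url`, `labels_from_response`, `fetch_labels`) and the `User-Agent` string are made up for illustration; only the endpoint and its `action`, `ids`, `props`, `languages` and `format` parameters come from the API itself.

```python
import json
import urllib.parse
import urllib.request

API = "https://www.wikidata.org/w/api.php"


def label_request_url(property_ids, language="en"):
    """Build a wbgetentities URL that asks only for the labels of the given IDs."""
    params = {
        "action": "wbgetentities",
        "format": "json",  # plain JSON, not the HTML-wrapped jsonfm
        "ids": "|".join(sorted(property_ids)),
        "props": "labels",
        "languages": language,
    }
    return API + "?" + urllib.parse.urlencode(params)


def labels_from_response(response, language="en"):
    """Map each ID in a wbgetentities response to its label.

    Falls back to the ID itself when no label exists in the requested language.
    """
    return {
        entity_id: entity.get("labels", {}).get(language, {}).get("value", entity_id)
        for entity_id, entity in response.get("entities", {}).items()
    }


def fetch_labels(property_ids, language="en"):
    """Perform the network request (hypothetical helper, not run in this sketch)."""
    request = urllib.request.Request(
        label_request_url(property_ids, language),
        headers={"User-Agent": "infobox-label-example/0.1"},  # placeholder value
    )
    with urllib.request.urlopen(request) as reply:
        return labels_from_response(json.load(reply), language)
```

To use it, collect the keys of the `claims` object from the original response (for example `P571`) and pass them to `fetch_labels`; the returned dictionary can then be used to rename the keys. If there are many properties, the IDs would need to be sent in batches, since the API limits how many can be requested at once.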

If you want to use only data from the infoboxes, perhaps https://dbpedia.org is a better option?
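
As a rough sketch of that route: the DBpedia URL mentioned in the comments (`http://dbpedia.org/data/Deliveroo.json`) returns RDF serialised as JSON, keyed first by resource URI and then by predicate URI. The function below flattens one resource into a plain dictionary with short key names. The names `simplify` and `fetch_resource` are hypothetical, and the sample input in the usage note is hand-written to show the shape, not copied from DBpedia.

```python
import json
import urllib.request


def simplify(rdf_json, resource_uri, language="en"):
    """Flatten one resource of an RDF/JSON document into {short_name: [values]}.

    The short name is the last path segment or fragment of the predicate URI,
    so predicates from different namespaces with the same ending are merged.
    Literals tagged with a language other than `language` are skipped.
    """
    flat = {}
    for predicate, objects in rdf_json.get(resource_uri, {}).items():
        key = predicate.rstrip("/").rsplit("/", 1)[-1].rsplit("#", 1)[-1]
        for obj in objects:
            if obj.get("lang", language) != language:
                continue
            flat.setdefault(key, []).append(obj["value"])
    return flat


def fetch_resource(name, language="en"):
    """Download and flatten one DBpedia resource (network call, not run here)."""
    url = "http://dbpedia.org/data/" + name + ".json"
    with urllib.request.urlopen(url) as reply:
        data = json.load(reply)
    return simplify(data, "http://dbpedia.org/resource/" + name, language)
```

For example, given an illustrative document of the form `{"http://dbpedia.org/resource/Deliveroo": {"http://dbpedia.org/property/industry": [{"type": "literal", "value": "Online food ordering", "lang": "en"}]}}`, `simplify` returns `{"industry": ["Online food ordering"]}`. Which fields are present will differ from company to company, so the caller still has to pick out the keys it cares about.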

Martin Urbanec