I have a list of Wikipedia users and the articles that they edited.
I'm trying to build a hierarchical profile for each one of them.

The problem is that I'm struggling to get parent categories for each article.
For example, for an article about Pizza I want to get "dishes" or "food".
I'm using Jena and YAGO with a simple SPARQL query that looks like this:

String sparqlQueryString = "BASE <http://yago-knowledge.org/resource/> "
        + "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> "
        + "SELECT ?supercat WHERE { "
        + "<" + child + "> rdf:type ?supercat . "
        + "}";

(where `child` is the article)
So I wanted to ask whether anybody knows how to get a correct parent category from that. Also, does anyone have ideas about how to organize the article titles and their parents into a hierarchical profile of user interests?

svick
paskun
  • That's not a SPARQL query; that's Java code that *might* construct a SPARQL query, or might construct something else, depending on what the value of `child` is. – Joshua Taylor Nov 03 '14 at 17:40
  • An article isn't a category, so it doesn't have a parent category. Do you mean that you want the categories to which an article belongs? – Joshua Taylor Nov 03 '14 at 17:41
  • Yeah, that's a string constructing a SPARQL query, but you know what I meant. And yes, I want to get the parent category of an article, but I'm getting a lot of parent categories, and I would like to know if it's possible to get better results. For example, for an article like Pizza, simply get "food" or "dishes". Thanks – paskun Nov 03 '14 at 17:46
  • Yes, except if `child` is coming from user input, what happens when the value of `child` contains a space, or something that contains SPARQL code? Making queries like this is subject to injection attacks, just like SQL. If you're using Jena, it's a good idea to use parameterized strings (e.g., see [Using ParameterizedSparqlStrings in SELECT queries](http://stackoverflow.com/a/16739846/1281433)). – Joshua Taylor Nov 03 '14 at 17:49
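
To illustrate the injection concern raised in the comments, here is a minimal stdlib-only sketch that checks a value before placing it between `<` and `>` in the query string. The class and method names (`SafeTypeQuery`, `requireSafeIri`, `buildTypeQuery`) are hypothetical, and the sketch assumes `child` is passed as an absolute IRI rather than resolved against a `BASE`. It is not a substitute for Jena's parameterized strings mentioned above, which are likely the more robust route.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class SafeTypeQuery {
    private static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Returns the IRI unchanged if it can safely sit between '<' and '>'; otherwise throws. */
    static String requireSafeIri(String iri) {
        // Reject whitespace, control characters, and the characters SPARQL forbids inside <...>.
        for (int i = 0; i < iri.length(); i++) {
            char c = iri.charAt(i);
            if (c <= 0x20 || "<>\"{}|^`\\".indexOf(c) >= 0) {
                throw new IllegalArgumentException("Illegal character in IRI: " + iri);
            }
        }
        try {
            // Assumption of this sketch: only absolute IRIs (with a scheme) are accepted.
            if (!new URI(iri).isAbsolute()) {
                throw new IllegalArgumentException("Not an absolute IRI: " + iri);
            }
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Malformed IRI: " + iri, e);
        }
        return iri;
    }

    /** Builds the rdf:type query from the question for one validated article IRI. */
    static String buildTypeQuery(String articleIri) {
        return "PREFIX rdf: <" + RDF + "> "
                + "SELECT ?supercat WHERE { "
                + "<" + requireSafeIri(articleIri) + "> rdf:type ?supercat . "
                + "}";
    }

    public static void main(String[] args) {
        System.out.println(buildTypeQuery("http://yago-knowledge.org/resource/Pizza"));
    }
}
```

A value such as `http://x/> ?s ?p . } #` is rejected with an `IllegalArgumentException` instead of being spliced into the query.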

1 Answer

It's not exactly clear what you're asking. A category has super-categories and sub-categories, and an article belongs to categories, but an article doesn't have parent categories. If you look at the HTML rendering of a DBpedia resource, you can see that its categories are values of the dcterms:subject property. E.g., at dbpedia:Pizza, you can see

  • dcterms:subject
    • category:Flatbreads
    • category:Greek_inventions
    • category:Italian_cuisine
    • category:Italian_inventions
    • category:Mediterranean_cuisine
    • category:Pizza
    • category:World_cuisine

So, you can use a query like this to retrieve those values:

select ?category { dbpedia:Pizza dcterms:subject ?category }

Now, if you have a category, e.g., category:Flatbreads, and you actually want its supercategories, you can see that they're connected by the skos:broader property. So:

select ?supercategory { category:Flatbreads skos:broader ?supercategory }
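
The two lookups above can also be combined into one query, so that each category of an article comes back together with its broader categories. The following is a rough sketch, not a definitive implementation: the class and method names (`CategoryQueries`, `categoriesWithParents`) are hypothetical, the prefixes are spelled out instead of relying on the endpoint's predefined ones, and `articleIri` is assumed to be an already-validated absolute IRI.

```java
final class CategoryQueries {
    // Standard namespace IRIs for Dublin Core terms and SKOS.
    static final String PREFIXES =
            "PREFIX dcterms: <http://purl.org/dc/terms/>\n"
            + "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n";

    /**
     * Builds a query returning each category of the article and, where present,
     * the categories one skos:broader step above it.
     * Assumption: articleIri is trusted or has been validated by the caller.
     */
    static String categoriesWithParents(String articleIri) {
        return PREFIXES
                + "SELECT ?category ?supercategory WHERE {\n"
                + "  <" + articleIri + "> dcterms:subject ?category .\n"
                // OPTIONAL keeps categories that have no broader category.
                + "  OPTIONAL { ?category skos:broader ?supercategory }\n"
                + "}";
    }

    public static void main(String[] args) {
        System.out.println(categoriesWithParents("http://dbpedia.org/resource/Pizza"));
    }
}
```

The resulting string can be passed to whatever query execution mechanism you already use with Jena.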

Joshua Taylor
  • Thanks, that helps a bit. However, my problem is that I have a lot of articles and I want to get one meaningful supercategory for each one (doing it in an automated way would be great) so I can construct something like a hierarchical tree from all that. I don't know if I'm being clear enough; don't hesitate to ask if not. And thanks so much for your time and your help – paskun Nov 03 '14 at 18:21
  • What do you mean by supercategory? Articles belong to categories, and categories may have supercategories, but articles themselves don't have supercategories. What does "meaningful" mean here? Stack Overflow is a great place for specific technical questions, but can't really help define "meaningful"; that's more tied to your specific application. If you can quantitatively define "meaningful", though, we can probably come up with a SPARQL query to retrieve it. – Joshua Taylor Nov 03 '14 at 18:46
  • @paskun You could, for instance, retrieve the category that has the largest number of articles, but is that necessarily meaningful? Maybe, but some categories aren't all that helpful (e.g., if there were a category "articles with the word 'the'", then just about *every* article would belong to it). – Joshua Taylor Nov 03 '14 at 18:47
  • What I meant by "meaningful" is a parent category that generalizes the article. For example, "Football" could be a parent of "FIFA World Cup". The main objective of all this is to construct a tree of interests based on the articles that a Wikipedia user has edited. – paskun Nov 03 '14 at 18:53
  • @paskun My point is that FIFA World Cup has *five* categories (and football isn't one of them): FIFA World Cup; FIFA competitions; World championships; Recurring sporting events established in 1930; Quadrennial sporting events. Do you have any specific way to say which *one* of those is *the* meaningful, or *the most* meaningful, one? If you don't have a specific way of deciding, then you certainly won't be able to have code that does it; software can only do things that we tell it to, after all. – Joshua Taylor Nov 03 '14 at 19:12
  • I noticed that in the results that I get, the word "football" is the most frequent one. I think taking the most frequent term could be a solution. However, I may need a way to automate the checking of these frequent terms. I'm only guessing now, but maybe there is a way to check on the Wikipedia graph that you can go from "Football" to "2014 FIFA World Cup", and maybe I can use a condition on the number of nodes that I have to traverse to get there to get better results. – paskun Nov 03 '14 at 19:30
  • @paskun **"I noticed that in the results that i get the word football is the most frequent one."** I'm not sure what results you're talking about. The only query you've shown is `dbpedia:FIFA_World_Cup rdf:type ?class`, and I don't see "football" in any of the values of rdf:type in the [HTML rendering](http://dbpedia.org/page/FIFA_World_Cup). In what results does it appear a lot? – Joshua Taylor Nov 04 '14 at 00:15
  • In the query "dbpedia:2014_FIFA_World_Cup rdf:type ?class". I guess we weren't searching for the same thing. My base article was "2014_FIFA_World_Cup" and yours was "FIFA_World_Cup". I guess you opened my eyes to another problem. Still, considering that the word FIFA is frequent in that query, I guess it can be used, or at least used as a base, while trying to find a more "generic" term, if possible. – paskun Nov 04 '14 at 04:03
  • @JoshuaTaylor Please let me know if you know an answer for this https://stackoverflow.com/questions/54625493/how-to-group-wikipedia-categories-in-python Thank you :) – EmJ Feb 15 '19 at 07:25
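
The "most frequent term" idea from the comments can be made concrete with a small sketch. Everything here is an assumption for illustration: the class and method names (`FrequentTerm`, `mostFrequentTerm`), the rule of skipping tokens shorter than three characters, and breaking ties alphabetically. The input is a plain list of category labels, however they were retrieved.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

final class FrequentTerm {
    /** Returns the most frequent word across the labels; ties go to the alphabetically first word. */
    static Optional<String> mostFrequentTerm(List<String> labels) {
        // TreeMap keeps keys sorted, which makes the tie-breaking below deterministic.
        Map<String, Integer> counts = new TreeMap<>();
        for (String label : labels) {
            // Split on anything that is not a letter or a decimal digit.
            for (String token : label.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
                if (token.length() < 3) {
                    continue; // skip empty tokens and very short words such as "in"
                }
                counts.merge(token, 1, Integer::sum);
            }
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) { // strict '>' keeps the earliest key on ties
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        List<String> labels = List.of(
                "FIFA World Cup",
                "FIFA competitions",
                "World championships",
                "Recurring sporting events established in 1930",
                "Quadrennial sporting events");
        System.out.println(mostFrequentTerm(labels).orElse("(none)"));
    }
}
```

For the five FIFA World Cup categories listed in the comments, "events", "fifa", "sporting", and "world" each occur twice, so this sketch returns "events". That supports the point made above: raw frequency alone does not say which term is meaningful, so some additional criterion would still have to be defined.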