
After some FILTERs I end up with two sets, let's say A1 and A2, and I want to SELECT only those elements in A1 that do not appear in A2. I tried to use MINUS, but without success.

When we have something like:

MINUS { ?s foaf:givenName "Bob" }

We need to know in advance what we want to subtract. In my case, I don't know any property like `foaf:givenName`, except that they belong to set A2. (That is a property, though.)

I am confused. Any ideas?

EDIT WITH EXAMPLE FOR CLARITY:

SELECT DISTINCT ?x ?y WHERE { 

?x swrc:listAuthor ?y.
?x swrc:author ?w.
FILTER (!regex(?y, " and ")).
?a swrc:listAuthor ?b.
?a swrc:author ?c.
FILTER regex(?b, " and ").
FILTER(?c != ?w).}

So what I am trying to do is the following. With listAuthor I get the authors in a string like "John Doe and John Nipper". Taking advantage of this format, I want the authors that wrote a paper alone (no "and" in their listAuthor). The first 3 lines are enough for that. But there are some authors that wrote 2 papers, 1 alone and 1 with co-authors. I am trying to somehow subtract them from the first ones. Any ideas?

Paramar
  • Please show (a minimal version of) the kind of query that you're talking about. Except for aggregates, SPARQL doesn't really deal with _sets_ of values, so it's not quite clear what you're asking. "After some FILTER I end up having 2 sets" doesn't make sense; `filter` just removes _rows_ from the result set; it doesn't partition a set. What are these sets that you're talking about? An example of your data and query would probably make this much clearer. – Joshua Taylor Jan 27 '14 at 22:09
  • I tried to make it clear now – Paramar Jan 27 '14 at 22:29
  • Thanks, I think it's clearer now (if I have more questions, I'll ask, though :)). – Joshua Taylor Jan 27 '14 at 22:50
  • You said "I want to have the authors that wrote a paper alone", but you're selecting `?x` and `?y`, where `?x` is a paper and `?y` is the string "author1 and author2". Are you trying to get papers that have a single author? – Joshua Taylor Jan 27 '14 at 23:08
  • On second thought, looking at your query again, it looks like you're trying to get authors who have only written papers that don't have coauthors; that is, authors (and their papers) who have _never_ coauthored a paper with someone else. Is that right? – Joshua Taylor Jan 27 '14 at 23:10

1 Answer


Finding authors with no coäuthors

If I understand your question correctly, you're trying to ask for authors (and their papers) who have never coauthored a paper with someone else. You don't actually need to match the author list to do this if the papers are related to the authors by the :author property. These problems are always much easier if we have some data to work with, so consider this data:

@prefix : <http://stackoverflow.com/q/21391444/1281433/> .

:p1 :author :a, :b .
:p2 :author :a .
:p3 :author :b, :c .
:p4 :author :d .

A has written a paper with B, and also alone. B has written a paper with A, and also with C. C has written a paper with B. D has written a paper alone.

We can use a query like this to find all the authors who have never coauthored a paper (in this case, D):

prefix : <http://stackoverflow.com/q/21391444/1281433/>

select ?author ?paper where {
  ?paper :author ?author .
  filter not exists { 
    ?paper2 :author ?author, ?otherAuthor .
    filter ( ?author != ?otherAuthor )
  }
}

This corresponds to the English:

Find papers with authors such that there is no paper by that author with another author.

We get the expected results:

------------------
| author | paper |
==================
| :d     | :p4   |
------------------
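
Since the question asked about MINUS specifically, here is a minimal sketch of the same exclusion written with `minus` instead of `filter not exists`. It assumes the same sample data and prefix as above, and I haven't run it against the asker's data. On the sample data it should leave the same single row for :d and :p4.

```sparql
prefix : <http://stackoverflow.com/q/21391444/1281433/>

select ?author ?paper where {
  ?paper :author ?author .
  # subtract every ?author who shares some paper with a different author
  minus {
    ?paper2 :author ?author, ?otherAuthor .
    filter ( ?author != ?otherAuthor )
  }
}
```

The subtraction only has an effect because both sides bind ?author. A `minus` block that shares no variable with the pattern before it removes nothing, which is one way an attempt with MINUS can appear to do nothing at all.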

If you still wanted to pick and exclude based on regular expressions in the author list string, you can do that with a query like this:

prefix : <http://stackoverflow.com/q/21391444/1281433/>

select ?author ?paper where {
  # find authors of papers with no coauthors
  ?paper :author ?author ; :listAuthor ?list .
  filter(!regex(?list," and "))

  # and remove those that coauthored some paper
  filter not exists { 
    ?paper2 :author ?author ; :listAuthor ?list2 .
    filter(regex(?list2," and "))
  }
}

Debugging the original query

The original query can be abbreviated as the following, which is exactly the same, except for some syntactic sugar.

SELECT DISTINCT ?x ?y WHERE { 
  ?x swrc:listAuthor ?y ; swrc:author ?w.
  FILTER (!regex(?y, " and ")).

  ?a swrc:listAuthor ?b ; swrc:author ?c.
  FILTER regex(?b, " and ").

  FILTER(?c != ?w).
}

Aside from the filter at the end, the pattern on ?x, ?y, and ?w is completely separate from the pattern on ?a, ?b, and ?c. From the first pattern, you'll get one binding for each author of each paper with just one author (which means one binding for each paper with just one author). From the second pattern, you'll get one binding for each author of each paper with multiple authors. Then you're essentially taking the cartesian product of these two sets of (paper, author) pairs to get bindings of the form (paper1, author1, paper2, author2), and the final filter says "remove any bindings where author1 is the same as author2."

Consider what this means for the data I gave above, but let's look just at papers :p1 and :p2. Since :a authored :p2 alone, we'll have :a as ?w and :p2 as ?x:

?x   ?w
-------
:p2  :a

However, since :a also authored paper :p1 with :b, we'll have some rows for ?a and ?c:

?a   ?c
-------
:p1  :a
:p1  :b

Now the cartesian product is:

?x  ?w  ?a  ?c
--------------
:p2 :a  :p1 :a
:p2 :a  :p1 :b

The filter removes the first of these rows, leaving us with

?x  ?w  ?a  ?c
--------------
:p2 :a  :p1 :b

and this has :a as ?w, even though :a coäuthored papers with someone. In general:

  • Each ?x is a paper with a single author (?w).
  • Each ?w is an author who has written a paper alone (?x).
  • Each ?a is a paper with multiple authors, one of which is ?c. Nothing requires ?w to be one of its authors; the only link between the two patterns is the filter ?c != ?w.
  • Each ?c is an author who has coauthored a paper (?a) and who is different from ?w, the author of the solo paper (?x).
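
For comparison, here is a minimal sketch of how the original query could be restructured so that it keeps its variable names and regex tests but actually excludes authors with a coauthored paper. The `swrc:` namespace IRI below is an assumption (the question doesn't show its prefix declaration), and the query is untested on the asker's data.

```sparql
# Assumed namespace; replace with whatever the dataset actually uses.
PREFIX swrc: <http://swrc.ontoware.org/ontology#>

SELECT DISTINCT ?x ?y WHERE {
  # papers (?x) whose author list (?y) names a single author (?w)
  ?x swrc:listAuthor ?y ; swrc:author ?w .
  FILTER (!regex(?y, " and "))

  # drop ?w if the same author also appears on a paper with several authors
  FILTER NOT EXISTS {
    ?a swrc:listAuthor ?b ; swrc:author ?w .
    FILTER regex(?b, " and ")
  }
}
```

The important change is that the inner pattern reuses ?w rather than introducing ?c and comparing the two afterwards. Reusing the variable ties the coauthored paper to the same author, whereas `FILTER(?c != ?w)` only discards rows in which the two authors happen to coincide.
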
Joshua Taylor
  • That was exactly what I meant. I will go check your code and return to upvote you soon enough – Paramar Jan 27 '14 at 23:45
  • It is correct. Could you however tell me if there is a way of correcting what I wrote to retrieve the same results, while following the filter-like logic with regex that I tried to implement? I understand why your code is correct, but it is still not clear to me why mine is not. – Paramar Jan 28 '14 at 00:01
  • The first query works but the second doesn't return what I expected (different results from the first query). I get a name of an author that wrote two papers, one alone but also another one in collaboration with someone else (the solo paper appears in the results). This shouldn't have happened. Just for the record I checked the authorlist and it has "and" inside it. – Paramar Jan 29 '14 at 22:45
  • I'm sorry, I'm not sure what you mean by "the first query" and "the second [query]". You wrote one query and I wrote one query. Also, what data are you running on? I didn't provide data with author lists, so I don't have access to the data you're talking about. We can't diagnose problems in queries that we don't have on data that we don't have… – Joshua Taylor Jan 29 '14 at 22:53
  • By first query I meant yours, by second mine (more like mine after you debugged it). As I said before, and as you cite at the beginning of your answer: "ask for authors (and their papers) who have never coauthored a paper with someone else." Your query returns exactly those authors when I run it on my data. But the second (mine debugged) also returns some that have co-authored papers, while being single authors of some other paper too. Is it clearer now? – Paramar Jan 29 '14 at 23:38
  • @Paramar Yes, I think I understand you now. As I explained in the answer, "Each ?w is an author who has written at least one paper alone [and one such paper is ?x]." The query does _not_ restrict `?w` to authors who have _only_ written papers alone. – Joshua Taylor Jan 30 '14 at 00:59
  • I totally agree, but it should have restricted it. That is practically, the reason I asked, because I couldn't find how to restrict it by myself. The query you wrote guarantees that. But when you debugged mine, that was the case no more. Of course the question stays the same. We want to restrict to authors who have only written papers alone. – Paramar Jan 30 '14 at 01:22
  • @Paramar I wasn't trying to fix your initial query in that explanation, I was trying to explain what it was actually returning and why. – Joshua Taylor Jan 30 '14 at 02:20
  • Thank you. Any idea on how to fix it though? – Paramar Jan 30 '14 at 13:48
  • @Paramar In short, you more or less need a query like the one I provided. The result you're looking for is of the form "Every X with property P EXCEPT Xs that also have property Q", and you do that with `{ …x has property p… filter not exists { …x has property q… } }`. Your original query is taking the Cartesian product of Xs that have property P and Ys that have property Q and removing the rows where X and Y are the same, but that still leaves results that have Xs that have property Q. The way to fix this is by using a query like mine. :) Of course, you can still make P and Q regex-based. – Joshua Taylor Jan 30 '14 at 14:49
  • @Paramar I did add an (untested, because I don't have your data) query that shows how you could write one like mine, but using the regular expressions to match. – Joshua Taylor Jan 30 '14 at 14:53
  • Thank you. The results match exactly those I expected – Paramar Jan 30 '14 at 20:28