0

I have the following psql table. In reality, it has roughly 2 billion rows.

 id  word      lemma     pos              textid  country_genre     
 1  Stuffing   stuff      vvg             190568  AN         
 2  her        her        appge           190568  AN         
 3  key        key        nn1             190568  AN         
 4  into       into       ii              190568  AN         
 5  the        the        at              190568  AN         
 6  lock       lock       nn1             190568  AN         
 7  she        she        appge           190568  AN         
 8  pushed     push       vvd             190568  AN         
 9  her        her        appge           190568  AN         
10  way        way        nn1             190568  AN         
11  into       into       ii              190568  AN         
12  the        the        appge           190568  AN         
13  house      house      nn1             190568  AN         
14  .                     .               190568  AN         
15  She        she        appge           190568  AN         
16  had        have       vhd             190568  AN         
17  also       also       rr              190568  AN         
18  cajoled    cajole     vvd             190568  AN         
19  her        her        appge           190568  AN         
20  way        way        nn1             190568  AN         
21  into       into       ii              190568  AN         
22  the        the        at              190568  AN         
23  home       home       nn1             190568  AN         
24  .                     .               190568  AN         
..  ...        ...        ..              ...     ..

I would like to create the following table, which shows all "way"-idioms with the words side-by-side and some data from the columns "country_genre", "lemma" and "pos".

country_genre word   word       word       lemma      pos        word       word     word       word       word       lemma      pos        word       word       
AN         lock   she        pushed     push       vvd        her        way      into       the        house      house      nn1        .          she
AN         had    also       cajoled    cajole     vvd        her        way      into       the        home       home       nn1        .          A          
AN         tried  to         force      force      vvi        her        way      into       the        palace     palace     nn1        ,          officials  

I use the following code (thanks to Bohemian: https://stackoverflow.com/a/47496945/3957383!):

copy(

 SELECT
   c1.id, c1.country_genre, c1.textid, c1.wordid, c1.word,  c2.word, c3.word,  c4.word, c4.lemma, c4.pos, c5.word, c6.word, c7.word, c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word

 FROM

 orderedflatcorpus AS c1
 JOIN orderedflatcorpus AS c2 ON c1.id + 1 = c2.id
 JOIN orderedflatcorpus AS c3 ON c1.id + 2 = c3.id 
 JOIN orderedflatcorpus AS c4 ON c1.id + 3 = c4.id
 JOIN orderedflatcorpus AS c5 ON c1.id + 4 = c5.id
 JOIN orderedflatcorpus AS c6 ON c1.id + 5 = c6.id
 JOIN orderedflatcorpus AS c7 ON c1.id + 6 = c7.id
 JOIN orderedflatcorpus AS c8 ON c1.id + 7 = c8.id
 JOIN orderedflatcorpus AS c9 ON c1.id + 8 = c9.id
 JOIN orderedflatcorpus AS c10 ON c1.id + 9 = c10.id
 JOIN orderedflatcorpus AS c11 ON c1.id + 10 = c11.id

 WHERE

 c4.pos LIKE 'vv%'
 AND c5.pos = 'appge'
 AND c6.word = 'way'
 AND c7.pos LIKE 'i%'
 AND c8.word = 'the'
 AND c9.pos LIKE 'n%'
 )

 TO

 '/home/postgres/Results/OUTPUT.csv'
 DELIMITER E'\t'
 csv header;

This query returns 18706 relevant constructions.

However, if I use the following code, which extracts more context (21 instead of 11 words) but is otherwise equal to the previous one, something worrying happens: I get only 18555 relevant constructions.

 copy(
 SELECT c1.id, c1.country_genre, c1.textid, c1.wordid, c1.word, c1.pos, c2.word, c2.pos, c3.word, c3.pos, c4.word, c4.pos, c5.word, c5.pos, c6.word, c6.pos, 
 c7.word, c7.pos, c8.word, c8.pos, c8.lemma, c9.word, c9.pos, c10.word, c10.pos, c11.word, c11.pos, c12.word, c12.pos, c13.word, c13.pos, c13.lemma, c14.word, 
 c14.pos, c15.word, c15.pos, c16.word, c16.pos, c17.word, c17.pos, c18.word, c18.pos, c19.word, c19.pos, c20.word, c20.pos, c21.word, c21.pos 

 FROM 

 orderedflatcorpus AS c1 
 JOIN orderedflatcorpus AS c2 ON c1.id + 1 = c2.id 
 JOIN orderedflatcorpus AS c3 ON c1.id + 2 = c3.id 
 JOIN orderedflatcorpus AS c4 ON c1.id + 3 = c4.id 
 JOIN orderedflatcorpus AS c5 ON c1.id + 4 = c5.id 
 JOIN orderedflatcorpus AS c6 ON c1.id + 5 = c6.id 
 JOIN orderedflatcorpus AS c7 ON c1.id + 6 = c7.id 
 JOIN orderedflatcorpus AS c8 ON c1.id + 7 = c8.id 
 JOIN orderedflatcorpus AS c9 ON c1.id + 8 = c9.id 
 JOIN orderedflatcorpus AS c10 ON c1.id + 9 = c10.id 
 JOIN orderedflatcorpus AS c11 ON c1.id + 10 = c11.id 
 JOIN orderedflatcorpus AS c12 ON c1.id + 11 = c12.id 
 JOIN orderedflatcorpus AS c13 ON c1.id + 12 = c13.id 
 JOIN orderedflatcorpus AS c14 ON c1.id + 13 = c14.id 
 JOIN orderedflatcorpus AS c15 ON c1.id + 14 = c15.id 
 JOIN orderedflatcorpus AS c16 ON c1.id + 15 = c16.id 
 JOIN orderedflatcorpus AS c17 ON c1.id + 16 = c17.id 
 JOIN orderedflatcorpus AS c18 ON c1.id + 17 = c18.id 
 JOIN orderedflatcorpus AS c19 ON c1.id + 18 = c19.id 
 JOIN orderedflatcorpus AS c20 ON c1.id + 19 = c20.id 
 JOIN orderedflatcorpus AS c21 ON c1.id + 20 = c21.id 

 WHERE 

 c8.pos LIKE 'vv%' 
 AND c9.pos = 'appge' 
 AND c10.word = 'way' 
 AND c11.pos LIKE 'i%' 
 AND c12.word = 'the' 
 AND c13.pos LIKE 'n%' 
 ) 
 TO '/home/postgres/Results/OUTPUT.csv' DELIMITER E'\t' csv header;

I have looked at the lines which are missing in the second query but I cannot detect any pattern in what has been left out.

Does any one have an idea what might be going on here? Thanks!

Znusgy
  • 37
  • 1
  • 5
  • If this is a continuation to a previous question, then you should at the very least provide that link. Better yet, make this question stand on its own two feet by actually telling us what you are doing here. – Tim Biegeleisen Dec 17 '17 at 08:22
  • Thanks for the hint, I’ve just added a link! – Znusgy Dec 17 '17 at 08:28

2 Answers2

0

In Postgres JOIN is equivalent to INNER JOIN. See this for example.

In other words your query requires match with all ON predicates to have the result. Adding more JOINS increases also ON predicates that could not match thus decreasing the amount of results.

Try to change JOIN to LEFT JOIN (aka LEFT OUTER JOIN) to see if you get more results.

Without seeing your example data and example of what is missing from the result set it is hard to say what else could be the problem.

pirho
  • 11,565
  • 12
  • 43
  • 70
0

Pirho's answer is correct -- as far as it goes. However, just changing the joins to outer joins may not solve your problem.

First, the where conditions for the two queries are different, so I have no reason to assume that they will return the same result set.

The second query is looking for exactly 21 words in a row -- not 19 words, not 13 words. So if the pattern is at the end of a document, it won't be found.

Based on your conditions, there are 13 words that need to appear in a row (to get to the count "13" for your conditions. Then, I am guessing, an optional 7 rows. This requires a mixture of inner and outer joins:

FROM orderedflatcorpus c1 JOIN
     orderedflatcorpus c2
     ON c1.id + 1 = c2.id JOIN
     orderedflatcorpus c3
     ON c1.id + 2 = c3.id JOIN
     orderedflatcorpus c4
     ON c1.id + 3 = c4.id JOIN
     orderedflatcorpus c5
     ON c1.id + 4 = c5.id JOIN
     orderedflatcorpus c6
     ON c1.id + 5 = c6.id JOIN
     orderedflatcorpus c7
     ON c1.id + 6 = c7.id JOIN
     orderedflatcorpus c8
     ON c1.id + 7 = c8.id JOIN
     orderedflatcorpus c9
     ON c1.id + 8 = c9.id JOIN
     orderedflatcorpus c10
     ON c1.id + 9 = c10.id JOIN
     orderedflatcorpus c11
     ON c1.id + 10 = c11.id JOIN
     orderedflatcorpus c12
     ON c1.id + 11 = c12.id JOIN
     orderedflatcorpus c13
     ON c1.id + 12 = c13.id 
     orderedflatcorpus c14
     ON c1.id + 13 = c14.id LEFT JOIN
     orderedflatcorpus c15
     ON c1.id + 14 = c15.id LEFT JOIN
     orderedflatcorpus c16
     ON c1.id + 15 = c16.id LEFT JOIN
     orderedflatcorpus c17
     ON c1.id + 16 = c17.id LEFT JOIN
     orderedflatcorpus c18
     ON c1.id + 17 = c18.id LEFT JOIN
     orderedflatcorpus c19
     ON c1.id + 18 = c19.id LEFT JOIN
     orderedflatcorpus c20
     ON c1.id + 19 = c20.id LEFT JOIN
     orderedflatcorpus c21
     ON c1.id + 20 = c21.id 
WHERE c8.pos LIKE 'vv%' AND
      c9.pos = 'appge' AND
      c10.word = 'way' AND
      c11.pos LIKE 'i%' AND
      c12.word = 'the' AND
      c13.pos LIKE 'n%' ;

Next: I suspect that country and/or textid should be part of the join conditions.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Thanks! Unfortunately, this does not solve the problem. The number of lines I get still depends on the number of columns I extract (the numbers differ by 57), even though the WHERE conditions I use are exactly the same. I just don't understand why more columns should cause the algorithm to omit rows which are relevant (despite the LEFT JOINS). Any help would be greatly appreciated. Thanks! – Znusgy Dec 21 '17 at 10:47