I have the following psql table. It has roughly 2 billion rows in total.
id word lemma pos textid source
1 Stuffing stuff vvg 190568 AN
2 her her appge 190568 AN
3 key key nn1 190568 AN
4 into into ii 190568 AN
5 the the at 190568 AN
6 lock lock nn1 190568 AN
7 she she appge 190568 AN
8 pushed push vvd 190568 AN
9 her her appge 190568 AN
10 way way nn1 190568 AN
11 into into ii 190568 AN
12 the the appge 190568 AN
13 house house nn1 190568 AN
14 . . 190568 AN
15 She she appge 190568 AN
16 had have vhd 190568 AN
17 also also rr 190568 AN
18 cajoled cajole vvd 190568 AN
19 her her appge 190568 AN
20 way way nn1 190568 AN
21 into into ii 190568 AN
22 the the at 190568 AN
23 home home nn1 190568 AN
24 . . 190568 AN
.. ... ... .. ... ..
I would like to create the following table, which shows all "way"-constructions with the words side-by-side and some data from the columns "source", "lemma" and "pos".
source word word word lemma pos word word word word word lemma pos word word
AN lock she pushed push vvd her way into the house house nn1 . she
AN had also cajoled cajole vvd her way into the home home nn1 . A
AN tried to force force vvi her way into the palace palace nn1 , officials
Here you can see the code I use:
copy(
SELECT c1.source, c1.word, c2.word, c3.word, c4.word, c4.lemma, c4.pos, c5.word, c6.word, c7.word, c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word
FROM
orderedflatcorpus AS c1, orderedflatcorpus AS c2, orderedflatcorpus AS c3, orderedflatcorpus AS c4, orderedflatcorpus AS c5, orderedflatcorpus AS c6, orderedflatcorpus AS c7, orderedflatcorpus AS c8, orderedflatcorpus AS c9, orderedflatcorpus AS c10, orderedflatcorpus AS c11
WHERE
c1.word LIKE '%' AND
c2.word LIKE '%' AND
c3.word LIKE '%' AND
c4.pos LIKE 'v%' AND
c5.pos = 'appge' AND
c6.lemma = 'way' AND
c7.pos LIKE 'i%' AND
c8.word = 'the' AND
c9.pos LIKE 'n%' AND
c10.word LIKE '%' AND
c11.word LIKE '%'
AND
c1.id + 1 = c2.id AND c1.id + 2 = c3.id AND c1.id + 3 = c4.id AND c1.id + 4 = c5.id AND c1.id + 5 = c6.id AND c1.id + 6 = c7.id AND c1.id + 7 = c8.id AND c1.id + 8 = c9.id AND c1.id + 9 = c10.id AND c1.id + 10 = c11.id
ORDER BY c1.id
)
TO
'/home/postgres/Results/OUTPUT.csv'
DELIMITER E'\t'
csv header;
The query takes almost 9 hours to execute for the two billion rows (the result has about 19,000 rows).
What could I do to improve performance?
The word, pos and lemma columns already have btree indices.
Should I stick to my code and simply use a more powerful server with more cores/a faster CPU and more RAM (mine has only 8 GBs of RAM, a mere 2 cores and 2.8 GHz) ? Or would you recommend a different, more efficient SQL query?
Thanks!