I have data of the following form:
#@ <abc>
<http://stackoverflow.com/questions/ask> <question> _:question1 .
#@ <def>
<The> <second> <http://line> .
#@ <ghi>
_:question1 <http#responseCode> "200"^^<http://integer> .
#@ <klm>
<The> <second> <http://line1.xml> .
#@ <nop>
_:question1 <date> "Mon, 23 Apr 2012 13:49:27 GMT" .
#@ <jkn>
<G> <http#fifth> "200"^^<http://integer> .
#@ <k93>
_:question1 <http#responseCode> "200"^^<http://integer> .
#@ <k22>
<This> <third> <http://line2.xml> .
#@ <k73>
<http://site1> <hasAddress> <http://addr1> .
#@ <i27>
<kd8> <fourth> <http://addr2.xml> .
Whenever two lines are equal (they match each other character by character), for example the two occurrences of
_:question1 <http#responseCode> "200"^^<http://integer> .
I want to delete every copy of the equal line along with (i) the line before it (which begins with #@) and (ii) the subsequent two lines, i.e. the next #@ line and the line after it (which ends with a full stop). For the data above, the expected output is:
#@ <abc>
<http://stackoverflow.com/questions/ask> <question> _:question1 .
#@ <def>
<The> <second> <http://line> .
#@ <nop>
_:question1 <date> "Mon, 23 Apr 2012 13:49:27 GMT" .
#@ <jkn>
<G> <http#fifth> "200"^^<http://integer> .
#@ <k73>
<http://site1> <hasAddress> <http://addr1> .
#@ <i27>
<kd8> <fourth> <http://addr2.xml> .
One way to do this is to store all the lines in a Python set and, whenever two lines are equal (i.e. they match character by character), delete them together with the previous line and the subsequent two lines. However, my dataset is 100 GB and I have only 64 GB of RAM, so I cannot keep all of this in a set in main memory. Is there some way to delete the duplicate lines, along with the previous line and the subsequent two lines, in Python with limited main memory (64 GB of RAM)?
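To make the requirement concrete, here is a minimal sketch of one possible out-of-core approach, not a definitive solution. It assumes the file is a strict sequence of two-line records (a #@ line followed by one data line), that the list of record numbers to delete fits in memory, and that there is enough free disk space for a temporary copy of the data lines. The function name drop_duplicate_records and the parameters n_buckets and tmp_dir are made up for illustration. The idea is to scatter the data lines into bucket files by a hash, so that equal lines always land in the same bucket and each bucket can be checked for duplicates on its own.

```python
import os
import tempfile
import zlib


def drop_duplicate_records(src, dst, n_buckets=256, tmp_dir=None):
    """Copy src to dst, skipping duplicated data lines and their neighbours.

    Assumption: src consists of two-line records, a '#@' line followed by
    one data line. Record numbers start at 0.
    """
    work = tempfile.mkdtemp(dir=tmp_dir)
    paths = [os.path.join(work, "bucket%d" % b) for b in range(n_buckets)]

    # Pass 1: write (record number, data line) into a bucket chosen by a
    # hash of the data line, so identical lines share a bucket.
    buckets = [open(p, "wb") for p in paths]
    with open(src, "rb") as f:
        for idx, _header in enumerate(f):
            data = next(f, b"").rstrip(b"\r\n")
            bucket = buckets[zlib.crc32(data) % n_buckets]
            bucket.write(b"%d\t%s\n" % (idx, data))
    for bucket in buckets:
        bucket.close()

    # Pass 2: load one bucket at a time and collect the records to drop:
    # every copy of a duplicated line plus the record that follows it.
    drop = set()
    for p in paths:
        first_seen = {}
        with open(p, "rb") as f:
            for row in f:
                idx_text, data = row.rstrip(b"\n").split(b"\t", 1)
                idx = int(idx_text)
                if data in first_seen:
                    first = first_seen[data]
                    drop.update((first, first + 1, idx, idx + 1))
                else:
                    first_seen[data] = idx
        os.remove(p)
    os.rmdir(work)

    # Pass 3: stream the input again and keep only the surviving records.
    with open(src, "rb") as f, open(dst, "wb") as out:
        for idx, header in enumerate(f):
            data = next(f, b"")
            if idx not in drop:
                out.write(header)
                out.write(data)
```

With 256 buckets, each bucket holds roughly 1/256 of the data lines, so only that fraction has to be in memory at once. If a bucket is still too large, n_buckets can be raised, subject to the operating system's limit on open files. A call would look like drop_duplicate_records("input.nt", "output.nt"), where both file names are placeholders.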