
I am trying to extract the even lines from a very big file (~300 GB). The same code worked for another file of almost the same size, but with this one I get an error. The code is:

import itertools
import sys, os

with open('FILE.fasta') as f:
    fd = open("FILE.txt","w")
    fd.writelines(set(itertools.islice(f, 0, None, 2)))
    fd.close()

And the error is:

   Traceback (most recent call last):
   File "new3.py", line 7, in <module>
   fd.writelines(set(itertools.islice(f, 0, None, 2)))
   SystemError: Negative size passed to PyString_FromStringAndSize

Do you think this is because the file is too big? I checked the memory usage while the code was running, and it never exceeded 50%.

I would appreciate any help!

bapors
  • Sounds more like an overflow in PyString_FromStringAndSize. Can you move the itertools call into a temp variable? Then we would have a more useful stack trace – Christian Sauer Sep 22 '17 at 08:18
  • Given your question, iterate over the input file handle with `enumerate()` to get the line numbers, and write the even lines – Chris_Rands Sep 22 '17 at 08:18
  • Also, if you only need every other line, is the `set` necessary? – Cristian Lupascu Sep 22 '17 at 08:21
  • The source for the function is here: https://svn.python.org/projects/python/trunk/Objects/stringobject.c It is not itertools' fault but Python's itself. – Christian Sauer Sep 22 '17 at 08:22
  • @ChristianSauer Thank you for your reply. Is it possible that you can explain how this file would help me? Sorry that I got a bit confused – bapors Sep 22 '17 at 09:19
  • @GolfWolf because some lines are duplicates or triplicates, and I only want one of them – bapors Sep 22 '17 at 09:19
  • @Chris_Rands Wouldn't it also consume a lot of memory? Counting lines is itself very memory-consuming – bapors Sep 22 '17 at 09:21
  • @bapors No, `enumerate()` returns an iterator and you only hold one line in memory at a time. Also, it's now clear your question has a second part (removing duplicate lines), and `set`s are not ordered! You want something like this: https://stackoverflow.com/questions/1215208/how-might-i-remove-duplicate-lines-from-a-file – Chris_Rands Sep 22 '17 at 09:41
  • @Chris_Rands But then I would have to keep the even lines, so wouldn't that again be very large? Can you show an example? – bapors Sep 22 '17 at 09:42

1 Answer


Don't build a set from the underlying iterator: it is an extremely expensive operation, since the set must hold every distinct line in memory at once. You should be able to pass the iterator to writelines directly:

fd.writelines(itertools.islice(f, 0, None, 2))
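
A fuller sketch of the same idea might look like the following. The function name `write_every_other_line` and its `start` parameter are made up for illustration, and opening both files in binary mode is an assumption (it keeps lines as raw bytes and behaves the same on Python 2.7 and 3). It does not remove duplicates.

```python
import itertools


def write_every_other_line(src_path, dst_path, start=0):
    """Copy every second line of src_path to dst_path.

    start=0 keeps the 1st, 3rd, 5th, ... lines;
    start=1 keeps the 2nd, 4th, 6th, ... lines.
    """
    # Binary mode (an assumption): lines are passed through as raw bytes.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # islice is lazy, so writelines pulls lines from the input on demand
        # instead of building the whole selection in memory first.
        dst.writelines(itertools.islice(src, start, None, 2))
```

With the file names from the question, this would be called as `write_every_other_line("FILE.fasta", "FILE.txt")`. Note that `islice(f, 0, None, 2)` starts at the first line of the file; if "even lines" means the 2nd, 4th, and so on, pass `start=1`.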

Another small nit:

You don't need to write

import sys, os

because the script never uses sys or os. You can simply delete that line.

Ashalynd
  • Thank you for your reply; however, that does not work. I still see duplicates if I do not wrap it in `set` – bapors Sep 22 '17 at 09:16
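
If duplicates also have to be removed, as the comment above says, one possible sketch is to stream the selected lines and remember only a fixed-size digest of each line already written, rather than collecting the lines themselves in a set. The function name, the use of MD5 as a fingerprint, and binary mode are all assumptions made for illustration; none of them come from the original post. Memory still grows with the number of distinct lines (a 16-byte digest plus set overhead per line), and a digest collision, however unlikely, would drop a line that is not a real duplicate.

```python
import hashlib
import itertools


def write_unique_every_other_line(src_path, dst_path, start=0):
    """Stream every second line to dst_path, skipping lines already written."""
    seen = set()  # 16-byte MD5 digests of the lines written so far
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in itertools.islice(src, start, None, 2):
            digest = hashlib.md5(line).digest()
            if digest not in seen:
                seen.add(digest)
                # First occurrence: write it, preserving input order.
                dst.write(line)
```

Unlike passing a `set` to `writelines`, this keeps the first occurrence of each line in its original position, and only one full line is held in memory at a time.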