
I am new to Python and using Python 2.7. I am writing a program to parse raw read files. I have written a function which reads a file and puts every 4th line in a list. My file is big, say 4 GB of raw DNA data.

def filerd(f):
    identifier = []
    with open(f, 'r') as inputfile:
        count = 1
        for line in inputfile:
            if count % 4 == 1:
                identifier.append(line)
            count = count + 1
    return identifier

Now how can I parallelize this function so that I get a speedup? Is there any way I can run this function on 5 cores of my server?

  • What would be your idea of what parallelization might look like? – IonicSolutions May 07 '18 at 13:11
  • I am specifically looking for "for loop parallelism": compared with C, where a for loop can be parallelized with the OpenMP for-loop directive. – Amjad Syed May 07 '18 at 13:23
  • Well, you will need to split your data into several chunks. Then you can use `multiprocessing` to process each of these chunks in a parallel process. – IonicSolutions May 07 '18 at 13:27
  • However, I believe that your code can be made much more efficient and faster, e.g. using [`itertools.islice`](https://docs.python.org/3/library/itertools.html#itertools.islice). – IonicSolutions May 07 '18 at 13:34
  • So that means the process of splitting the data into chunks cannot be parallelized in Python? I have a large file and just need to iterate through it and create a list based on certain conditions. Can this iteration not be parallelized? – Amjad Syed May 07 '18 at 13:36
  • The problem you are likely to run into is that you will need to pass the line data across to the child processes, which could be inefficient. It *might* be more efficient to have all subprocesses read the entire file, but only process every 4th line. E.g. subprocess 1 reads all lines but only processes lines 1, 5, 9, etc.; subprocess 2 does 2, 6, 10, etc. – Tom Dalton May 07 '18 at 13:43
  • See also this answer https://stackoverflow.com/questions/14124588/shared-memory-in-multiprocessing about multiprocessing.Array which is a way of sharing memory between processes – Tom Dalton May 07 '18 at 13:44
  • Possible duplicate of [How do I parallelize a simple Python loop?](https://stackoverflow.com/questions/9786102/how-do-i-parallelize-a-simple-python-loop) – Mohammad Athar May 07 '18 at 15:17
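Following up on the chunking suggestion in the comments above, here is a minimal sketch of that approach, assuming records are exactly 4 lines long and using a chunk size that is a multiple of 4 so every chunk starts on a record boundary. The names filerd_parallel, pick_headers and read_chunks, the chunk size and the worker count of 5 are illustrative, not from the original post; as noted above, shipping the line data to the workers may cost more than it saves, so measure before committing to this.

import itertools
import multiprocessing

def pick_headers(chunk):
    # Keep every 4th line of the chunk; because the chunk length is a
    # multiple of 4, this matches the original count % 4 == 1 test.
    return chunk[::4]

def read_chunks(path, lines_per_chunk=400000):
    # lines_per_chunk must be a multiple of 4 so each chunk starts on a
    # record boundary.
    with open(path, 'r') as inputfile:
        while True:
            chunk = list(itertools.islice(inputfile, lines_per_chunk))
            if not chunk:
                break
            yield chunk

def filerd_parallel(path, workers=5):
    # Fan the chunks out to a pool of worker processes and collect the
    # selected header lines in order.
    pool = multiprocessing.Pool(workers)
    try:
        identifier = []
        for part in pool.imap(pick_headers, read_chunks(path)):
            identifier.extend(part)
        return identifier
    finally:
        pool.close()
        pool.join()

On platforms that spawn rather than fork new processes (e.g. Windows), the call to filerd_parallel has to sit under an if __name__ == '__main__': guard.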

1 Answer


As I mentioned in my comment above, you can likely gain a lot of speed just from optimizing your function. I suggest trying the following:

import itertools

def filerd(f):
    with open(f, "r") as inputfile:
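        # islice(inputfile, None, None, 4) yields lines 0, 4, 8, ... of the
        # file, i.e. the same lines the original count % 4 == 1 check kept.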
        return list(itertools.islice(inputfile, None, None, 4))

If you do not need the return value to be a list but are fine with an iterator, you can remove the list(). In that case the file has to stay open while the iterator is consumed, so turn filerd into a generator, as shown below. Either way, quite likely the slowest part will be loading the data from disk, which you need to do anyway.
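
Here is a minimal sketch of that generator variant, assuming the same 4-line record layout; the name filerd_iter and the usage example (the file name reads.fastq and the function do_something) are illustrative only:

import itertools

def filerd_iter(f):
    # Generator variant: the with block stays active while the caller is
    # still pulling lines, so the file is not closed prematurely.
    with open(f, 'r') as inputfile:
        for line in itertools.islice(inputfile, None, None, 4):
            yield line

# Usage: headers are produced lazily instead of materializing a huge list.
# for header in filerd_iter('reads.fastq'):
#     do_something(header)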

IonicSolutions