I have a 60 GB text file with about 1 billion lines. I have to extract the lines corresponding to line numbers that are read from another text file (e.g. 1, 4, 70, 100, ...). Because of the size I can't load the data into memory and then extract the lines, and matching line by line would take many days. Does a solution exist for this problem?

Two methods I have tried:

1. First method

f = open('line_numbers.txt')
lines = f.readlines()
numbers = [int(e.strip()) for e in lines]
r = max(numbers)
file = open('OUTPUT_RESULT.txt', 'w')
with open('Large_File.txt') as infile:
    for num, line in enumerate(infile, 1):
        if (num <= r):
            if (num in numbers):
                file.write(line)
            else:
                pass
            print(num)

This would take many days to produce the result.

2. Second method

import pandas as pd
data = pd.read_csv('Large_File.txt', header=None)
file = open('OUTPUT_RESULT.txt','w') 

f = open('line_numbers.txt')
lines = f.readlines()
numbers =[int(e.strip()) for e in lines]

x = data.loc[numbers,:]
file.write(x)

This fails because pandas cannot load the whole file into memory.

Is there any solution available to resolve this?

Sara S
  • You could split the file into chunks; finding the row you need would then just be a matter of finding the chunk file closest to that line number. – gold_cy Apr 02 '19 at 03:37
  • There are solutions to this, like databases. – Klaus D. Apr 02 '19 at 03:41
  • The line number file contains the line numbers (of the data to be extracted) corresponding to the large text file. Splitting it into chunks will alter the line numbers, right? – Sara S Apr 02 '19 at 03:46
  • If "it will take many days" to process a mere 60Gb text file with this method (#1), the problem is probably not your Python code, but the machine you're running on, or the connection to the storage the file is on. Using the exact same method, processing a 2Gb file only takes about 12 seconds on my simple laptop. There may be some tiny amount of overhead with a file 30x that size, but I'd expect that code to complete within about 10 minutes. – Grismar Apr 02 '19 at 03:57
  • The machine has 32 GB RAM and an up-to-date configuration. The problem with #1 is that when I want to extract data from line no. 1,800,000,000, it has to scan the file from line 1 to 1,800,000,000. That will take many days. @Grismar – Sara S Apr 02 '19 at 04:02
  • The file I just tested your code on has 50,000,000 lines and the search for 1,000 (randomly picked) lines completed within 12 seconds. I don't think 36x the amount of data should take it from 12 seconds to multiple days. Have you tried putting the data on a drive local to the script? (or the script on the machine that has the data?) – Grismar Apr 02 '19 at 04:25
  • Does this data extraction happen often? If so, I would recommend looking into [indexing your file](http://code.activestate.com/recipes/578828-indexing-text-files-with-python/), which will help immensely (a rough sketch of that idea follows these comments). If not, the answer to [this related question](https://stackoverflow.com/questions/16669428/process-very-large-20gb-text-file-line-by-line) could help. – Olivier Samson Apr 02 '19 at 03:40
  • Did you use the exact same code, or did you modify it? Can you share the code you tested? @Grismar – Sara S Apr 02 '19 at 04:37
  • @SaraS, I initially used the same code, but the one optimisation may be all you're looking for - I don't do `num in numbers`, since that's costly, but I just look at the first of the sorted numbers and remove it as soon as it has passed. – Grismar Apr 02 '19 at 06:53
  • roger that : ) @Grismar – Sara S Apr 02 '19 at 11:56
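
For reference, here is a rough sketch of the byte-offset indexing idea mentioned in the comments above. It is not the linked recipe; the file names follow the question, and the `array`-based in-memory index is an assumption on my part (about 8 bytes per line, roughly 8 GB of RAM for a billion lines; it could also be written to disk). The index is built in one full pass, after which any requested line can be reached with a single seek:

    from array import array

    # One full pass to record the byte offset where every line starts.
    # 'Q' = unsigned 64-bit ints: ~8 GB of RAM for a billion lines.
    offsets = array('Q', [0])
    with open('Large_File.txt', 'rb') as infile:
        pos = 0
        for line in infile:
            pos += len(line)
            offsets.append(pos)

    # Load the requested (1-based) line numbers.
    with open('line_numbers.txt') as f:
        numbers = [int(e.strip()) for e in f]

    # Jump straight to each requested line via its recorded byte offset.
    with open('Large_File.txt', 'rb') as infile, open('OUTPUT_RESULT.txt', 'wb') as outfile:
        for num in numbers:
            infile.seek(offsets[num - 1])
            outfile.write(infile.readline())

Once the index exists, repeated extractions no longer need to scan the whole file.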

1 Answer

Your issue is probably with the `if (num in numbers)` line. Not only are the parentheses unnecessary, but it also searches the whole list on every single iteration, even though your code goes through the file in order (first line 1, then line 2, etc.).

That can easily be optimised, and with that change the code below ran in only 12 seconds on a test file of about 50 million lines. It should process your file in a matter of minutes.

import random

# Stand-in for your real list: 1,000 random line numbers, sorted ascending.
numbers = sorted([random.randint(1, 50000000) for _ in range(1000)])

with open('specific_lines.txt', 'w') as outfile, \
        open('archive_list.txt', 'r', encoding='cp437') as infile:
    for num, line in enumerate(infile, 1):
        if not numbers:
            break                    # all requested lines found, stop reading
        if num == numbers[0]:        # only ever compare against the next wanted line
            outfile.write(line)
            print(num)
            del numbers[0]

Note: this generates 1,000 random line numbers; replace them with your loaded numbers as in your example. If your list of numbers is far greater, the time spent writing the output file will increase the execution time somewhat.

With your inputs, the code would look like this:

# Load the requested line numbers and sort them ascending.
with open('line_numbers.txt') as f:
    lines = f.readlines()
numbers = sorted([int(e.strip()) for e in lines])

with open('specific_lines.txt', 'w') as outfile, \
        open('archive_list.txt', 'r', encoding='cp437') as infile:
    for num, line in enumerate(infile, 1):
        if not numbers:
            break                    # all requested lines found, stop reading
        if num == numbers[0]:        # only ever compare against the next wanted line
            outfile.write(line)
            print(num)
            del numbers[0]
Grismar
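
As a side note, if the line-number file might be unsorted or contain duplicates, a `set` gives the same single pass with constant-time membership tests. This is a variant of the answer above (with the same illustrative file names), not part of the original post:

    # Read the wanted line numbers into a set: O(1) lookups, duplicates collapse automatically.
    with open('line_numbers.txt') as f:
        numbers = {int(e.strip()) for e in f}
    last = max(numbers)

    with open('specific_lines.txt', 'w') as outfile, \
            open('archive_list.txt', 'r', encoding='cp437') as infile:
        for num, line in enumerate(infile, 1):
            if num in numbers:
                outfile.write(line)
            if num >= last:
                break    # nothing requested beyond the largest line number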