3

I have a large file named e.g. XXX_USR.txt. I iterate through the folder, and some of the .txt files are over 500 MB. To avoid a MemoryError, I need to process the files line by line. However, my current method is far too slow. The first line is prefixed with |SYS, and every other line is prefixed with '| ' + amendtext. amendtext is a variable holding the first part of the .txt file's name, before the first underscore, e.g. "XXX".

File: XXX_USR.txt

INPUT: 

| name | car |
--------------
| Paul |Buick|
|Ringo |WV   |
|George|MG   |
| John |BMW  |

DESIRED OUTPUT:

|SYS  | name | car |
--------------------
| XXX | Paul |Buick|
| XXX |Ringo |WV   |
| XXX |George|MG   |
| XXX | John |BMW  |

My code, which is far too slow but avoids the memory error:

import os
import glob
from pathlib import Path

cwd = 'C:\\Users\\EricClapton\\'

directory = cwd

txt_files = os.path.join(directory, '*.txt')

for txt_file in glob.glob(txt_files):
    cpath = Path(txt_file).resolve().stem

    nametxt = cpath.split('_')[0]     # part of the file name before the first underscore
    amendtext = "|  " + nametxt
    systext = "|   SYS"

    with open(txt_file, 'r', errors='ignore') as f:
        get_all = f.readlines()       # loads the whole file into memory

    with open(txt_file, 'w') as f:
        for i,line in enumerate(get_all,1):
            if i == 1:
                f.write(systext + line)    # first line gets the SYS prefix
            else:
                f.write(amendtext + line)
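For comparison, the same transformation can be done in a single streaming pass: write to a temporary file next to the original and swap it in afterwards, so no `readlines()` buffer is ever built. A sketch, not the original code (the `prepend_columns` helper and the sample data are illustrative):

```python
import os
import tempfile

def prepend_columns(txt_file, systext, amendtext):
    """Stream line by line: write into a temp file, then replace the original."""
    dir_name = os.path.dirname(os.path.abspath(txt_file))
    with open(txt_file, 'r', errors='ignore') as src, \
         tempfile.NamedTemporaryFile('w', dir=dir_name, delete=False) as dst:
        first = src.readline()
        if first:
            dst.write(systext + first)   # header line gets the SYS prefix
        for line in src:                 # iterates lazily, no readlines() buffer
            dst.write(amendtext + line)
        tmp_name = dst.name
    os.replace(tmp_name, txt_file)       # atomic swap on the same filesystem
```

Keeping the temp file in the same directory lets `os.replace` swap it in without a cross-device copy.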
Kokokoko
  • 1
    Possible duplicate of [Process very large (>20GB) text file line by line](https://stackoverflow.com/questions/16669428/process-very-large-20gb-text-file-line-by-line) – sahasrara62 Oct 09 '19 at 13:58
  • This is a nice recommendation, but it does not directly solve my issue of writing two different strings into the selected lines. – Kokokoko Oct 09 '19 at 14:24
  • This can be helpful: [solution](https://stackoverflow.com/questions/6475328/how-can-i-read-large-text-files-in-python-line-by-line-without-loading-it-into). You need to modify that code accordingly; you cannot always expect a tailor-made solution. – sahasrara62 Oct 09 '19 at 14:36

2 Answers

2

What exactly do you mean by too slow? Does it run in seconds or minutes? For comparison, I ran a similar job on my laptop against a file of over 1 GB with 35,946,689 lines, and it took about 29 s.

I used the third-party `in_place` module to open the file in an edit-in-place mode, instead of a separate read pass and write pass. This eliminates the need to store two copies of the data while working with it.

import in_place  # third-party: pip install in_place

with in_place.InPlace(txt_file) as f:
    for line in f:
        f.write(amendtext + line)

Also, do not run it from an IDE. It can slow down the process and may impose limitations on what you can do.

Update:

I think I understand what's causing the delay in your execution time. In your original code, you executed the conditional check on every iteration while looping through the file's contents.
In your updated code, you now open the file for reading and writing four times and store all of its content. Here's updated code that handles modifying the first line without any conditional checks:

with in_place.InPlace(txt_file) as f:
    f.write(systext + f.readline())
    for line in f:
        f.write(amendtext + line)

The first line inside the `with` block reads the first line of your text file, prepends `systext`, and writes it out.
At that point the iterator has advanced past the first line, so the loop processes the rest of the file as usual.
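For readers without the third-party `in_place` package, the standard library's `fileinput` module can approximate the same edit-in-place pattern: with `inplace=True`, anything printed inside the loop is redirected back into the file. A sketch (the file name and sample rows are made up):

```python
import fileinput

systext = "|   SYS"
amendtext = "|  XXX"

# tiny made-up sample file
with open('cars.txt', 'w') as f:
    f.write('| name | car |\n| Paul |Buick|\n')

# inplace=True moves the original aside and redirects print() into the file
with fileinput.input('cars.txt', inplace=True) as f:
    for line in f:
        print((systext if f.isfirstline() else amendtext) + line, end='')
```

`isfirstline()` replaces the explicit `enumerate` counter for the header check.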

slybloty
  • I have tested both the `with open` method and `in_place`, and they are both blazing fast. The issue is apparently in the line `for i,line in enumerate(get_all,1):` and the enumeration of the lines. – Kokokoko Oct 10 '19 at 09:07
  • 1
    @Kokokoko please see my updated answer based on your comment. – slybloty Oct 10 '19 at 15:20
  • 1
    I did it my way, but I tried yours as well and it works nicely. Thank you for introducing me to in_place. – Kokokoko Oct 11 '19 at 08:15
1

In the end, the enumerate method was not good for reading such a big file line by line while numbering the lines. I used the readlines method instead. Now I am reading the file in two separate slices, then rewriting and appending the file with the prepended strings.

import os
import glob
from pathlib import Path

cwd = 'C:\\Users\\EricClapton\\'

directory = cwd

txt_files = os.path.join(directory, '*.txt')

for txt_file in glob.glob(txt_files):
    cpath = Path(txt_file).resolve().stem

    nametxt = cpath.split('_')[0]     # part of the file name before the first underscore
    amendtext = "|  " + nametxt
    systext = "|   SYS"

    with open(txt_file, 'r', errors='ignore') as f:
        get_all = f.readlines()[:1]       # header line only

    with open(txt_file, 'r', errors='ignore') as s:
        get_itdone = s.readlines()[1:]    # everything after the header

    with open(txt_file, 'w') as k:
        for line in get_all:
            k.write(systext + line)

    with open(txt_file, 'a+') as a:
        for line in get_itdone:
            a.write(amendtext + line)
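Note that `readlines()[1:]` still loads the entire file into memory before slicing. If memory is the real constraint, the same header/body split can be done in one streaming pass. A minimal sketch (the file names and sample rows here are made up, and output goes to a separate file rather than rewriting in place):

```python
systext = "|   SYS"
amendtext = "|  XXX"

# made-up sample input
with open('input.txt', 'w') as f:
    f.write('| name | car |\n| Paul |Buick|\n|Ringo |WV   |\n')

with open('input.txt', 'r', errors='ignore') as src, open('output.txt', 'w') as dst:
    dst.write(systext + next(src))   # header line gets the SYS prefix
    for line in src:                 # remaining lines stream lazily
        dst.write(amendtext + line)
```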
Kokokoko