
I am using the Python function below to make a copy of a file, which I am processing as part of data ingestion in Azure Data Factory pipelines. This works for small files, but fails to process huge files without returning any errors. On calling this function for a 2.2 GB file, it stops execution after writing 107 KB of data, without throwing any exceptions. Can anyone point out what the issue could be here?

with open(Temp_File_Name, encoding='ISO-8859-1') as a, open(Load_File_Name, 'w') as b:
    for line in a:
        if ' blah blah ' in line:
            var = line[34:42]
        data = line[0:120] + var + '\n'
        b.write(data)

The input and output locations I have used here are files in Azure Blob Storage. I am following this approach as I need to read each line and perform some operation after reading it.

dileepVikram
  • Not 100% sure what you're doing, but you may have issues with `var` retaining its value between lines, and also with `IndexError`s if your slicing goes out of bounds within some line. Suggest some `try`/`except`s around the slicing and an `else` clause to the `if` which resets `var = ""`. – Cai Nov 02 '22 at 10:24
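
For illustration, a minimal sketch of what that suggestion could look like (note that in Python an out-of-range string slice does not raise `IndexError` but silently returns a shorter string, so a length check stands in for the suggested `try`/`except`; whether `var` should really be reset between lines depends on the intended behaviour):

with open(Temp_File_Name, encoding='ISO-8859-1') as a, open(Load_File_Name, 'w') as b:
    var = ''  # give var a defined value before the first matching line
    for line in a:
        if ' blah blah ' in line and len(line) >= 42:
            var = line[34:42]  # only set on a matching, long-enough line
        else:
            var = ''           # reset so a stale value does not leak into later lines
        data = line[0:120] + var + '\n'
        b.write(data)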

2 Answers


Your snippet wastes compute time by loading the file into RAM and writing it back to disk. Use the underlying OS to copy the file with Python's pathlib and shutil.

Have a look at this Stack Overflow post. Here's the top answer:

import pathlib
import shutil

my_file = pathlib.Path('/etc/hosts')
to_file = pathlib.Path('/tmp/foo')

shutil.copy(str(my_file), str(to_file))  # For older Python.
shutil.copy(my_file, to_file)  # For newer Python.
DannyDannyDanny
  • But what would be the best approach if I need to do line-by-line copying? This is because I have to perform some operation after reading each line. – dileepVikram Mar 17 '21 at 17:24
  • Can you supply some more information? What operation are you performing on each line? What file format does the infile have? – DannyDannyDanny Mar 18 '21 at 09:28
  • I am checking each line to see if it has a particular string in it. If yes, I do a substring operation and append that substring to each line before writing to the new file. – dileepVikram Mar 18 '21 at 09:57
  • Could you edit the question and add the definition of `isslice`? – DannyDannyDanny Mar 18 '21 at 10:01
  • I see. Perhaps adding read-only mode with `'r'` might speed it up: `open(Temp_File_Name, 'r', encoding='ISO-8859-1')`. If you only need the lines in the source file that match the pattern, you could reduce the input file using regex. – DannyDannyDanny Mar 18 '21 at 14:37
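
For reference, a sketch of how that line-by-line pass might be organised (`Temp_File_Name` and `Load_File_Name` are the question's names; the explicit `buffering` sizes and batched `writelines` are assumptions aimed at cutting per-line I/O overhead, not part of the original code):

BATCH = 10_000  # flush every 10,000 lines; tune to taste

with open(Temp_File_Name, 'r', encoding='ISO-8859-1', buffering=1024 * 1024) as src, \
     open(Load_File_Name, 'w', buffering=1024 * 1024) as dst:
    batch = []
    var = ''
    for line in src:
        if ' blah blah ' in line:
            var = line[34:42]
        batch.append(line[0:120] + var + '\n')
        if len(batch) >= BATCH:
            dst.writelines(batch)
            batch.clear()
    dst.writelines(batch)  # write any remaining lines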

You can use `os` and `rsync`.

The `--no-whole-file` (or `--no-W`) parameter uses block-level syncing instead of file-level syncing.

`--progress` is used to log the progress of the file transfer.

You can also redirect the output to `file_name.log` to collect the logs in that file instead of printing them to the terminal; the file is saved in the current directory from which you run the program.

For example:

import os

# Basic invocation; progress and stats go to the terminal
os.system("rsync -r --progress --no-W --stats 'Source path' 'Destination Path'")

# Redirect the logs to file_name.log instead of the terminal
os.system("rsync -r --progress --no-W --stats 'Source path' 'Destination Path' > file_name.log")