
I am using the Python function below to make a copy of a file, which I am processing as part of data ingestion in Azure Data Factory pipelines. This works for small files, but fails to process huge files without returning any errors. On calling this function for a 2.2 GB file, it stops execution after writing 107 KB of data, without throwing any exceptions. Can anyone point out what the issue could be here?

with open(Temp_File_Name, encoding='ISO-8859-1') as a, open(Load_File_Name, 'w') as b:
    for line in a:
        if ' blah blah ' in line:
            var = line[34:42]
        data = line[0:120] + var + '\n'
        b.write(data)

The input and output locations I have used here are files in Azure Blob Storage. I am following this approach as I need to read each line and perform some operation after reading it.

dileepVikram
  • Not 100% sure what you're doing, but you may have issues with `var` retaining its value between lines, and also with `IndexError`s if your slicing goes out of bounds within some line. Suggest some `try`/`except`s around the slicing and an `else` clause to the `if` which resets `var = ""`. – Cai Nov 02 '22 at 10:24
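
For illustration, a minimal sketch of what that suggestion could look like (note that in Python an out-of-range string slice does not raise `IndexError` but silently returns a shorter string, so a length check stands in for the suggested `try`/`except`; whether `var` should really be reset between lines depends on the intended behaviour):

with open(Temp_File_Name, encoding='ISO-8859-1') as a, open(Load_File_Name, 'w') as b:
    var = ''  # give var a defined value before the first matching line
    for line in a:
        if ' blah blah ' in line and len(line) >= 42:
            var = line[34:42]  # only set on a matching, long-enough line
        else:
            var = ''           # reset so a stale value does not leak into later lines
        data = line[0:120] + var + '\n'
        b.write(data)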

2 Answers


Your snippet wastes compute time by loading the file into RAM and writing it back to disk. Use the underlying OS to copy the file with Python's pathlib and shutil.

Have a look at this Stack Overflow post. Here's the top answer:

import pathlib
import shutil

my_file = pathlib.Path('/etc/hosts')
to_file = pathlib.Path('/tmp/foo')

shutil.copy(str(my_file), str(to_file))  # For older Python.
shutil.copy(my_file, to_file)  # For newer Python.
DannyDannyDanny
  • But what would be the best approach if I need to do line-by-line copying? This is because I have to perform some operation after reading each line. – dileepVikram Mar 17 '21 at 17:24
  • Can you supply some more information? What operation are you performing on each line? What file format does the infile have? – DannyDannyDanny Mar 18 '21 at 09:28
  • I am checking each line to see if it has a particular string in it. If yes, I do a substring operation and append that substring to each line before writing to the new file. – dileepVikram Mar 18 '21 at 09:57
  • Could you edit the question and add the definition of `isslice`? – DannyDannyDanny Mar 18 '21 at 10:01
  • I see. Perhaps adding read-only mode with `'r'` might speed it up: `open(Temp_File_Name, 'r', encoding='ISO-8859-1')`. If you only need the lines in the source file that match the pattern, you could reduce the input file using regex. – DannyDannyDanny Mar 18 '21 at 14:37
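
For reference, a sketch of how that line-by-line pass might be organised (`Temp_File_Name` and `Load_File_Name` are the question's names; the explicit `buffering` sizes and batched `writelines` are assumptions aimed at cutting per-line I/O overhead, not part of the original code):

BATCH = 10_000  # flush every 10,000 lines; tune to taste

with open(Temp_File_Name, 'r', encoding='ISO-8859-1', buffering=1024 * 1024) as src, \
     open(Load_File_Name, 'w', buffering=1024 * 1024) as dst:
    batch = []
    var = ''
    for line in src:
        if ' blah blah ' in line:
            var = line[34:42]
        batch.append(line[0:120] + var + '\n')
        if len(batch) >= BATCH:
            dst.writelines(batch)
            batch.clear()
    dst.writelines(batch)  # write any remaining lines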

You can use `os` and `rsync`.

The `--no-whole-file` (or `--no-W`) parameter uses block-level syncing instead of file-level syncing.

`--progress` is used to log the progress of the file transfer.

You can also redirect the output to `file_name.log` to collect the logs in that file instead of printing them to the terminal; the file is saved in the current directory from which you run the program.

For example:

import os

# Basic invocation; progress and stats go to the terminal
os.system("rsync -r --progress --no-W --stats 'Source path' 'Destination Path'")

# Redirect the logs to file_name.log instead of the terminal
os.system("rsync -r --progress --no-W --stats 'Source path' 'Destination Path' > file_name.log")