I have been searching for a solution to this and haven't been able to find one. I have a directory of folders, each containing multiple very large CSV files. I'm looping through each CSV in each folder to replace the values of certain headers. I need the headers to be consistent from file to file so that a different script can process all the data properly.

I found this solution that I thought would work: change first line of a file in python.

However, this is not working as expected. My code:

        from_file = open(filepath)
            # for line in f:
            #     if
        data = from_file.readline()
            # print(data)
        # with open(filepath, "w") as f:
        print 'DBG: replacing in file', filepath
            # s = s.replace(search_pattern, replacement)
        for i in range(len(search_pattern)):
            data = re.sub(search_pattern[i], replacement[i], data)
            # data = re.sub(search_pattern, replacement, data)
        to_file = open(filepath, mode="w")
        to_file.write(data)
        shutil.copyfileobj(from_file, to_file)

I want to replace the header values in search_pattern with the values in replacement without saving or writing to a different file: I want to modify the file in place. I have also tried

        shutil.copyfileobj(from_file, to_file, -1)

As I understand it, that should copy the whole file at once rather than breaking it up into chunks, but it doesn't seem to have any effect on my output. Is it possible that the CSV is just too big?

I haven't been able to determine a different way to do this or make this way work. Any help would be greatly appreciated!

Jean-François Fabre
  • don't open files for reading AND writing at the same time! write in another file, then once copied, delete the original file and rename the new one to the original name. – Jean-François Fabre Jun 01 '18 at 19:39
  • and if you don't want to do that, well, you can, using read/write mode, but the replacement pattern must have the same size as the search pattern. read/write mode doesn't work well with text files like CSV. If both patterns have the same size, that would be _very_ fast because the rest of the file isn't even copied. – Jean-François Fabre Jun 01 '18 at 19:40
  • what OS are you on? cos this solution doesn't work in Windows. – Jean-François Fabre Jun 01 '18 at 19:46
  • the replacement and the search pattern are the same size. I'll look into writing in another file, deleting the original, then renaming. Thanks for the suggestion. Well that's probably why then... I'm in Windows. – Chris Webber Jun 01 '18 at 19:47
  • Are you sure this isn't an XY problem? If you continually need to do some preprocessing/transformation with the header row, then why not store it in a separate file, read it first, do that preprocessing, then read the csv with your computed header. – smci Jun 01 '18 at 20:26

1 Answer

The answer you copied from "change first line of a file in python" doesn't work on Windows.

On Linux, you can open a file for reading and writing at the same time. The system ensures that there's no conflict, but behind the scenes, two different file objects are being handled. This method is also very unsafe: if the program crashes while reading/writing (power off, disk full...), the file is very likely to end up truncated or corrupt.

Anyway, on Windows, you cannot open a file for reading and writing at the same time using two handles. Doing so just destroys the contents of the file.

So there are two options, both portable and safe:

  1. Create a new file in the same directory, copy the data into it, then delete the original file and rename the new one to the original name.

Like this:

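import os
import shutil

filepath = "test.txt"

with open(filepath) as from_file, open(filepath+".new","w") as to_file:
    # read (and discard) the original first line
    data = from_file.readline()
    # write the replacement first line
    to_file.write("something else\n")
    # copy the rest of the file, chunk by chunk
    shutil.copyfileobj(from_file, to_file)
# both files are closed at this point, so the original can be deleted
os.remove(filepath)
os.rename(filepath+".new",filepath)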

This doesn't take much longer, because the rename operation is instantaneous. Besides, if the program/computer crashes at any point, one of the files (old or new) is valid, so it's safe.
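Applied to the header replacement from the question, option 1 could look like the sketch below. The function name `rewrite_header` and its parameters are hypothetical, and the code assumes Python 3 and files readable with the default text encoding. It also swaps the `os.remove` + `os.rename` pair for a single `os.replace` call, which overwrites the destination on both Linux and Windows.

```python
import os
import re
import shutil


def rewrite_header(filepath, search_patterns, replacements):
    """Rewrite the first line of `filepath` through a temporary sibling file.

    Hypothetical helper: the name and the parameters are illustrative only.
    """
    tmp_path = filepath + ".new"
    # newline="" keeps the file's own line endings (\n or \r\n) unchanged.
    with open(filepath, newline="") as from_file, \
            open(tmp_path, "w", newline="") as to_file:
        header = from_file.readline()
        # Apply each regex/replacement pair to the header line only.
        for pattern, replacement in zip(search_patterns, replacements):
            header = re.sub(pattern, replacement, header)
        to_file.write(header)
        # Copy the remaining lines in chunks, so a huge CSV is never
        # loaded into memory all at once.
        shutil.copyfileobj(from_file, to_file)
    # Both handles are closed here, so the swap is allowed on Windows too.
    os.replace(tmp_path, filepath)
```

A call such as `rewrite_header("data.csv", [r"Temp \(C\)"], ["temp_c"])` (file name and patterns made up for illustration) would rename one column and leave every data row byte-for-byte identical.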

  2. If the patterns have the same length, use read/write mode.

Like this:

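filepath = "test.txt"

with open(filepath,"r+") as rw_file:
    # read the first line (this moves the file position past it)
    data = rw_file.readline()
    # build a replacement of exactly the same length
    data = "h"*(len(data)-1) + "\n"
    # rewind, then overwrite the first line in place
    rw_file.seek(0)
    rw_file.write(data)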

Here we read the first line, replace it with the same number of h characters, rewind the file and write the line back, overwriting the previous contents and keeping the rest of the lines. This is also safe, and even if the file is huge, it's very fast. The only constraint is that the replacement must be exactly the same size as the original: if it is shorter, remainders of the previous data are left behind; if it is longer, it overwrites the next line(s), since no data is shifted.
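The same-size constraint can be checked before anything is written. The sketch below is one way to do that, not the only one: `overwrite_header_in_place` is a hypothetical name, UTF-8 is an assumed encoding, and the file is opened in binary mode so that lengths are compared in bytes and the existing line ending is kept as-is.

```python
def overwrite_header_in_place(filepath, new_header, encoding="utf-8"):
    """Overwrite the first line only if the new header has the same byte length.

    Hypothetical helper: the name, parameters and default encoding are
    assumptions made for this example.
    """
    with open(filepath, "rb+") as rw_file:
        old_line = rw_file.readline()
        # Split the old line into its text and its line ending (\n or \r\n).
        stripped = old_line.rstrip(b"\r\n")
        ending = old_line[len(stripped):]
        new_line = new_header.encode(encoding) + ending
        if len(new_line) != len(old_line):
            # Writing would leave stale bytes behind or clobber the next
            # line, so refuse and leave the file untouched.
            raise ValueError(
                "new header is %d bytes, old header is %d bytes"
                % (len(new_line), len(old_line))
            )
        # Rewind and overwrite; the rest of the file is never read or copied.
        rw_file.seek(0)
        rw_file.write(new_line)
```

If the lengths differ, the function raises `ValueError` before touching the file, at which point the copy-and-rename approach from option 1 is the fallback.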

Jean-François Fabre
  • Jean-Francois, that is a fantastic answer and incredibly helpful. Thank you! I may end up trying both to see which one is actually faster. – Chris Webber Jun 01 '18 at 21:16
  • my pleasure was writing it :) no kidding. For a fixed pattern, of course, the second version will be blazingly fast since it doesn't even _touch_ the rest of the data. – Jean-François Fabre Jun 01 '18 at 21:21