I have a very large CSV file that I want to import straight into PostgreSQL with COPY. For that, the CSV column headers need to match the database column names, so I need to do a simple string replace on the first line of the very large file.

There are many answers on how to do that, like:

All the answers imply creating a copy of the large file or using file-system level solutions that access the entire file, although only the first line is relevant. That makes all solutions slow and seemingly overkill.

What is the underlying cause that makes this simple job so hard? Is it file-system related?

Hauke
  • It is related to the operating system. It has nothing to do with Python specifically. – Reut Sharabani Apr 10 '19 at 12:14
  • Why not use Postgres to solve the issue? Import the file as-is into a temporary table. If you have a pre-existing table, you can then merge the data from the temporary table into the main one (with proper column name mapping). – rdas Apr 10 '19 at 12:15
  • No usual file system (there may be unusual ones) allows inserting or deleting bytes in the middle of a file. – Michael Butscher Apr 10 '19 at 12:16
  • I removed the Python references. – Hauke Apr 14 '19 at 19:16

1 Answer

The underlying cause is that a .csv file is a text file, and a text file has no fixed-size "records" that can be swapped out in place. It has lines of unequal length, stored as one contiguous run of bytes. Changing the first line therefore means reading the file up to the first line break, putting the new text in its place, and then moving all of the remaining data toward the start of the file if the replacement is shorter, or toward the end if it is longer. Ordinary file systems offer no operation that inserts or removes bytes in the middle of a file, so you have two choices: (1) read the entire file into memory so you can do the shift yourself, or (2) read the file line by line and write out a new one.
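Choice (2) could look roughly like the sketch below. It is only an illustration under a few assumptions: the function name `replace_header` is made up for this example, the file is assumed to be UTF-8 with `\n` line endings, and the header is assumed to contain no embedded line breaks. The rest of the file is still copied once, but in large chunks rather than being loaded into memory.

```python
import os
import shutil
import tempfile


def replace_header(path, new_header, chunk_size=1024 * 1024):
    """Rewrite `path` so that its first line becomes `new_header`.

    Hypothetical helper: streams the file into a temporary copy and then
    swaps the copy into place. Assumes UTF-8 text and "\n" line endings.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the original so that the final
    # os.replace() stays on the same file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as dst, open(path, "rb") as src:
            src.readline()  # read and discard the old header line
            dst.write(new_header.encode("utf-8") + b"\n")
            # Copy everything after the header in fixed-size chunks.
            shutil.copyfileobj(src, dst, chunk_size)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary copy if anything went wrong.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A call such as `replace_header("data.csv", "id,name,created_at")` (file name and column names are placeholders) would leave every data row untouched and change only the header, at the cost of one full sequential copy.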

It is easy to add stuff at the end because that doesn't involve displacing what is there already.
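The difference between appending and changing the start of a file can be seen in a small experiment. The following is a minimal sketch on a throwaway temporary file; the function name `demo_overwrite_vs_append` and the sample data are invented for illustration. It appends a row, then writes a longer header at offset 0 and returns the file contents after each step.

```python
import os
import tempfile


def demo_overwrite_vs_append():
    """Return the file contents after an append and after an in-place write."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "wb") as f:
            f.write(b"a,b\n1,2\n3,4\n")

        # Appending only extends the file; existing bytes keep their offsets.
        with open(path, "ab") as f:
            f.write(b"5,6\n")
        with open(path, "rb") as f:
            after_append = f.read()

        # Writing at offset 0 overwrites bytes; it does not insert them.
        # The longer header therefore replaces the rows that followed the
        # old, shorter header instead of pushing them further back.
        with open(path, "r+b") as f:
            f.write(b"col_a,col_b\n")
        with open(path, "rb") as f:
            after_overwrite = f.read()

        return after_append, after_overwrite
    finally:
        os.unlink(path)
```

After the append, the original rows are intact and the new row sits at the end. After the in-place write, the 12-byte header has replaced the first 12 bytes of the file, which here means the old header and the rows `1,2` and `3,4` are gone. That is why a header of a different length forces a rewrite of everything behind it.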

BoarGules