How can I alter the name of a column for a large (60gig) csv file without loading the entire file in memory?

Question

I'm trying to load the Opioid data from the Washington Post, a 60 gig csv, into MySQl. the problem is the very first column header is blank (the rows in that column have numbers, just there is no header), and MySQl gives me an error in MySQL when i try to load it in with the table data import wizard.

Starting...
Prepare Import...
Prepare Import done
Import data file....

Traceback (most recent call last):
  File "/usr/share/mysql-workbench/libraries/workbench/wizard_progress_page_widget.py", line 192, in thread_work
    self.func()
  File "/usr/lib/mysql-workbench/modules/sqlide_power_import_wizard.py", line 125, in start_import
    retval = self.module.start(self.stop)
  File "/usr/lib/mysql-workbench/modules/sqlide_power_import_export_be.py", line 271, in start
    ret = self.start_import()
  File "/usr/lib/mysql-workbench/modules/sqlide_power_import_export_be.py", line 408, in start_import
    if not self.prepare_new_table():
  File "/usr/lib/mysql-workbench/modules/sqlide_power_import_export_be.py", line 237, in prepare_new_table
    self._editor.executeManagementCommand(""" CREATE TABLE %s (%s)""" % (self._table_w_prefix, ", ".join(["`%s` %s" % (col['name'], col["type"]) for col in self._mapping])), 1)
DBError: ("Incorrect column name ''", 1166)
ERROR: Import data file: ("Incorrect column name ''", 1166)
Failed

As you can see at the bottom, the error is the incorrect column name of '', which is the empty column header. but as I cannot load the entire 60gig csv in memory, I'm having trouble altering the column names.

What I want is to be able to edit that very first column to have some name like 'Index' or '#'. Anything so that MySQL will stop rejecting it.

I use python mainly so if anyone can give me a way of altering that column name in python I would be extremely appreciative. This is a problem I come back to every few weeks but so far haven't had any success with. Whatever help you can give me would be great.

I can't remember what I've tried. I've searched through here and given a few things a shot but I can't ever find a solution that exactly matches my problem.

What I want is simply to change the name of that first column so that when I go to load it into MySQl, I don't get that error.

There should be no problem, in theory, with _any_ column being blank. Is it possible that the header for the CSV file is missing a label for the emtpy first column? — Tim Biegeleisen, Aug 31 '19 at 02:15
Yes I'm sorry, i mean the header for that column is blank, the column isn't totally blank, just the header. How would I check and correct that? — jake.csc, Aug 31 '19 at 02:37

score 2 · Accepted Answer · answered Aug 31 '19 at 03:40

2

Open the file; read the first line; modify it; write it to a new file; then read the rest of the lines one at a time and write them to the new file.

with open('old.csv') as f, open('new.csv', 'w') as g:
    hdr = next(f)
    hdr = 'idx' + hdr
    g.write(hdr)
    for line in f:
        g.write(line)

answered Aug 31 '19 at 03:40

wwii

23,232
7
37
77

Well that fixed my problem, but now when I try to load it into mySQl it acts like it all works then doesn't load anything. Givges me no erros but says 0 lines loaded. But at least the column name thing if fixed. thank you. – jake.csc Aug 31 '19 at 19:52
It might be helpful if we saw the first few lines of the file. – GuyStalks Aug 31 '19 at 22:55
@GuyStalks This example actually works. Found out it wasnt loading because of some weird thing with the datetime. When I let it run datetime as an integer it suddenly worked and now its in the process of loading all my data – jake.csc Sep 01 '19 at 21:19
I'm happy for you! And I'm impressed by this code, myself. It's very concise. – GuyStalks Sep 02 '19 at 04:08

GuyStalks · Answer 2 · 2019-08-31T21:17:01.917

I’m sorry you’ve been struggling with this so long!

I agree with Tim that there is probably a setting on your importer that will fix the this, but let’s try to answer your question anyway.

First, there are loads of functions in Python that allow you to interact with a file without loading it all into memory!

For instance

open(file)

simply creates a file object that theoretically is your file but is really, say, a reference to your file.

If I were you, I’d try out the read() and write() functions.

Continued:

Let's imagine you have a file called "test.csv" that has the following data:

test1,test2
a,b

Then, if you want to change the header and copy the rest of the file without loading it all into memory, you can read a line from this file and write it into a new file.

INPUT_FILENAME = "test.csv"
OUTPUT_FILENAME = "test2.csv"


f = open(INPUT_FILENAME, "r")
firstline = f.readline()
print(firstline)
# "test1,test2\n"
altered_firstline = firstline.replace("test2","SomethingElse")
print(altered_firstline)
#"test1,SomethingElse\n"

g = open("OUTPUT_FILENAME", "w")
g.write(altered_firstline)

while True:
    nextline = f.readline()
    if nextline == '':
        break
    else:
        a = g.write(f.readline()) # "a = ..." suppresses stdout

g.close()
f.close()

g = open(OUTPUT_FILENAME, "r")
g.readline()
#'test1, SomethingElse\na,b\n'
g.close()

This took just a couple seconds on my computer with a 0.5Gb file. Hope it works for you!

Note that the code above was inspired by this answer but it differs in that this is specifically for csv files.

what I'm trying right now is to use df.to_csv to try to write a new csv. I'm using a new header that I typed up that has a name for that first column and hoping it works. can you explain how you would use read and write to solve this problem I'm having? — jake.csc, Aug 31 '19 at 02:55

How can I alter the name of a column for a large (60gig) csv file without loading the entire file in memory?

2 Answers2