
I have a huge CSV file, over 250 GB in size. I'd like to replace the characters " and ( with nothing. It feels like it should be pretty straightforward, but the file size means no editor will open it.

I could also do it using Python, with code like the following:

with open(filename) as src:
    text = src.read()
print(text.replace('"', '').replace('(', ''))

But this code requires the entire file to fit in memory.

One option is to write a second file with the unwanted characters replaced. But that would mean having two files of almost the same size on disk, and unfortunately I don't have that much free space on the server.

So is there a way to overwrite the lines and replace characters without creating a new file?

Some example csv lines are:

abc,"('91730', 'd9973')",1
def,"('91210', 'd9943')",1
ghi,"('91670', 'd9293')",1
Patthebug
  • You might want to see this answer, it discusses ways to work with big files without loading all of the content into memory. http://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python – postoronnim Mar 14 '17 at 16:18
  • Can you give some example csv lines? It is not clear why you need to make these changes. – Martin Evans Mar 14 '17 at 17:16
  • @MartinEvans: Question edited. – Patthebug Mar 14 '17 at 17:23
  • It might be easier to just read the file as is and process it as four columns. – Martin Evans Mar 14 '17 at 17:33
  • What can your file contain? Only alpha, digits, commas, single quotes and the unwanted parens and double quote? Only plain ASCII, 8-bit characters, UTF-8 characters? If you are sure that it only contains plain ASCII, you can do it in place, though I would use C for that kind of processing. – Serge Ballesta Mar 14 '17 at 17:47

4 Answers


You can iterate over the file's lines like this:

with open(filename, 'rt') as src:
    for line in src:
        # each line keeps its trailing newline, so suppress print's own
        print(line.replace('"', '').replace('(', ''), end='')

But I would use csv.reader from the csv module.
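
For example, a minimal sketch that streams the rows instead of reading everything at once (assuming Python 3 and a file named input.csv):

import csv

# Only one row is held in memory at a time.
with open('input.csv', newline='') as src:
    for row in csv.reader(src):
        print(row)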

leovp
  • I copied an action from your question, assuming you want to print it and redirect somewhere. You can as easily print to an opened file object. – leovp Mar 14 '17 at 16:20
  • Sorry, that was my bad. Sure, I can create another file and write to it. But that would mean having two files of the same size on the disk. Unfortunately I don't have that much disk space on the server. Is there a way to replace the existing lines in the file? – Patthebug Mar 14 '17 at 17:24
  • No, it's pretty much impossible. You can only replace bytes in a file if you are not changing its size. E.g. replace byte 'DD' with 'FF'. But if you delete a character, all the other characters have to be moved. – leovp Mar 14 '17 at 17:39

As a compromise to creating a second file, you could just replace all the problematic characters with spaces. That way the file stays the same size and doesn't need rewriting. Python's translate() function is fast for doing this:

import string

# Python 2: build a 256-byte table that maps each of ( " ) to a space
table = string.maketrans('(")', '   ')
block_size = 10000000
start_pos = 0

with open('input.csv', 'r+b') as f_input:
    while True:
        f_input.seek(start_pos)
        block = f_input.read(block_size)

        if len(block):
            # rewind to the start of the block and overwrite it in place
            f_input.seek(start_pos)
            f_input.write(block.translate(table))
        else:
            break

        start_pos += block_size

This would give you an output file looking like:

abc,  '91730', 'd9973'  ,1
def,  '91210', 'd9943'  ,1
ghi,  '91670', 'd9293'  ,1

I would, though, recommend that you just process the file "as is" if possible:

import csv
import re

with open('input.csv', 'rb') as f_input:
    for row in csv.reader(f_input):
        # pull the two quoted values out of the second column
        data = re.match(r"\('(.*?)', '(.*?)'", row[1]).groups()
        row[1] = data[0]
        row.insert(1, data[1])
        print row

For your data this would display:

['abc', 'd9973', '91730', '1']
['def', 'd9943', '91210', '1']
['ghi', 'd9293', '91670', '1']
Martin Evans
  • Nice answer and code. Feel free to add a second example with removing the chars as I explained in my answer (then I'll remove mine). My first thought was to replace with spaces too, but some csv parsers / programs can have problems with spaces between the separators and the values. – Danny_ds Mar 14 '17 at 18:58
  • `skipinitialspace=True` can be used with Python's csv library. – Martin Evans Mar 14 '17 at 19:01
  • It is probably fine, but I would not rely too much on `seek` for a 2 GB+ file in 32-bit Python 2.7. The doc says that it should `[s]et the file’s current position, like stdio‘s fseek()`. And on a 32-bit system, the `fseek` offset is limited to 32 bits. I could not find any reference for Python 3 saying that it supports large files, but the explicit reference to `fseek` has been removed from the doc. – Serge Ballesta Mar 15 '17 at 07:17

If your only option is to edit the file in place, you can do the following:

  • open the file in binary mode
  • read a block of data into a buffer (say 4096 bytes, a typical page size)
  • remove the characters from that buffer, or write that buffer byte by byte to a second buffer, skipping the unwanted characters
  • then write that second buffer to the same open file, after repositioning the file pointer with seek() (and of course write only the new, smaller size, not the full 4096 bytes)
  • keep repeating until the end of the file, then shrink the file (set the new file size) to the size of the newly written data

So you’ll have to keep track of two file positions: the current read position and the current write position in the file, and reposition the file pointer each time you read or write.

This will also work reading and writing one byte at a time, but I don’t know how well Python buffers the data, so it could be slower.
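
A rough, untested Python 3 sketch of that block-based loop (the write position can never overtake the read position, because the filtered block is never longer than the block that was read):

with open('input.csv', 'r+b') as f:
    read_pos = write_pos = 0
    while True:
        f.seek(read_pos)
        block = f.read(4096)
        if not block:
            break
        read_pos = f.tell()

        # translate() with a delete argument drops the unwanted bytes
        filtered = block.translate(None, b'"(')

        f.seek(write_pos)
        f.write(filtered)
        write_pos = f.tell()

    f.truncate(write_pos)  # shrink the file to the newly written size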

An alternative to the buffers is to use memory mapping.
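
Again only a rough sketch, and it assumes a 64-bit build, since the whole file has to fit in the address space to be mapped:

import mmap

with open('input.csv', 'r+b') as f:
    with mmap.mmap(f.fileno(), 0) as mm:
        write_pos = 0
        chunk = 1 << 20  # process 1 MiB at a time
        for read_pos in range(0, len(mm), chunk):
            filtered = mm[read_pos:read_pos + chunk].translate(None, b'"(')
            mm[write_pos:write_pos + len(filtered)] = filtered
            write_pos += len(filtered)
    # the mapping is closed before the file is resized
    f.truncate(write_pos)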

The sketches above are untested, since I don’t have Python at hand (and I don’t know Python so well).

But make sure you do some smaller tests first, because you won’t have a copy of the original file left in case of problems.

For an example of reading binary files see this question.

Danny_ds

Unless you use a 64-bit version of Python, I would not rely on seek being able to position a pointer beyond 2 or 4 GB. I'm pretty sure it cannot work on 32-bit Python 2, because the standard library doc says (emphasis mine):

file.seek(offset[, whence]): Set the file’s current position, like stdio's fseek().

And on a 32-bit system, fseek only takes a 32-bit argument... Anyway, seek is probably safe in Python 3, because its integers are arbitrary-precision, and the reference to stdio's fseek has been removed from the documentation, but I strongly advise you to double-check it...

So I would try to open the file twice, once in "rb" mode to get a read pointer on it, and once in "r+b" mode to get a write pointer on it. Here again it may or may not work depending on the OS, but many allow a single process to have multiple file descriptors on the same file. The code is not much different from @MartinEvans's answer, for Python 2:

import string

table = string.maketrans('(")', '   ')
block_size = 10000000

# The two handles keep independent file offsets: the read pointer and the
# write pointer each advance on their own, so no seeking is needed.
with open('input.csv', 'rb') as f_input, open('input.csv', 'r+b') as f_output:
    while True:
        block = f_input.read(block_size)
        if not block:
            break
        f_output.write(block.translate(table))
Serge Ballesta