
I have a ~220 million row, 7 column csv file. I need to remove row 2636759. This file is 7.7GB, more than will fit in memory. I'm most familiar with R, but could also do this in python or bash.

I can't read or write this file in one operation. What is the best way to build this file incrementally on disk, instead of trying to do this all in memory?

I've tried to find this on SO but have only been able to find how to do this with files that are small enough to read/write in memory, or with rows that are at the beginning of the file.

hedgedandlevered
  • thanks, but I already tried pandas :-/ tons of errors, but here's two, for example `File "pandas/parser.pyx", line 788, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8244)` `File "pandas/parser.pyx", line 1833, in pandas.parser.raise_parser_error (pandas/parser.c:22649) pandas.parser.CParserError: Error tokenizing data. C error: Expected 7 fields in line 2636759, saw 8` – hedgedandlevered Apr 21 '16 at 20:20
  • sure, I could do 100,000 rows at a time even... how? – hedgedandlevered Apr 21 '16 at 20:23
  • Check this solution from another question. It modifies the file in place so should be fast. http://stackoverflow.com/a/2330081/4190526 – Xiongbing Jin Apr 21 '16 at 20:28
  • maybe `head -2636760 file.csv | tail -1` (in bash) – jaimedash Apr 21 '16 at 20:33
  • can I pipe that into a new file like `head -2636760 file.csv | tail -1 > newfile.csv`? asking first because I'm unfamiliar with bash and this is going to take forever to run – hedgedandlevered Apr 21 '16 at 20:37
  • wait! I may have misunderstood the question as 'extract line 2636759 and do something with it' not 'create a new csv with that line removed'. bash line does the first one – jaimedash Apr 21 '16 at 20:56
  • Yes, I need the first 2636758 lines, then 2636760 until the end. I can do the first one with head, but how do you start in the middle? – hedgedandlevered Apr 21 '16 at 20:57
  • ah... answering my own question, you'd just use tail – hedgedandlevered Apr 21 '16 at 21:04
  • sed is a better option for this in the shell -- see my answer – jaimedash Apr 21 '16 at 21:09
  • Have you tried `awk 'NR!=2636759' currentFile > newFile`? – Mark Setchell Apr 21 '16 at 21:58
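
If you do want to stay with pandas as attempted in the comments, a rough sketch of the chunked idea could look like the following (the file names and the 100,000-row chunk size are only illustrative, and this is not from the question or any answer). `skiprows` takes 0-indexed physical line numbers, so the 1-based line 2636759 becomes 2636758; the assumption is that skipping it this way also avoids the tokenizing error quoted above. Note that pandas re-parses and re-writes every field, so the plain streaming answers below preserve the original text more faithfully.

import pandas as pd

# stream the file in chunks, skipping the single bad physical line (0-indexed)
reader = pd.read_csv('file.csv', chunksize=100000, skiprows=[2636758],
                     header=None,   # treat every line as data; adjust if there is a header
                     dtype=str)     # keep fields as text to avoid changing their representation

with open('fixed.csv', 'w') as out:
    for chunk in reader:
        chunk.to_csv(out, header=False, index=False)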

3 Answers


A Python solution:

import os
with open('tmp.csv','w') as tmp:
    with open('file.csv','r') as infile:
        for linenumber, line in enumerate(infile):
            # enumerate is 0-based, so to drop 1-based file line N compare
            # against N - 1 (e.g. 2636758 for the question's row 2636759)
            if linenumber != 10234:
                tmp.write(line)

# copy back to original file. You can skip this if you don't
# mind (or prefer) having both files lying around           
with open('tmp.csv','r') as tmp:
    with open('file.csv','w') as out:
        for line in tmp:
            out.write(line)

os.remove('tmp.csv') # remove the temporary file

This duplicates the data, which may not be optimal if disk space is an issue. An in-place rewrite would be more complicated without first loading the whole file into RAM.


The key is that Python naturally supports handling files as iterables. This means the file can be lazily evaluated, and you never need to hold the entire thing in memory at one time.
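
To make that concrete (a small sketch, not part of the original answer), the same filter can be written as a generator expression fed straight to `writelines`; the generator is consumed lazily, so this still streams one line at a time:

with open('file.csv') as infile, open('tmp.csv', 'w') as tmp:
    # nothing is materialized in memory; lines flow straight from infile to tmp
    tmp.writelines(line for i, line in enumerate(infile) if i != 10234)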


I like this solution, if your primary concern isn't raw speed, because you can replace the test `linenumber != VALUE` with any conditional, for example, filtering out lines that include a particular date:

# here test() is the keep-condition: drop any line mentioning a particular month
test = lambda line: 'NOVEMBER' not in line

with open('tmp.csv','w') as tmp:
    with open('file.csv','r') as infile:
        for line in infile:
            if test(line):
                tmp.write(line)

In-place rewrites and memory-mapped file objects (which may be considerably faster) are going to require much more bookkeeping.
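
Just to illustrate what that extra bookkeeping might involve (a sketch only, not from this answer, and the function name is made up): with `mmap` you can locate the byte range of the unwanted line, slide the rest of the file left over it, and truncate. This assumes a 64-bit Python, a valid line number, and that every line ends with a newline; it edits the file in place with no backup.

import mmap

def remove_line_in_place(path, bad_lineno):
    """Drop 1-based line `bad_lineno` from the file at `path`, in place."""
    with open(path, 'r+b') as f:
        mm = mmap.mmap(f.fileno(), 0)
        start = 0
        for _ in range(bad_lineno - 1):           # byte offset where the bad line starts
            start = mm.find(b'\n', start) + 1
        end = mm.find(b'\n', start) + 1            # one past the bad line's trailing newline
        new_size = mm.size() - (end - start)
        mm.move(start, end, mm.size() - end)       # shift everything after the bad line left
        mm.flush()
        mm.close()
        f.truncate(new_size)                       # chop off the leftover bytes at the end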

en_Knight

Use `sed '2636759d' file.csv > fixedfile.csv`.

As a test, for a 40,001-line, 1.3 GB csv, removing line 40,000 this way takes 0m35.710s. The guts of the Python solution from @en_Knight (just stripping the line and writing to a temp file) is ~2 seconds faster for this same file.

Edit: OK, sed (or some implementations) may not work (based on feedback from the questioner).

In plain bash, to remove row n from a file of N rows, file.csv, you can do `head -[n-1] file.csv > file_fixed.csv` and then `tail -[N-n] file.csv >> file_fixed.csv` (in both of these, the expression in brackets is replaced by a plain number).

To do this, though, you need to know N. The Python solution is better...

jaimedash
    `sed: couldn't write 35 items to stdout: No space left on device`. This is on a cloud drive with tons of storage so it's doing something in memory, it appears – hedgedandlevered Apr 21 '16 at 21:14
  • @hedgedandlevered hmm. actually are you sure it's not a storage issue? I just re-ran my above tests and sed does seem to use streaming -- memory usage never got above 600kb! lots of disk I/O though – jaimedash Apr 21 '16 at 23:05

You can also do this in R by reading one line at a time with `readLines` and writing each line except one with `writeLines`.

For a solution similar to the one in Python:

con <- file('test.csv', 'r')
out_con <- file('tmp.csv', 'w')
bad_line <- 2636759
ctr <- 1
while (length(line <- readLines(con, n=1, warn=FALSE)) > 0){
  if (ctr != bad_line)
    writeLines(line, out_con)
  ctr <- ctr + 1
}
close(con)      # close both connections so the output is flushed to disk
close(out_con)

but note this is MUCH slower than Python (it takes ~3m40s on the 40,000-line file where sed took ~30s). Possibly increasing the chunk size `n` would speed things up, but sed or Python seem much easier.

jaimedash