
I have a ~220 million row, 7 column csv file. I need to remove row 2636759. This file is 7.7GB, more than will fit in memory. I'm most familiar with R, but could also do this in python or bash.

I can't read or write this file in one operation. What is the best way to build this file incrementally on disk, instead of trying to do this all in memory?

I've tried to find this on SO but have only been able to find how to do this with files that are small enough to read/write in memory, or with rows that are at the beginning of the file.

hedgedandlevered
  • thanks, but I already tried pandas :-/ tons of errors, but here's two, for example `File "pandas/parser.pyx", line 788, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8244)` `File "pandas/parser.pyx", line 1833, in pandas.parser.raise_parser_error (pandas/parser.c:22649) pandas.parser.CParserError: Error tokenizing data. C error: Expected 7 fields in line 2636759, saw 8` – hedgedandlevered Apr 21 '16 at 20:20
  • sure, I could do 100,000 rows at a time even... how? – hedgedandlevered Apr 21 '16 at 20:23
  • Check this solution from another question. It modifies the file in place so should be fast. http://stackoverflow.com/a/2330081/4190526 – Xiongbing Jin Apr 21 '16 at 20:28
  • maybe `head -2636760 file.csv | tail -1` (in bash) – jaimedash Apr 21 '16 at 20:33
  • can I pipe that into a new file like `head -2636760 file.csv | tail -1 > newfile.csv`? asking first because I'm unfamiliar with bash and this is going to take forever to run – hedgedandlevered Apr 21 '16 at 20:37
  • wait! I may have misunderstood the question as 'extract line 2636759 and do something with it' not 'create a new csv with that line removed'. bash line does the first one – jaimedash Apr 21 '16 at 20:56
  • Yes, I need the first 2636758 lines, then 2636760 until the end. I can do the first one with head, but how do you start in the middle? – hedgedandlevered Apr 21 '16 at 20:57
  • ah... answering my own question, you'd just use tail – hedgedandlevered Apr 21 '16 at 21:04
  • sed is a better option for this in the shell -- see my answer – jaimedash Apr 21 '16 at 21:09
  • Have you tried `awk 'NR!=2636759' currentFile > newFile`? – Mark Setchell Apr 21 '16 at 21:58
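
If you do want to stay with pandas as attempted in the comments, a rough sketch of the chunked idea could look like the following (the file names and the 100,000-row chunk size are only illustrative, and this is not from the question or any answer). `skiprows` takes 0-indexed physical line numbers, so the 1-based line 2636759 becomes 2636758; the assumption is that skipping it this way also avoids the tokenizing error quoted above. Note that pandas re-parses and re-writes every field, so the plain streaming answers below preserve the original text more faithfully.

import pandas as pd

# stream the file in chunks, skipping the single bad physical line (0-indexed)
reader = pd.read_csv('file.csv', chunksize=100000, skiprows=[2636758],
                     header=None,   # treat every line as data; adjust if there is a header
                     dtype=str)     # keep fields as text to avoid changing their representation

with open('fixed.csv', 'w') as out:
    for chunk in reader:
        chunk.to_csv(out, header=False, index=False)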

3 Answers


A Python solution:

import os
with open('tmp.csv','w') as tmp:
    with open('file.csv','r') as infile:
        for linenumber, line in enumerate(infile):
            # enumerate is 0-based, so to drop 1-based file line N compare
            # against N - 1 (e.g. 2636758 for the question's row 2636759)
            if linenumber != 10234:
                tmp.write(line)

# copy back to original file. You can skip this if you don't
# mind (or prefer) having both files lying around           
with open('tmp.csv','r') as tmp:
    with open('file.csv','w') as out:
        for line in tmp:
            out.write(line)

os.remove('tmp.csv') # remove the temporary file

This duplicates the data, which may not be optimal if disk space is an issue. An in-place rewrite would be more complicated without first loading the whole file into RAM.


The key is that Python naturally supports handling files as iterables. This means the file can be lazily evaluated, and you never need to hold the entire thing in memory at one time.
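
To make that concrete (a small sketch, not part of the original answer), the same filter can be written as a generator expression fed straight to `writelines`; the generator is consumed lazily, so this still streams one line at a time:

with open('file.csv') as infile, open('tmp.csv', 'w') as tmp:
    # nothing is materialized in memory; lines flow straight from infile to tmp
    tmp.writelines(line for i, line in enumerate(infile) if i != 10234)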


I like this solution, if your primary concern isn't raw speed, because you can replace the test `linenumber != VALUE` with any conditional, for example, filtering out lines that include a particular date:

# here test() is the keep-condition: drop any line mentioning a particular month
test = lambda line: 'NOVEMBER' not in line

with open('tmp.csv','w') as tmp:
    with open('file.csv','r') as infile:
        for line in infile:
            if test(line):
                tmp.write(line)

In-place rewrites and memory-mapped file objects (which may be considerably faster) are going to require much more bookkeeping.
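
Just to illustrate what that extra bookkeeping might involve (a sketch only, not from this answer, and the function name is made up): with `mmap` you can locate the byte range of the unwanted line, slide the rest of the file left over it, and truncate. This assumes a 64-bit Python, a valid line number, and that every line ends with a newline; it edits the file in place with no backup.

import mmap

def remove_line_in_place(path, bad_lineno):
    """Drop 1-based line `bad_lineno` from the file at `path`, in place."""
    with open(path, 'r+b') as f:
        mm = mmap.mmap(f.fileno(), 0)
        start = 0
        for _ in range(bad_lineno - 1):           # byte offset where the bad line starts
            start = mm.find(b'\n', start) + 1
        end = mm.find(b'\n', start) + 1            # one past the bad line's trailing newline
        new_size = mm.size() - (end - start)
        mm.move(start, end, mm.size() - end)       # shift everything after the bad line left
        mm.flush()
        mm.close()
        f.truncate(new_size)                       # chop off the leftover bytes at the end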

en_Knight

Use `sed '2636759d' file.csv > fixedfile.csv`.

As a test, for a 40,001-line, 1.3 GB csv, removing line 40,000 this way takes 0m35.710s. The guts of the Python solution from @en_Knight (just stripping the line and writing to a temp file) is ~2 seconds faster for this same file.

Edit: OK, sed (or some implementations) may not work (based on feedback from the questioner).

In plain bash, to remove row n from a file of N rows, file.csv, you can do `head -[n-1] file.csv > file_fixed.csv` and then `tail -[N-n] file.csv >> file_fixed.csv` (in both of these, the expression in brackets is replaced by a plain number).

To do this, though, you need to know N. The Python solution is better...

jaimedash
    `sed: couldn't write 35 items to stdout: No space left on device`. This is on a cloud drive with tons of storage so it's doing something in memory, it appears – hedgedandlevered Apr 21 '16 at 21:14
  • @hedgedandlevered hmm. actually are you sure it's not a storage issue? I just re-ran my above tests and sed does seem to use streaming -- memory usage never got above 600kb! lots of disk I/O though – jaimedash Apr 21 '16 at 23:05

You can also do this in R by reading one line at a time with `readLines` and writing each line except one with `writeLines`.

For a solution similar to the one in Python:

con <- file('test.csv', 'r')
out_con <- file('tmp.csv', 'w')
bad_line <- 2636759
ctr <- 1
while (length(line <- readLines(con, n=1, warn=FALSE)) > 0){
  if (ctr != bad_line)
    writeLines(line, out_con)
  ctr <- ctr + 1
}
close(con)      # close both connections so the output is flushed to disk
close(out_con)

but note this is MUCH slower than Python (it takes ~3m40s on the 40,000-line file where sed took ~30s). Possibly increasing the chunk size `n` would speed things up, but sed or Python seem much easier.

jaimedash