My question is not how to open a .csv file, detect which rows I want to omit, and write a new .csv file with my desired lines. I'm already doing that successfully:

import csv, os

def sanitize(filepath):
    """Removes header information, leaving only column names and data.
    Outputs a "sanitized" file next to the input."""
    # Output location: derived from the input path here (an assumption;
    # dirname and newname were set elsewhere in the original script)
    dirname = os.path.dirname(filepath)
    newname = os.path.splitext(os.path.basename(filepath))[0] + '_sanitized'
    with open(filepath, newline='') as unsan, \
         open(os.path.join(dirname, newname + '.csv'), 'w', newline='') as san:
        writer = csv.writer(san)
        line_count = 0
        headingrow = 0
        datarow = 0
        safety = 1  # 1 = outside the data block, 0 = inside it
        for row in csv.reader(unsan, delimiter=','):

            # Detect data start
            if "DATA START" in str(row):
                safety = 0
                headingrow = line_count + 1  # column names follow the marker
                datarow = line_count + 4     # data begins three rows later

            # Detect data end
            if "DATA END" in str(row):
                safety = 1

            # Write the column-name row and every data row
            if safety == 0:
                if line_count == headingrow or line_count >= datarow:
                    writer.writerow(row)

            line_count += 1

I have .csv data files that are megabytes, sometimes gigabytes (up to 4 GB) in size. Out of the roughly 180,000 lines in each file, I only need to omit about 50.

Example pseudo-data (rows I want to keep are indented):

[Header Start]  
...48 lines of header data...  
[Header End]  
Blank Line  
[Data Start]  
     Row with Column Names  
Column Units  
Column Variable Type  
     ...180,000 lines of data...

I understand that I can't edit a .csv file as I iterate over it (Learned here: How to Delete Rows CSV in python). Is there a quicker way to remove the header information from the file, like maybe writing the remaining 180,000 lines as a block instead of iterating through and writing each line?

Maybe one solution would be to append all the data rows to a list of lists and then use writer.writerows(list of lists) instead of writing them one at a time (Batch editing of csv files with Python, https://docs.python.org/3/library/csv.html)? However, wouldn't that mean loading essentially the whole file (up to 4 GB) into my RAM?
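To illustrate the kind of block write I have in mind, here is an untested sketch (sanitize_fast is just a placeholder name): it skips the header as plain text, then bulk-copies the rest of the file in 1 MB chunks instead of parsing and writing row by row. It would still leave the units/type rows and the [Data End] line to clean up afterwards:

import shutil

def sanitize_fast(filepath, outpath):
    # Sketch only: skip the header as plain text, then bulk-copy the rest.
    # This still keeps the units/type rows and the [Data End] line.
    with open(filepath) as src, open(outpath, 'w') as dst:
        for line in src:
            if "DATA START" in line:
                break  # src is now positioned at the column-name row
        shutil.copyfileobj(src, dst, length=1024 * 1024)  # copy in 1 MB chunks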

UPDATE:
I've got a pandas import working, but when I time it, it takes about twice as long as the code above. Specifically, the to_csv portion takes about 10 seconds for a 26 MB file.

import csv
import pandas as pd

filepath = r'input'

with open(filepath, newline='') as unsan:
    line_count = 0
    headingrow = 0
    datarow = 0
    safety = 1

    # Count the rows, then rewind so the scan below starts from the top
    row_count = sum(1 for row in csv.reader(unsan, delimiter=','))
    unsan.seek(0)

    for row in csv.reader(unsan, delimiter=','):

        # Detect data start
        if "DATA START" in str(row):
            safety = 0
            headingrow = line_count + 1
            datarow = line_count + 4

        # Grab the column names, then stop scanning
        if safety == 0:
            if line_count == headingrow:
                colnames = row
                line_count += 1
                break

        line_count += 1

# Skip the 55-line header block plus the trailing "DATA END" row
badrows = [*range(0, 55), row_count - 1]
df = pd.read_csv(filepath, names=colnames, skiprows=badrows, na_filter=False)

df.to_csv(r'output', index=None, header=True)
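If RAM does become the bottleneck, pandas' chunksize parameter should let the output stream out in pieces instead of writing the whole frame at once. An untested sketch, reusing filepath, colnames, and badrows from above (the 10,000-row chunk size is an arbitrary guess):

# Sketch: stream the cleaned data out in 10,000-row chunks so the whole
# file never has to sit in RAM at once.
reader = pd.read_csv(filepath, names=colnames, skiprows=badrows,
                     na_filter=False, chunksize=10000)
with open(r'output', 'w', newline='') as out:
    for i, chunk in enumerate(reader):
        chunk.to_csv(out, index=False, header=(i == 0))  # header only once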

Here's the research I've done:
Deleting rows with Python in a CSV file
https://intellipaat.com/community/18827/how-to-delete-only-one-row-in-csv-with-python
https://www.reddit.com/r/learnpython/comments/7tzbjm/python_csv_cleandelete_row_function_doesnt_work/
https://nitratine.net/blog/post/remove-columns-in-a-csv-file-with-python/
Delete blank rows from CSV?

TempleGuard527
  • Have you read up on pandas? – GiovaniSalazar Nov 19 '19 at 16:40
  • Have you tried using `writer.writerows(list of lists)`? Is it any faster? When you say that your current method is slow, I don't doubt it: iterating over 180,000+ rows in Python is excruciatingly slow. Pandas will load stuff for you in C land, where things are quick. Your operations will be almost instantaneous there (but saving won't be much quicker, I'm afraid). – otoomey Nov 19 '19 at 16:44
  • @GiovaniSalazar how is that relevant here? – juanpa.arrivillaga Nov 19 '19 at 16:44
  • "sometimes gigabytes (up to 4Gb) in size"...with library csv I think is not enough, I would recommend reviewing pandas as an option – GiovaniSalazar Nov 19 '19 at 16:47
  • `pandas.read_csv` has a parameter called `chunksize` where you can specify how many rows you would like to read at once, which also helps if you're running into RAM problems. – Hampus Larsson Nov 19 '19 at 16:47
  • I've tried loading the file directly into pandas, but... I got errors I didn't understand (embarrassment). With the header information, I couldn't figure out how to get pandas to recognize the column names. Once the file has run through the program I copied in the question, I can get it into pandas just fine (because it's only the column names and the data). I actually load it into pandas and do math to detect the data recording rate. – TempleGuard527 Nov 19 '19 at 16:54
  • So a little bit more detail on the pandas import. Here's the error I get: "Error tokenizing data. C error: Expected 1 fields in line 53, saw 25." I'm presuming I get this error because the first 52 rows of the .csv only have data in the first column. After that, the actual test data starts, and it jumps out to 25 columns. I'll start working on how to make pandas happy with that. – TempleGuard527 Nov 19 '19 at 19:00

2 Answers


If it is not important that the file is read in Python, or with a CSV reader/writer, you can use other tools. On *nix you can use sed:

sed -n '/DATA START/,/DATA END/p' myfile.csv > headerless.csv

This will be very fast for millions of lines.

perl is more multi-platform:

perl -ne "print if /DATA START/ .. /DATA END/;" myfile.csv

Note that both commands keep the DATA START and DATA END marker lines themselves; strip those afterwards if they should not appear in the output.
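If you need the same thing on a platform without sed or perl, the range idea is easy to express in plain Python as well; a sketch (extract_data is a made-up name, and like the commands above it keeps the marker lines):

def extract_data(inpath, outpath):
    # Sketch: copy only the lines from "DATA START" through "DATA END",
    # streaming line by line so a multi-GB file never sits in RAM.
    inside = False
    with open(inpath) as src, open(outpath, 'w') as dst:
        for line in src:
            if "DATA START" in line:
                inside = True
            if inside:
                dst.write(line)
            if "DATA END" in line:
                break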

gens

To avoid editing the file, and instead read it, headers and all, straight into Python and then into Pandas, you can wrap the file in your own file-like object.

Given an input file called myfile.csv with this content:

HEADER
HEADER
HEADER
HEADER
HEADER
HEADER

now, some, data
1,2,3
4,5,6
7,8,9

You can read that file in directly using a wrapper class:

import io

class HeaderSkipCsv(io.TextIOBase):
    def __init__(self, filename):
        """ create an iterator from the filename """
        self.data = self.yield_csv(filename)

    def readable(self):
        """ here for compatibility """
        return True

    def yield_csv(self, filename):
        """ open filename and read past the first empty line
        Then yield characters one by one. This reads just one
        line at a time in memory
        """
        with open(filename) as f:
            for line in f:
                if line.strip() == "":
                    break
            for line in f:
                for char in line:
                    yield char

    def read(self, n=None):
        """ called by Pandas with some 'n', this returns
        the next 'n' characters since the last read as a string
        """
        if n is None or n < 0:
            # no size given: drain everything that remains
            return "".join(self.data)
        data = ""
        for _ in range(n):
            try:
                data += next(self.data)
            except StopIteration:
                break
        return data

WANT_PANDAS = True  # set to False to just write the file

if WANT_PANDAS:
    import pandas as pd
    df = pd.read_csv(HeaderSkipCsv('myfile.csv'))
    print(df.head(5))
else:
    with open('myoutfile.csv', 'w') as fo:
        with HeaderSkipCsv('myfile.csv') as fi:
            c = fi.read(1024)
            while c:
                fo.write(c)
                c = fi.read(1024)

which outputs:

   now   some   data
0    1      2      3
1    4      5      6
2    7      8      9

Because Pandas accepts any file-like object, we can provide our own! Pandas calls read on the HeaderSkipCsv object just as it would on a regular file object, and all it cares about is that the call returns valid csv data. Rather than handing Pandas a cleaned-up file, we hand it a file-like object that filters out the data Pandas does not like (i.e. the headers).

The yield_csv generator iterates over the file without reading it all in, so only as much data as Pandas requests is ever loaded into memory. The first for loop in yield_csv advances f past the first empty line. f is a file object, and its position is not reset when a for loop ends while the file remains open. Since the second for loop receives f inside the same with block, it starts consuming at the beginning of the csv data, exactly where the first loop left off.

Another way of writing the first for loop would be

next((line for line in f if line.isspace()), None)

which is more explicit about advancing the file pointer, but arguably harder to read.

Because we skip the lines up to and including the empty line, Pandas just gets the valid csv data. For the headers, no more than one line is ever loaded.
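For the files in the question, which mark the data with [Data Start] rather than a leading blank line, the skip condition in yield_csv could test for the marker instead. A sketch of that variant (the case-insensitive match is an assumption, since the question's pseudo-data shows "[Data Start]" while its code tests for "DATA START"; the units/type rows would still come through and need skiprows handling, as in the question's update):

    def yield_csv(self, filename):
        """ variant: skip everything up to and including the
        DATA START marker, then yield characters one by one """
        with open(filename) as f:
            for line in f:
                if "data start" in line.lower():  # assumed marker spelling
                    break
            for line in f:
                for char in line:
                    yield char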

gens
  • thanks so much for putting this together! I'm having a little bit of trouble understanding the process in your answer. Is this right? DataFrame df is created with read_csv from a file we've processed with the HeaderSkipCsv class. That class goes something like this: __init__ is called when an object is created, and it calls the yield_csv function within the class on the object. Here's where I'm confused - why doesn't the second "for" in yield_csv start back at the beginning of the file, and when does the HeaderSkipCsv.read function get called? – TempleGuard527 Nov 20 '19 at 15:22
  • @TempleGuard527 I updated the answer to clarify when `read` is called and how `f` is advanced. – gens Nov 20 '19 at 17:09
  • Thanks! I think, though, that I have a satisfactory solution for avoiding the header by using skiprows from pandas (I included that code in an update in the question). To be fair, you did answer the question - this removes the header quickly. However, I'm still on the hunt for a quick way to output the large file after it's stripped of the header. – TempleGuard527 Nov 25 '19 at 19:39
  • @TempleGuard527 I updated my answer to show how to output the file without involving Pandas at all. You can play with the `1024` chunk size to see if it impacts performance on your system, basically setting it to however much RAM you wish to use. – gens Nov 25 '19 at 21:48