Deleting "string" containing last rows from CSV file using regex

Question

I am new to Python. I have thousands of CSV files in which a block of text rows follows the logged numeric data, and I would like to remove every row from the first text row onward. For example:

col 1    col 2    col 3
--------------------
10      20        30
--------------------
45      34        56
--------------------
Start   8837sec    9items
--------------------
Total   6342sec   755items

Fortunately, in every CSV file the text block begins with "Start" in column 1. I would like to remove that "Start" row and every row after it.

Here is what I have so far:

import csv, os, re, sys


fileList = []

pattern = [r"\b(Start).*", r"\b(Total).*"]

for file in files:
    fullname = os.path.join(cwd, file)

    if not os.path.isdir(fullname) and not os.path.islink(fullname):
        fileList.append(fullname)


for file in fileList:
    try:
        ifile = open(file, "r")
    except IOError:
        sys.stderr.write("File %s not found! Please check the filename." %(file))
        sys.exit()
    else:
        with ifile:
            reader = csv.reader(ifile)
            writer = csv.writer(ifile)
            rowList = []     
            for row in reader:
               rowList.append((", ".join(row)))

        for pattern in word_pattern:
             if not (re.match(pattern, rowList)
                writer.writerow(elem)

After running this script, I get a blank CSV file. Any idea what to change?

There is no variable named `writer` in this example. You should get an exception and nothing written. You just want to strip everything after `START`? You don't need csv for that. — tdelaney, Feb 26 '17 at 02:02
I have added writer statement in the code. The encoding of CSV file is in ASCII format. — SalN85, Feb 26 '17 at 07:23

score 0 · Accepted Answer · answered Feb 26 '17 at 02:31

You don't need the CSV reader for this. You could simply find the offset and truncate the file. Open the file in binary mode and use a multi-line regex to find the pattern in the text and use its index.

import os
import re

# multiline, ascii only regex matches Start or Total at start of line
start_tag_finder = re.compile(rb'(?am)\nStart|\nTotal').search

for filename in files: # TODO: I'm not sure where "files" comes from...
    # NOTE: no need to join cwd, relative paths do that automatically
    if not os.path.isdir(filename) and not os.path.islink(filename):
        with open(filename, 'rb+') as f:
            # NOTE: you can cap file size if you'd like
            if os.stat(filename).st_size > 1000000:
                print(filename, "exceeds the 1 MB size limit")
                continue
            search = start_tag_finder(f.read())
            if search:
                f.truncate(search.start())
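
To see what the search returns, here is a small check on an in-memory sample. The comma-separated bytes are made up for illustration, not taken from a real file. Because the match begins at the newline before Start, cutting at search.start() leaves the last data row without a line ending; cutting at search.start() + 1 would keep it.

```python
import re

# Same pattern as above, without the inline flags (this pattern does not rely on them)
start_tag_finder = re.compile(rb'\nStart|\nTotal').search

# Hypothetical file contents, made up for illustration
sample = (b"col 1,col 2,col 3\n"
          b"10,20,30\n"
          b"45,34,56\n"
          b"Start,8837sec,9items\n"
          b"Total,6342sec,755items\n")

search = start_tag_finder(sample)

# What f.truncate(search.start()) would leave behind
kept = sample[:search.start()]
# Variant that keeps the line ending of the last data row
kept_with_newline = sample[:search.start() + 1]

print(kept)
print(kept_with_newline)
```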

Hi tdelaney...Thanks, it does work :). A quick question: Would the method of string: (string.startswith(keywords)) work in these cases as well, where my keywords are keywords = ("Search", "Total")? — SalN85, Feb 26 '17 at 07:26
This example processes the file in a block, not line by line, so `startswith` doesn't work, but `"\nStart" in f.read()` would. The regex lets you check multiple keywords at once in a single C extension block which I assume is faster. On most modern computers, burning a few meg of RAM to read a file is trivial and this (a guess!) should have good performance. You could read line by line and do `startswith` also. — tdelaney, Feb 26 '17 at 15:55
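
The line-by-line alternative mentioned in the last comment could look roughly like this. It is a sketch, not part of the answer: truncate_at_keyword is a made-up name, and the keywords tuple assumes the markers are Start and Total as in the question's sample.

```python
import os
import tempfile

KEYWORDS = (b"Start", b"Total")  # assumed markers, taken from the question's sample


def truncate_at_keyword(filename, keywords=KEYWORDS):
    """Cut the file at the first line that starts with one of the keywords."""
    with open(filename, 'rb+') as f:
        offset = 0
        for line in iter(f.readline, b''):
            # bytes.startswith accepts a tuple, so all keywords are checked at once
            if line.startswith(keywords):
                f.truncate(offset)
                return True
            offset += len(line)
    return False


# Usage on a throwaway file with made-up contents
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, 'wb') as f:
        f.write(b"10,20,30\n45,34,56\nStart,8837sec,9items\nTotal,6342sec,755items\n")
    changed = truncate_at_keyword(path)
    with open(path, 'rb') as f:
        result = f.read()
    print(changed, result)
```

Tracking the byte offset of each line start means the cut lands exactly before the keyword line, so the last data row keeps its line ending.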

pstatix · Answer 2 · score 0 · 2017-02-27T12:28:04.643

I would try this for everything after you get your fileList together:

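for file in fileList:
    keepRows = []
    with open(file, 'r', newline='') as oFile:
        for row in csv.reader(oFile):
            # Stop at the first row whose first column begins with "Start"
            if row and row[0].startswith("Start"):
                break
            keepRows.append(row)
    # 'w+' truncates the file; Python 3's csv module wants text mode with
    # newline='' where Python 2 used 'wb+'
    with open(file, 'w+', newline='') as nFile:
        writer = csv.writer(nFile, delimiter=',')
        writer.writerows(keepRows)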

This opens your original file, collects the rows you want, closes it, and reopens it with w+. That mode keeps the file name but truncates the contents to zero bytes, after which each kept row is written back on its own line.

Alternatively, you could create a new file for each csv doing:

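for file in fileList:
    keepRows = []
    # NOTE: every input file is appended to the same 'new_file.csv'
    with open(file, 'r') as oFile, open('new_file.csv', 'a') as nFile:
        for row in oFile:
            # Stop at the first line that begins with "Start"
            if row.startswith("Start"):
                break
            keepRows.append(row)
        # Each kept line still ends with its own newline
        for row in keepRows:
            nFile.write(row)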

Opening with a (append) places the write position at the end of the file each time. The csv writer methods used before take iterables (each row is a list of fields), whereas .write takes a plain string, so in the append version each line in keepRows, which still carries its own line ending, is written as-is onto its own row.
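
If a separate output file per input is wanted rather than one shared new_file.csv, a sketch along these lines might do. The helper names copy_until_start and trimmed_name, and the _trimmed suffix, are assumptions made up for this example.

```python
import os
import tempfile


def copy_until_start(src, dst, marker="Start"):
    """Copy lines from src to dst, stopping before the first line that starts with marker."""
    with open(src, 'r') as oFile, open(dst, 'w') as nFile:
        for row in oFile:
            if row.startswith(marker):
                break
            nFile.write(row)


def trimmed_name(path):
    # Hypothetical naming scheme: data.csv -> data_trimmed.csv
    base, ext = os.path.splitext(path)
    return base + "_trimmed" + ext


# Usage on a throwaway file with made-up contents
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "data.csv")
    with open(src, 'w') as f:
        f.write("10,20,30\n45,34,56\nStart,8837sec,9items\n")
    dst = trimmed_name(src)
    copy_until_start(src, dst)
    with open(dst) as f:
        trimmed = f.read()
    print(os.path.basename(dst), repr(trimmed))
```

Writing to a new name leaves the original untouched, which makes it easy to compare the two before deleting anything.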

EDIT: Updated syntax for binary file mode and .writer().

edited Feb 27 '17 at 12:28

answered Feb 26 '17 at 02:55


hi pstatix, thanks for your help. I understood your first method for creating out a new list by isolating anything after "Start". But, when you start overwriting the file, keeping the name, I don't see the truncate option. Also, I believe csv.writer() should take 'nfile' as an argument? – SalN85 Feb 26 '17 at 07:42
@Salil Nanda, I updated the `.writer()` piece as you were correct, I forgot to supply it with a file object. The truncate option is based on the mode in which you call the `open()` function. Using `w` means `write` using `b` means the file is opened in `binary` mode and using `+` enables `read-write` updating capability. By default, `w+` overwrites a file to 0 bytes (i.e. truncates it). This is why we call the `wb+` mode after we have gathered your desired rows. The reason we use `b` is so that the Windows OS can interpret the `new line` aspect of the file. – pstatix Feb 27 '17 at 12:33
@Salil Nandra, response too long. Here are some references for `open()` modes: 1) http://stackoverflow.com/questions/16208206/confused-by-python-file-mode-w. 2) https://docs.python.org/2/library/functions.html#open – pstatix Feb 27 '17 at 12:34
thanks very much for your help. I will look into it. – SalN85 Feb 27 '17 at 18:00
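
The truncating behaviour of the w+ mode described in the comments above can be checked with a short experiment. This is only an illustration on a temporary file with made-up contents. Note that under Python 3 the csv module expects a text-mode file opened with newline='' rather than the b flag.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")
    with open(path, 'w') as f:
        f.write("10,20,30\n")
    size_before = os.path.getsize(path)

    # Opening with 'w+' truncates immediately, before anything is written
    with open(path, 'w+') as f:
        size_after_open = os.path.getsize(path)
        f.write("45,34,56\n")
        f.seek(0)  # '+' allows reading back through the same handle
        read_back = f.read()

    print(size_before, size_after_open, repr(read_back))
```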
