2

I have about 100 CSV files I have to operate on once a month and I was trying to wrap my head around this but I'm running into a wall. I'm starting to understand some things about Python, but combining several things is still giving me issues, so I can't figure this out.

Here's my problem:

I have many CSV files, and here's what I need done:

add a "column" to the front of each row (or the back, doesn't matter really, but front is ideal). In addition, each line has 5 rows (not counting the filename that will be added), and here's the format:

6-digit ID number,YYYY-MM-DD(1),YYYY-MM-DD(2),YYYY-MM-DD(3),1-2-digit number

I need to subtract YYYY-MM-DD(3) from YYYY-MM-DD(2) for every line in the file (there is no header row), for every CSV in a given directory.

I need the filename inside the row because I will combine the files (which, if is included in the script would be awesome, but I think I can figure that part out), and I need to know what file the records came from. Format of filename is always '4-5-digit-number.csv'

I hope this makes sense, if it does not, please let me know. I'm kind of stumped as to where to even begin, so I don't have any sample code that even really began to work for me. Really frustrated, so I appreciate any help you guys may provide, this site rocks!

Mylan

mrmylanman
  • 23
  • 1
  • 4
  • I don't understand why you would need the filename inside the file itself. When you open the file, you know its name. – Tim Pietzcker Sep 08 '11 at 15:54
  • `In addition, each line has 5 rows`, do you mean columns? – MattH Sep 08 '11 at 16:02
  • I will ultimately be combining all files into one, so I will need to know which row comes from what file. Each file basically outlines some data from a particular facility, and I need to know which facility that is for, so that's why I wanted the filename in each row. And I did mean column, sorry. – mrmylanman Sep 08 '11 at 16:17

2 Answers2

7

There's a tool in the standard library for each of these tasks:

To iterate over all CSV files in a directory, use the glob module:

import glob
for csvfilename in glob.glob(r"C:\mydirectory\*.csv"):
    #do_something

To parse a CSV file, use the csv module:

import csv
with open(csvfilename, "rb") as csvfile:
    reader = csv.reader(csvfile, delimiter=",")
    for row in reader:
        # row is a list of all the entries in the current row

To parse a date and calculate a difference, use the datetime module:

from datetime import datetime
startdate = datetime.strptime("1999-10-20", "%Y-%m-%d")
enddate = datetime.strptime("2003-02-28", "%Y-%m-%d")
delta = enddate - startdate # difference in days

To add a value to the beginning of a row:

row[0:0] = [str(delta)]

To append the filename to the end of a row:

row.append(csvfilename)

And to write a row to a new CSV file:

with open(csvfilename, "wb") as csvfile:
    writer = csv.writer(csvfile, delimiter=",")
    writer.writerow(row)

Taken all together, you get:

import glob
import csv
from datetime import datetime

with open("combined_files_csv", "wb") as outfile:
    writer = csv.writer(outfile, delimiter=",")
    for csvfilename in glob.glob(r"C:\mydirectory\*.csv"):
        with open(csvfilename, "rb") as infile:
            reader = csv.reader(infile, delimiter=",")
            for row in reader:
                startdate = datetime.strptime(row[3], "%Y-%m-%d")
                enddate = datetime.strptime(row[2], "%Y-%m-%d")
                delta = enddate - startdate # difference in days
                row[0:0] = [str(delta)]
                row.append(csvfilename)
                writer.writerow(row)
Tim Pietzcker
  • 328,213
  • 58
  • 503
  • 561
  • 2
    Awesome, thank you for these tips, I greatly appreciate the direction – mrmylanman Sep 08 '11 at 16:18
  • I tried out your script, and it seems like it does exactly what I need, however when it gets to line 37 ("for row in reader:"), it throws the error "_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)". I am using Mac OS 10.6 and Python 3.2.1 if that makes a difference. Thanks – mrmylanman Sep 08 '11 at 16:57
  • Actually I got it going now. I overlooked changing "combined_files_csv" from a string to an actual file. Works like a charm now, thanks a ton! This also gives me insight into how to do other tasks I have to do on CSVs regularly. – mrmylanman Sep 08 '11 at 17:15
  • Ah, yes; in Python 3, the csv module works a little differently; you don't open the files in binary mode, and there is an additional parameter `newline`; see [the docs](http://docs.python.org/py3k/library/csv.html). So you would do `open("file.csv", "w", newline = "")` and `writer = csv.writer(outfile, delimiter = "")`. It's a good idea to use the `python-3.x` tag in addition to the `python` tag because without it, most people assume you're talking about Python 2. – Tim Pietzcker Sep 08 '11 at 17:28
  • Thanks again, I'm still trying to get used to the differences of the two. I figured it'd be best for future-proofing myself to work in Python3, but it's a little tricky sometimes as most of the information I find that I reference is written for 2.x and I have to change it to work with 3.x. Thanks for all the help! – mrmylanman Sep 08 '11 at 17:37
0

The basic outline of the program is going to be like this:

  1. Use the os module to get the filenames out of the directory/directories of interest
  2. Read in each file one at a time
  3. For each line in the file, split it into columns with columns = line.split(",")
  4. Use datetime.date to convert strings like "2011-05-03" to datetime.dates.
  5. Subtract the third date from the second, which yields a datetime.timedelta.
  6. Put all your information in the format you want (hint: str(foo) yields a string representation of foo, for just about any type) and remember it for later
  7. Close your file, reopen it for writing, and write your new stuff in
  • I'd suggest replacing step 7 with `write your new stuff out to a temporary file`. Then `8. Rename original file to .bak suffixed name`. `9. Rename temporary file to original filename`. So as to minimise risk of losing data. – MattH Sep 08 '11 at 16:17
  • Thank you very much, this is a big help – mrmylanman Sep 08 '11 at 16:18