3

I am trying to remove multiple columns at one time without utilizing their field info. I get a CSV file from the county each month and there are multiple fields that I do not wish to share with the general public. Up until now, each month I manually delete each field. Since I am attempting to learn python, I would like to learn how to create a script to do it. There are 58 fields that I want to delete, so I don't want to write a script for each one but am hoping to create a range for deleting them. I have been searching the forums here for a few hours and have tried so many different methods that I don't know where to start or stop. Any help would be appreciated.

user2259051
  • 31
  • 1
  • 2
  • Have you tried this [Deleting columns - python][1] [1]: http://stackoverflow.com/questions/7588934/deleting-columns-in-a-csv-with-python – Verbatus Apr 08 '13 at 19:45
  • I did look at that. I have over 100 fields that I want to retain, so I am trying not to list each of them out. Unless I am misunderstanding all of the suggestions for that post, that is what I would need to do. I am willing to list them all but would rather avoid it if at all possible. – user2259051 Apr 08 '13 at 19:55

2 Answers2

6

I'm lazy, so I like to use existing libraries when I can, and have become something of an evangelist for the pandas library. Using @Tim Pietzcker's example:

Name,Sex,Address,Age
John,M,New York,40
Mary,F,Los Angeles,30

We can keep only the columns we want using:

import pandas as pd
df = pd.read_csv("to_remove.csv")
keep_cols = ["Name", "Address"]
new_df = df[keep_cols]
new_df.to_csv("removed.csv", index=False)

(We could also one-line it, but I think it's clearer like that.)


Explanation follows. First, we can read the file into a storage object called a DataFrame:

>>> import pandas as pd
>>> df = pd.read_csv("to_remove.csv")
>>> df
   Name Sex      Address  Age
0  John   M     New York   40
1  Mary   F  Los Angeles   30

We can select one or multiple columns from this object:

>>> df[["Name", "Sex"]]
   Name Sex
0  John   M
1  Mary   F

And then write it out:

>>> new_df = df[["Name", "Sex"]]
>>> new_df.to_csv("removed.csv", index=False)

(the index=False bit just tells it not to add a column counting the rows, the numbers 0, 1 above), producing

Name,Sex
John,M
Mary,F

We can also decide that we only want to keep columns starting with the letter "A":

>>> [col for col in df.columns if col.startswith("A")]
['Address', 'Age']
>>> df[[col for col in df.columns if col.startswith("A")]]
       Address  Age
0     New York   40
1  Los Angeles   30

or use the .ix method to keep only the columns from #1 to the next-to-last:

>>> df.ix[:,1:-1]
  Sex      Address
0   M     New York
1   F  Los Angeles

and so on.

DSM
  • 342,061
  • 65
  • 592
  • 494
1

Let's assume you have a CSV file like this:

Name,Sex,Address,Age
John,M,New York,40
Mary,F,Los Angeles,30

and you want to keep only the columns Name and Address.

Then you can do something like this (Python 3), making use of the extrasignore parameter of the DictWriter class:

import csv
fields = ["Name", "Address"]

with open("test.csv") as infile, open("out.csv", "w", newline="") as outfile:
    #           in Python 2, use open("out.csv", "wb") as outfile:
    r = csv.DictReader(infile)
    w = csv.DictWriter(outfile, fields, extrasaction="ignore")
    w.writeheader()
    for row in r:
        w.writerow(row)

Result:

Name,Address
John,New York
Mary,Los Angeles

If you want to do it the other way around, i. e. specify which columns to delete from the file, then it's a bit more complicated:

import csv
delete = ["Sex", "Age"]

with open("test.csv") as infile, open("out.csv", "w", newline="") as outfile:
    #           in Python 2, use open("out.csv", "wb") as outfile:
    r = csv.DictReader(infile)
    firstrow = next(r)  # Need to read the first row so we know the fieldnames
    fields = r.fieldnames
    w = csv.DictWriter(outfile, 
                       [field for field in fields if not field in delete], 
                       extrasaction="ignore")
    w.writeheader()
    w.writerow(firstrow)
    for row in r:
        w.writerow(row)
Tim Pietzcker
  • 328,213
  • 58
  • 503
  • 561