I am trying to remove multiple columns at one time without utilizing their field info. I get a CSV file from the county each month and there are multiple fields that I do not wish to share with the general public. Up until now, each month I manually delete each field. Since I am attempting to learn python, I would like to learn how to create a script to do it. There are 58 fields that I want to delete, so I don't want to write a script for each one but am hoping to create a range for deleting them. I have been searching the forums here for a few hours and have tried so many different methods that I don't know where to start or stop. Any help would be appreciated.
-
Have you tried this [Deleting columns - python][1] [1]: http://stackoverflow.com/questions/7588934/deleting-columns-in-a-csv-with-python – Verbatus Apr 08 '13 at 19:45
-
I did look at that. I have over 100 fields that I want to retain, so I am trying not to list each of them out. Unless I am misunderstanding all of the suggestions for that post, that is what I would need to do. I am willing to list them all but would rather avoid it if at all possible. – user2259051 Apr 08 '13 at 19:55
2 Answers
I'm lazy, so I like to use existing libraries when I can, and have become something of an evangelist for the pandas library. Using @Tim Pietzcker's example:
Name,Sex,Address,Age
John,M,New York,40
Mary,F,Los Angeles,30
We can keep only the columns we want using:
import pandas as pd
df = pd.read_csv("to_remove.csv")
keep_cols = ["Name", "Address"]
new_df = df[keep_cols]
new_df.to_csv("removed.csv", index=False)
(We could also one-line it, but I think it's clearer like that.)
Explanation follows. First, we can read the file into a storage object called a DataFrame
:
>>> import pandas as pd
>>> df = pd.read_csv("to_remove.csv")
>>> df
Name Sex Address Age
0 John M New York 40
1 Mary F Los Angeles 30
We can select one or multiple columns from this object:
>>> df[["Name", "Sex"]]
Name Sex
0 John M
1 Mary F
And then write it out:
>>> new_df = df[["Name", "Sex"]]
>>> new_df.to_csv("removed.csv", index=False)
(the index=False
bit just tells it not to add a column counting the rows, the numbers 0, 1 above), producing
Name,Sex
John,M
Mary,F
We can also decide that we only want to keep columns starting with the letter "A":
>>> [col for col in df.columns if col.startswith("A")]
['Address', 'Age']
>>> df[[col for col in df.columns if col.startswith("A")]]
Address Age
0 New York 40
1 Los Angeles 30
or use the .ix
method to keep only the columns from #1 to the next-to-last:
>>> df.ix[:,1:-1]
Sex Address
0 M New York
1 F Los Angeles
and so on.

- 342,061
- 65
- 592
- 494
Let's assume you have a CSV file like this:
Name,Sex,Address,Age
John,M,New York,40
Mary,F,Los Angeles,30
and you want to keep only the columns Name
and Address
.
Then you can do something like this (Python 3), making use of the extrasignore
parameter of the DictWriter
class:
import csv
fields = ["Name", "Address"]
with open("test.csv") as infile, open("out.csv", "w", newline="") as outfile:
# in Python 2, use open("out.csv", "wb") as outfile:
r = csv.DictReader(infile)
w = csv.DictWriter(outfile, fields, extrasaction="ignore")
w.writeheader()
for row in r:
w.writerow(row)
Result:
Name,Address
John,New York
Mary,Los Angeles
If you want to do it the other way around, i. e. specify which columns to delete from the file, then it's a bit more complicated:
import csv
delete = ["Sex", "Age"]
with open("test.csv") as infile, open("out.csv", "w", newline="") as outfile:
# in Python 2, use open("out.csv", "wb") as outfile:
r = csv.DictReader(infile)
firstrow = next(r) # Need to read the first row so we know the fieldnames
fields = r.fieldnames
w = csv.DictWriter(outfile,
[field for field in fields if not field in delete],
extrasaction="ignore")
w.writeheader()
w.writerow(firstrow)
for row in r:
w.writerow(row)

- 328,213
- 58
- 503
- 561