My question is not how to open a .csv file, detect which rows I want to omit, and write a new .csv file with my desired lines. I'm already doing that successfully:
import csv

def sanitize(filepath):
    # Removes header information, leaving only column names and data. Outputs "sanitized" file.
    # dirname and newname are assumed to be defined elsewhere (not shown here).
    with open(filepath) as unsan, open(dirname + "/" + newname + '.csv', 'w', newline='') as san:
        writer = csv.writer(san)
        line_count = 0
        headingrow = 0
        datarow = 0
        safety = 1
        for row in csv.reader(unsan, delimiter=','):
            #Detect data start
            if "DATA START" in str(row):
                safety = 0
                headingrow = line_count + 1
                datarow = line_count + 4
            #Detect data end
            if "DATA END" in str(row):
                safety = 1
            #Write data
            if safety == 0:
                if line_count == headingrow or line_count >= datarow:
                    writer.writerow(row)
            line_count += 1
I have .csv data files that are megabytes, sometimes gigabytes (up to 4 GB) in size. Out of 180,000 lines in each file, I only need to omit about 50 lines.
Example pseudo-data (rows I want to keep are indented):
[Header Start]
...48 lines of header data...
[Header End]
Blank Line
[Data Start]
    Row with Column Names
Column Units
Column Variable Type
    ...180,000 lines of data...
I understand that I can't edit a .csv file as I iterate over it (Learned here: How to Delete Rows CSV in python). Is there a quicker way to remove the header information from the file, like maybe writing the remaining 180,000 lines as a block instead of iterating through and writing each line?
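To make the "write the rest as a block" idea concrete, here is a minimal sketch. It reads line by line only until the start marker, writes the column-names line, skips the units and variable-type lines, and then hands the remainder of the file to shutil.copyfileobj in large chunks. The function name strip_header_block, its parameters, and the 1 MiB chunk size are hypothetical, and the sketch assumes no quoted field in the header spans multiple lines. Anything after the data (such as an end-marker line) would be copied as well.

```python
import shutil

def strip_header_block(src_path, dst_path, start_marker="DATA START", skip_after_names=2):
    """Copy the column-names line and everything after the skipped rows.

    Hypothetical helper: the marker text and skip_after_names are assumptions
    based on the file layout described in the question.
    """
    with open(src_path, "r", newline="") as src, open(dst_path, "w", newline="") as dst:
        # Scan line by line only until the marker is found (roughly 50 lines).
        for line in iter(src.readline, ""):
            if start_marker in line:
                break
        else:
            raise ValueError("start marker not found")
        dst.write(src.readline())          # column-names line
        for _ in range(skip_after_names):  # units and variable-type lines
            src.readline()
        # Copy the remainder in 1 MiB chunks instead of row by row.
        shutil.copyfileobj(src, dst, length=1024 * 1024)
```

Because the bulk of the file never passes through csv.reader or csv.writer, the per-row parsing cost is avoided, and memory use stays bounded by the chunk size rather than the file size.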
Maybe one solution would be to append all the data rows to a list of lists and then use writer.writerows(list of lists) instead of writing them one at a time (Batch editing of csv files with Python, https://docs.python.org/3/library/csv.html)? However, wouldn't that mean I'm loading essentially the whole file (up to 4 GB) into my RAM?
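For the memory concern, one possible middle ground is to pass writer.writerows a bounded batch at a time instead of one giant list; writerows accepts any iterable of rows, so the full file never has to sit in a list. The sketch below is only an illustration: write_in_batches and the batch_size default are hypothetical, and whether batching is actually faster than writerow per row would need to be timed.

```python
import csv
from itertools import islice

def write_in_batches(rows, writer, batch_size=10_000):
    """Write rows through writer.writerows in fixed-size batches.

    Hypothetical helper; batch_size is an arbitrary assumption. At most
    batch_size rows are held in memory at any time.
    """
    rows = iter(rows)
    while True:
        batch = list(islice(rows, batch_size))  # pull the next chunk of rows
        if not batch:
            break
        writer.writerows(batch)
```

Here rows could be a generator that yields only the rows to keep from csv.reader, so the filtering logic stays the same and only the write step changes.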
UPDATE:
I've got a pandas import working, but when I time it, it takes about twice as long as the code above. Specifically, the to_csv portion takes about 10 s for a 26 MB file.
import csv, pandas as pd

filepath = r'input'
with open(filepath) as unsan:
    line_count = 0
    headingrow = 0
    datarow = 0
    safety = 1
    row_count = sum(1 for row in csv.reader(unsan, delimiter=','))
    unsan.seek(0)  # rewind: counting the rows consumed the file, so the loop below would otherwise see nothing
    for row in csv.reader(unsan, delimiter=','):
        #Detect data start
        if "DATA START" in str(row):
            safety = 0
            headingrow = line_count + 1
            datarow = line_count + 4
        #Capture column names
        if safety == 0:
            if line_count == headingrow:
                colnames = row
                line_count += 1
                break
        line_count += 1
badrows = [*range(0, 55, 1), row_count - 1]
df = pd.read_csv(filepath, names=[*colnames], skiprows=[*badrows], na_filter=False)
df.to_csv(r'output', index=None, header=True)
Here's the research I've done:
Deleting rows with Python in a CSV file
https://intellipaat.com/community/18827/how-to-delete-only-one-row-in-csv-with-python
https://www.reddit.com/r/learnpython/comments/7tzbjm/python_csv_cleandelete_row_function_doesnt_work/
https://nitratine.net/blog/post/remove-columns-in-a-csv-file-with-python/
Delete blank rows from CSV?