Just use head, tail, and concat. You can even adjust the number of rows.
import pandas as pd

df = pd.read_csv("flu.csv")
top = df.head(1)       # first row
bottom = df.tail(1)    # last row
concatenated = pd.concat([top, bottom])
print(concatenated)
Result:
           Date  Cases
0      9/1/2014     45
121  12/31/2014     97
Adjusting head and tail to take in 5 rows from the top and 10 from the bottom works the same way; a minimal sketch and its output:
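import pandas as pd

df = pd.read_csv("flu.csv")
top = df.head(5)       # first 5 rows
bottom = df.tail(10)   # last 10 rows
print(pd.concat([top, bottom]))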
           Date  Cases
0      9/1/2014     45
1      9/2/2014    104
2      9/3/2014     47
3      9/4/2014    108
4      9/5/2014     49
112  12/22/2014     30
113  12/23/2014     81
114  12/24/2014     99
115  12/25/2014     85
116  12/26/2014     55
117  12/27/2014     91
118  12/28/2014     68
119  12/29/2014    109
120  12/30/2014     55
121  12/31/2014     97
One possible approach, if you don't want to load the whole CSV file as a dataframe, is to process the file as plain CSV instead. The following code is similar to your approach.
import pandas as pd
import csv

# Read only the first data row (and the header) with pandas
top = pd.read_csv("flu.csv", nrows=1)
headers = top.columns.values

# Grab the last line of the file and write it out as a tiny
# one-row CSV that pandas can parse with the same header
with open("flu.csv", "r") as f, open("flu2.csv", "w", newline="") as g:
    last_line = f.readlines()[-1].strip().split(",")
    c = csv.writer(g)
    c.writerow(headers)
    c.writerow(last_line)

bottom = pd.read_csv("flu2.csv")
concatenated = pd.concat([top, bottom])
concatenated.reset_index(inplace=True, drop=True)
print(concatenated)
The result is the same, except for the index. Tested against a million rows, it was processed in about a second.
        Date  Cases
0   9/1/2014     45
1  7/25/4885     99
[Finished in 0.9s]
As for how it scales to 15 million rows, maybe that's your ballgame now, so I decided to test it against exactly 15,728,626 rows, and the results seem good enough.
        Date  Cases
0   9/1/2014     45
1  7/25/4885     99
[Finished in 3.3s]
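One caveat if memory ever becomes the bottleneck at that scale: f.readlines() above still loads the entire file just to reach its last line. A minimal sketch that seeks near the end of the file instead, assuming comma-separated rows with no embedded newlines and a last line shorter than the chunk it reads back:

import pandas as pd

top = pd.read_csv("flu.csv", nrows=1)

# Seek to the tail of the file and read only the last few KB,
# rather than loading all rows into memory. Assumes the last
# 4096 bytes contain at least one complete line.
with open("flu.csv", "rb") as f:
    f.seek(0, 2)                     # move to end of file
    f.seek(max(f.tell() - 4096, 0))  # back up a few KB
    last_line = f.read().splitlines()[-1].decode().strip().split(",")

# Build the bottom row directly; note its values stay strings here,
# unlike the type-inferred values pandas parses for the top row.
bottom = pd.DataFrame([last_line], columns=top.columns)
concatenated = pd.concat([top, bottom], ignore_index=True)
print(concatenated)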