
All -

I am looking to create a pandas DataFrame from only the first and last lines of a very large csv. The purpose of this exercise is to be able to easily grab some attributes from the first and last entries in these csv files. I have no problem grabbing the first line of the csv using:

pd.read_csv(filename, nrows=1)

I also have no problem grabbing the last row of a text file in various ways, such as:

with open(filename) as f:
    last_line = f.readlines()[-1]
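or, keeping only one line in memory at a time (this still scans the whole file, but never holds it all at once), for example with collections.deque:

from collections import deque

with open(filename) as f:
    last_line = deque(f, 1)[0]  # maxlen=1 keeps just the final line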

However, getting these two things into a single DataFrame has thrown me for a loop. Any insight into how best to achieve this goal?

EDIT NOTE: I am trying to achieve this without first loading all of the data into a single DataFrame, as I am dealing with pretty large (>15MM rows) csv files.

Thanks!

wrcobb
  • Can you show the steps you have taken trying to get those two things into a single DataFrame, and what error you got? – Anzel Nov 07 '14 at 17:29
  • @Anzel the first snippet successfully grabs the first row and puts it in a DataFrame. The second snippet grabs the last line, but it lacks headers so I am not able to create a new DataFrame and use a simple concatenation to combine. – wrcobb Nov 07 '14 at 17:38

5 Answers


Just use head and tail and concat. You can even adjust the number of rows.

import pandas as pd

df = pd.read_csv("flu.csv")
top = df.head(1)
bottom = df.tail(1)
concatenated = pd.concat([top, bottom])

print(concatenated)

Result:

           Date  Cases
0      9/1/2014     45
121  12/31/2014     97

Adjusting head and tail to take in 5 rows from top and 10 from bottom...
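For example:

top = df.head(5)
bottom = df.tail(10)
concatenated = pd.concat([top, bottom])

print(concatenated)

Result: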

           Date  Cases
0      9/1/2014     45
1      9/2/2014    104
2      9/3/2014     47
3      9/4/2014    108
4      9/5/2014     49
112  12/22/2014     30
113  12/23/2014     81
114  12/24/2014     99
115  12/25/2014     85
116  12/26/2014     55
117  12/27/2014     91
118  12/28/2014     68
119  12/29/2014    109
120  12/30/2014     55
121  12/31/2014     97

One possible approach, if you don't want to load the whole CSV file as a DataFrame, is to process it as plain CSV. The following code is similar to your approach.

import pandas as pd
import csv

top = pd.read_csv("flu.csv", nrows=1)
headers = top.columns.values

# write the header plus the last line out to a small two-row CSV
with open("flu.csv", "r") as f, open("flu2.csv", "w", newline="") as g:
    last_line = f.readlines()[-1].strip().split(",")
    c = csv.writer(g)
    c.writerow(headers)
    c.writerow(last_line)

bottom = pd.read_csv("flu2.csv")
concatenated = pd.concat([top, bottom])
concatenated.reset_index(inplace=True, drop=True)

print(concatenated)

Result is the same, except for the index. Tested against a million rows, it was processed in about a second.

        Date  Cases
0   9/1/2014     45
1  7/25/4885     99
[Finished in 0.9s]

How it scales to 15 million rows was the open question, so I tested it against exactly 15,728,626 rows and the results look good enough.

        Date  Cases
0   9/1/2014     45
1  7/25/4885     99
[Finished in 3.3s]
WGS
  • Thanks for your response - any thoughts on how to do this without first loading the entire file into a DataFrame? I should have been more explicit in my post but these are very large (15MM row plus) csv files so I am trying to avoid loading the entire dataset. I'll edit my question as well. – wrcobb Nov 07 '14 at 17:34
  • Is the number of rows constant or does it change? – WGS Nov 07 '14 at 17:39
  • the number of rows is not constant. I am already counting the total number of rows in the analyzed file elsewhere in my script, so I do have that datapoint at my disposal. – wrcobb Nov 07 '14 at 17:40
  • if you know the number of lines, you can read the first line, then have another read which uses skiprows=n to read the last line. Then pd.concat() to bring them together (see the sketch just after these comments). – JD Long Nov 07 '14 at 18:21
  • What @JDLong said. Possible drawbacks are that you still have to resolve the issue of headers and possible str-int concatenation problems. – WGS Nov 07 '14 at 18:24
  • @wrcobb: See my edit above if it's able to scale to your requirement. – WGS Nov 07 '14 at 18:40
  • This is great - the edited solution is far more efficient than my old solution of loading the entire dataset into a DataFrame. Thanks! – wrcobb Nov 07 '14 at 19:00
  • Admittedly, I was a bit surprised that this was faster than just knowing the count ahead of time and using `skiprows`. Seems like sometimes it's really good to fall back to good old standard library modules. Good luck! – WGS Nov 07 '14 at 19:00
  • @wrcobb: Oh, as a matter of reference, what's your runtime for 15 million rows? How many columns are there as well? Just for kicks, I want to know how it scales. I'm surprised that from 1 to 15 million, the runtime just increased 200%. – WGS Nov 07 '14 at 19:02
  • @Metatron my runtime was right around 2 minutes - these files typically have 60-70 columns (in this case 66), so they are pretty big all around. This compares to about 20 minutes if I loaded the entire set into a DataFrame, IF it would even run before kicking a memory error. This method is far quicker and much less of a resource hog. – wrcobb Nov 07 '14 at 19:07
  • That's a massive improvement. Now it begs the question of whether using just the same approach for `top` would improve the time more. However, I seem to have read somewhere that `nrows` stops reading after the number of rows is reached, so it might be optimized well enough to read from the top. – WGS Nov 07 '14 at 19:10
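For reference, a minimal sketch of the skiprows idea from the comments above, assuming the total number of data rows n is already known (the asker counts it elsewhere in the script):

import pandas as pd

n = 15728626  # total data rows, counted elsewhere in your script
top = pd.read_csv(filename, nrows=1)

# skip file lines 1 .. n-1 (line 0 is the header); only the last row survives
bottom = pd.read_csv(filename, skiprows=range(1, n))

concatenated = pd.concat([top, bottom]).reset_index(drop=True)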

This is the best solution I found:

import pandas as pd

# count the lines without keeping the whole file in memory
with open(filename) as f:
    count = sum(1 for _ in f)

# keep the header, the first data row, and the last row; skip the rest
df = pd.read_csv(filename, skiprows=range(2, count - 1), header=0)
Stefan Manole

So the way to do this without reading the whole file into Python first is to grab the first line, then iterate through the file to the last line. Then use StringIO to pull them into pandas. Maybe something like this:

import pandas as pd
from io import StringIO

with open('tst.csv') as f:
    first_line = f.readline()
    for line in f:
        pass  # iterate to the end
    last_line = line

# parse each line on its own, then stack them
mydf = pd.concat([
    pd.read_csv(StringIO(first_line), header=None),
    pd.read_csv(StringIO(last_line), header=None),
])
JD Long

You want this answer: https://stackoverflow.com/a/18603065/4226476 - not the accepted answer there, but the best one, because it seeks backwards from the end of the file for the first newline instead of guessing.
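For reference, a rough sketch of that backwards seek (my paraphrase, not the linked code verbatim; it assumes the file has at least two lines and ends with a single trailing newline):

import os

with open(filename, 'rb') as f:
    first_line = f.readline().decode()
    f.seek(-2, os.SEEK_END)        # step over the trailing newline
    while f.read(1) != b'\n':      # walk back to the previous newline
        f.seek(-2, os.SEEK_CUR)
    last_line = f.readline().decode()

the_two_lines = first_line + last_line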

Then wrap the two lines in a StringIO:

from io import StringIO
import pandas as pd

# the_two_lines: the first (header) line plus the last line, per above
truncated_input = StringIO(the_two_lines)
df = pd.read_csv(truncated_input)
allen-smithee

I had this problem too and went searching for a better solution.

The suggestion by Stefan Manole above is better than reading in the whole CSV: it was about 2x faster in my testing.

Using a csv writer, as suggested above, was faster again, about 5x.

The best method by far is to use the head, tail, and sed Unix commands. In my tests it was over 20x faster!

import pandas as pd
import subprocess

filename = "csv_file.csv"

# header row
csv_header_str = subprocess.check_output(f"head -1 {filename}", shell=True).decode("utf-8").strip()
csv_header = csv_header_str.split(",")

# first data row
csv_head = subprocess.check_output(f"sed -n '2p' {filename}", shell=True).decode("utf-8").strip()
head = csv_head.split(",")

# last data row
csv_tail = subprocess.check_output(f"tail -1 {filename}", shell=True).decode("utf-8").strip()
tail = csv_tail.split(",")

df = pd.DataFrame([head, tail], columns=csv_header)

I have a GitHub repo for this with more functionality, like reading n lines from a csv into a DataFrame and handling data with/without headers: https://github.com/donjor/read-csv-turbo

I created a Python module, readcsvturbo (mainly just to try it out):

pip install readcsvturbo

import pandas as pd
import readcsvturbo as rct

filename = "csv_file.csv"
df = rct.read_csv_headtail(filename)

Hope this helps others who are in the same boat.

donjor