Sorting CSV data using Python

Question

I'm really desperate to get some data sorted. I have some data that I would like to sort in CSV using Python. If anyone could help, it would be much appreciated. It can't be sorted in Excel because there are too many rows (>15 million).

The format is as below:

X,Y,Level,ID,XYID
15.5,16.5,1.6,HB01,15.516.5
15.5,17.5,1.4,HB01,15.517.5
15.5,18.5,1.7,HB01,15.518.5
15.5,19.5,1.6,HB01,15.519.5
15.5,20.5,1.2,HB01,15.520.5
15.5,20.5,1.9,HB02,15.520.5
15.5,20.5,2.5,HB03,15.520.5
15.5,20.5,2.1,HB04,15.520.5

The XYID is just a concatenated text of X and Y. I want the data to be sorted so that in the second block, the only row that comes out is as below because it has the highest level (third column) and has the same XYID

15.5,20.5,2.5,HB03,15.520.5

And I want the final output to be a csv file with showing the first four rows because they have different XYID and the new 5th row with the one that shows the maximum level:

X,Y,Level,ID,XYID
15.5,16.5,1.6,HB01,15.516.5
15.5,17.5,1.4,HB01,15.517.5
15.5,18.5,1.7,HB01,15.518.5
15.5,19.5,1.6,HB01,15.519.5
15.5,20.5,2.5,HB03,15.520.5

What have you tried? Have you looked at the `csv` module? We are not here to do your job for you, please tells us where you got stuck. — Winston Ewert, Jul 11 '12 at 17:26
Have you tried suggestions from http://stackoverflow.com/questions/2089036/sorting-csv-in-python and http://stackoverflow.com/questions/2100353/sort-csv-by-column ? — Sudhir Krishnan, Jul 11 '12 at 17:30
How representative is your example data? Is your whole data also sorted in column order? — Marco de Wit, Jul 11 '12 at 17:31

Andrew Clark · Answer 1 · 2012-07-11T17:46:28.013

Something like the following should work:

import csv
import itertools

reader = csv.DictReader(open('input.csv', 'rb'))
groups = itertools.groupby(reader, lambda d: d['XYID'])
result = [max(g, key=lambda d: float(d['Level'])) for k, g in groups]

writer = csv.DictWriter(open('output.csv', 'wb'), reader.fieldnames)
writer.writeheader()
writer.writerows(result)

The idea here is to first group the rows based on their XYID value, then take the maximum row for each group using the Level value.

If your input CSV file does not already have rows grouped by the XYID value, you will first need to sort the rows so that they are:

reader = csv.DictReader(open('input.csv', 'rb'))
rows = sorted(reader, key=lambda d: d['XYID'])
groups = itertools.groupby(rows, lambda d: d['XYID'])
...

dm03514 · Answer 2 · 2012-07-11T17:39:07.773

You can use the csv module to read all rows.

import csv
import decimal
from operator import itemgetter

f = open('your_file')
csv_reader = csv.reader(f)

rows_list = []

convert each rows XYID value to a decimal

for row in csv_reader:
  row[4] = decimal.Decimal(row[4])

PUt all rows in a list

rows_list.append(row)

sort on the XYID key and write to output file.

rows_list.sort(key=itemgetter(4))

I don't know how much memory 15million rows would take up or how much you have available to you so perhaps you might even write the values to a sqlite database which python comes with support for?? It would be very simple to write the data to the db and select all data sorting on XYID

score 1 · Answer 3 · answered Jul 11 '12 at 17:30

You might want to look at pandas, which can read CSV files and create a tabular data structure which can be sorted, etc. It's pretty good at handling large data sets, although 15 million rows is pretty dang big, so it's always going to depend on how much memory you have, etc.

Marco de Wit · Answer 4 · 2012-07-11T17:51:42.243

0

assuming that your data still needs to be sorted:

from itertools import groupby

sorter = lambda r : r[4]

sorteddata = sorted(data, key=sorter)

newdata = (max(g, key=lambda r : float(r[2])) for g in groupby(sorteddata, sorter))

edited Jul 11 '12 at 17:51

answered Jul 11 '12 at 17:45

Marco de Wit

2,686
18
22

Sorting CSV data using Python

4 Answers4