Splitting & Editing CSV Column & Arranging In Alphabetical Order

Question

I have developed the following .py file for a CSV file with a number of columns and thousands of rows of data. Here is the script I have so far:

import csv

infile = open("titanic.csv", "rU")
incsv = csv.reader(infile, delimiter = ',')
outfile = open("titanicOutput.csv", "w")
outcsv = csv.writer(outfile, delimiter = ',')
header = incsv.next()

rowNum = 0
for row in incsv:
    (data1, data2, namedata, data4, data5, data6, data7, data8, data9, data10, data11) = row
    if '1' in data1:
        rowOutput = [namedata, data2, data4, data5]
        outcsv.writerow(rowOutput)
        rowNum += 1

infile.close()
outfile.close()

Basically, the namedata column holds everyone's full name in the form "Smith, John": the last name comes first, followed by the first name. I need to separate the last name and first name and create a column for each in the output, without the comma or quotation marks that currently exist. I also need to sort the output alphabetically by the lastname column. I know sort() will be used in some capacity to order the rows, but I have no idea how to do the splitting.

I got this far but have no idea how to split the namedata column. I read one explanation on here for a similar problem, but it was honestly too complex for me to follow. A dumbed-down explanation would be amazing, thanks!

EDIT: Original File Data (Simplified version for illustration) -
data1   data2   namedata               data4    data5
0         3     Smith, Mr John           m       22
1         1     McMahan, Ms Sally        f       38
1         3     Emmit, Mr Brandon        m       26

Output csv File (Simplified version for illustration) -
lastname    firstname      data2    data4
Emmit       Mr Brandon       3        m
McMahan     Ms Sally         1        f
Smith       Mr John          3        m

Hope that helps!

Absolutely, this might be a really dumb question but how do I attach files to this post? — lonewolf2288, May 10 '16 at 07:05

Burhan Khalid · Answer 1 · 2016-05-10T04:54:24.650

You can split the data using the appropriately named .split method of strings, like this:

>>> namedata = 'Smith, John'
>>> last,first = namedata.split(',')
>>> last
'Smith'
>>> first
' John'
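Note the leading space in ' John': split(',') cuts at the comma and keeps whatever follows it, space included. Here is a minimal sketch of one way to tidy that up, assuming every name contains at least one comma (a name without one would raise a ValueError). The helper name split_name is made up for illustration.

```python
def split_name(namedata):
    """Split 'Last, First' into (last, first) without stray spaces."""
    # maxsplit=1 cuts at the first comma only, so any later comma stays in `first`
    last, first = namedata.split(',', 1)
    # strip() removes leading and trailing whitespace from each part
    return last.strip(), first.strip()


print(split_name('Smith, John'))
print(split_name('McMahan, Ms Sally'))
```

Splitting on ', ' (comma plus space) would also work for this data, but strip() is a little more forgiving if the spacing after the comma varies.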

You also don't need the rowNum tracker (you don't seem to use it anywhere). Try this version:

import csv

rows = []
with open("titanic.csv", "r") as infile:
    reader = csv.reader(infile, delimiter=',')
    next(reader)  # skip the header row
    for row in reader:
        # Split "Smith, John" at the first comma only
        last, first = row[2].split(',', 1)
        # strip() removes the space left over after the comma
        rows.append([last.strip(), first.strip(), row[1], row[3], row[4]])

# Sort the rows by last name
sorted_rows = sorted(rows, key=lambda x: x[0])

with open("titanicOutput.csv", "w") as outfile:
    writer = csv.writer(outfile, delimiter=',')
    writer.writerows(sorted_rows)

print('Done')
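If you also want to keep the `'1' in data1` filter from your script and write a header row, the whole thing could look roughly like the sketch below. It is written for Python 3 and reads from an in-memory string instead of titanic.csv so it can be run as-is. The sample rows, the convert function name, and the choice of output columns (taken from your illustration) are all assumptions.

```python
import csv
import io

# Made-up sample standing in for titanic.csv
SAMPLE = '''data1,data2,namedata,data4,data5
0,3,"Smith, Mr John",m,22
1,1,"McMahan, Ms Sally",f,38
1,3,"Emmit, Mr Brandon",m,26
'''


def convert(infile, outfile):
    reader = csv.reader(infile)
    next(reader)  # skip the original header row
    rows = []
    for row in reader:
        if '1' in row[0]:  # same filter as the script in the question
            last, first = row[2].split(',', 1)
            rows.append([last.strip(), first.strip(), row[1], row[3]])
    rows.sort(key=lambda r: r[0])  # sort in place by last name
    writer = csv.writer(outfile, lineterminator='\n')
    writer.writerow(['lastname', 'firstname', 'data2', 'data4'])
    writer.writerows(rows)


out = io.StringIO()
convert(io.StringIO(SAMPLE), out)
print(out.getvalue())
```

With real files you would pass file objects from open() instead of io.StringIO; in Python 3 the csv documentation recommends opening them with newline=''.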

Python knows how to sort most things. For example, if you pass it a list of names, it knows how to sort alphabetically:

>>> names = ['Zack', 'John', 'David']
>>> sorted(names)
['David', 'John', 'Zack']

You can also tell it to sort in reverse order:

>>> sorted(names, reverse=True)
['Zack', 'John', 'David']

This works fine for simple lists. In your case, however, you have a list of lists, so you need to tell Python what to sort by.

This is what the key argument is for. You pass this argument a function that returns the object that you want to sort by. This function will be called with each item in the list, and it should return the thing that Python will use to sort.

In our case, we want to sort by the last name, which is the first item for each list in our list.

Our data looks like this:

[['Smith', 'John', 1, 3, 4], ['Jones', 'Avery', 1, 3, 4]]

We want to sort by the first value of each inner list (which is the last name). The function we write will be passed each item (list), so we need to just return the first item:

def sort_by(item):
   return item[0]

sorted(names, key=sort_by)

Now sorted works like we want:

>>> names = [['Smith', 'John', 1, 3, 4], ['Jones', 'Avery', 1, 3, 4]]
>>> def sort_by(item):
...   return item[0]
...
>>> sorted(names, key=sort_by)
[['Jones', 'Avery', 1, 3, 4], ['Smith', 'John', 1, 3, 4]]

A lambda is just a shortcut for writing a small function. Since we are unlikely to use the sort_by function anywhere other than in this sort, we don't really need to define it separately. We can turn it into a lambda and pass it directly:

>>> sorted(names, key=lambda item: item[0])
[['Jones', 'Avery', 1, 3, 4], ['Smith', 'John', 1, 3, 4]]
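One thing to watch for with real data: Python compares strings character by character, and every uppercase letter sorts before every lowercase one. If the capitalisation of your last names is inconsistent, the key function can lowercase the value first. The sketch below uses made-up rows and assumes item[0] is the last name and item[1] the first name; returning a tuple from the key also breaks ties on the first name.

```python
rows = [['smith', 'John'], ['Young', 'Zoe'], ['Jones', 'Avery'], ['Smith', 'Adam']]

# Plain comparison: 'Young' sorts before 'smith' because 'Y' < 's'
plain = sorted(rows, key=lambda item: item[0])

# Lowercase inside the key to ignore case; the tuple orders equal
# last names by first name
friendly = sorted(rows, key=lambda item: (item[0].lower(), item[1].lower()))

print(plain)
print(friendly)
```

The key only affects the ordering. The rows themselves keep their original capitalisation.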

score 0 · Answer 2 · edited May 23 '17 at 12:16

If I understand correctly, you have a field like:

name = "Smith, John"

But you want a list like:

["John", "Smith"]

For that, you could do something to the tune of:

first_last = name.split(', ')
first_last.reverse()
print(first_last)

For sorting, there are bound to be many ways, and this may not be the most elegant, but you could create a dict keyed by last name, sort the keys, then print out the corresponding values:

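import csv

phonebook = dict()

with open("titanic.csv") as infile:
    reader = csv.reader(infile)
    next(reader)  # skip the header row
    for row in reader:
        # the last name is everything before the first comma in the name column
        last_name = row[2].split(',')[0]
        phonebook[last_name] = row

# sorted() returns the keys as a new, alphabetically ordered list
lastnames = sorted(phonebook.keys())
for key in lastnames:
    print(phonebook[key])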

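The sorting part is taken almost wholesale from https://stackoverflow.com/a/13990710/695787 . It will lose data when last names repeat, though: a dict keeps only one value per key, so each row with a duplicate last name overwrites the previous one.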
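If you want to stay with the dict idea but keep rows that share a last name, one option is to store a list of rows under each key. This is only a sketch with made-up rows, and it assumes the name is in column 2 as in the question. collections.defaultdict(list) creates an empty list the first time a key is seen.

```python
from collections import defaultdict

# Made-up rows in the same column order as the question's file
rows = [
    ['1', '1', 'Smith, Mr John', 'm'],
    ['1', '3', 'Emmit, Mr Brandon', 'm'],
    ['1', '2', 'Smith, Ms Anna', 'f'],
]

phonebook = defaultdict(list)
for row in rows:
    last_name = row[2].split(',', 1)[0].strip()
    phonebook[last_name].append(row)  # duplicates pile up instead of overwriting

# Walk the last names alphabetically and collect every row stored under each
ordered = []
for last_name in sorted(phonebook):
    ordered.extend(phonebook[last_name])

for row in ordered:
    print(row)
```

Rows with the same last name come out in the order they were read. Sorting the list of rows directly with a key function avoids the dict altogether.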
