My CSV file looks like this:

0.0063,0.0121,band -> mcr music
0.0061,0.0123,band -> mcr
0.0062,0.0122,band -> orchestra

How can I sort the file by its first column and print each line? In this case, the final output should be:

0.0061,0.0123,band -> mcr
0.0062,0.0122,band -> orchestra
0.0063,0.0121,band -> mcr music
martineau
Viral Patel
    How big is the CSV file you're trying to sort (number of lines as well as disk size)? If it's small, you might just load it into memory via Pandas. If it's extremely large, you'll need to get creative and do a sort without loading the entire thing. If the first column always has the same number of digits, and you have roughly 3-4 times the filesize as free space, I'd personally recommend doing a Radix sort. – AmphotericLewisAcid May 19 '18 at 01:01

2 Answers

Once parsed, a CSV file is basically a Python list of lists (a matrix), one inner list per row. Your data would then look like the following:

csv = [
    [0.0063, 0.0121, 'band -> mcr music'],
    [0.0061, 0.0123, 'band -> mcr'],
    [0.0062, 0.0122, 'band -> orchestra']
]

Sorting by the i-th column is then the same as sorting a list of tuples by their i-th element. For the first column (index 0), you would do:

csv = sorted(csv, key=lambda x: x[0])

Alternatively, you can use the list's built-in sort method to sort in place:

csv.sort(key=lambda x:x[0])

To print each line, iterate over the list:

for line in csv:
    print(line)

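To get the output in the format asked for in the question (values separated by commas), convert each value to a string and join them inside the loop:

for line in csv:
    print(','.join(str(value) for value in line))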
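
The steps above start from data that is already a list of lists. As a rough sketch of the missing step, the rows could be read with the standard-library csv module and sorted on the numeric value of the first field. The helper name sort_rows_by_first_column and the file name data.csv are hypothetical, and the in-memory sample only stands in for a real file.

```python
import csv
import io


def sort_rows_by_first_column(lines):
    """Parse CSV lines and return the rows sorted by the first column as a float."""
    rows = list(csv.reader(lines))
    # Compare numerically: comparing the raw strings only gives the right
    # order while every value has the same number of digits.
    rows.sort(key=lambda row: float(row[0]))
    return rows


# Stand-in for open("data.csv", newline=""); the file name is hypothetical.
sample = io.StringIO(
    "0.0063,0.0121,band -> mcr music\n"
    "0.0061,0.0123,band -> mcr\n"
    "0.0062,0.0122,band -> orchestra\n"
)

for row in sort_rows_by_first_column(sample):
    # csv.reader yields strings, so the fields can be joined directly.
    print(",".join(row))
```

Note that this block uses csv as the module name, so the list should be called something else (rows here) to avoid shadowing it.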
leoschet
  • That may work for small data, but I have millions of lines like this and I don't want to do anything manually, such as putting them all in a list. – Viral Patel May 19 '18 at 00:30
  • If I understood correctly, you want to sort the data without loading it into Python? I don't think that's possible. You can't sort every row without looking at every row. – leoschet May 19 '18 at 00:32
  • If performance is the problem, you could try to parallelize a sort method such as `mergesort` – leoschet May 19 '18 at 00:33
  • If performance is a problem maybe we shouldn't use Python from the start. – Anton vBR May 19 '18 at 00:56
  • Scenarios like this are common in machine-learning work, where Python is a solid option – leoschet May 19 '18 at 01:23
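
On the memory concern raised in these comments: every row does have to be read, but not all at once. One option is an external merge sort, which sorts fixed-size chunks in memory, writes each to a temporary file, and then merges the sorted chunks with heapq.merge. The following is only a minimal sketch under that assumption; the function name external_sort, the chunk_rows parameter, and both paths are hypothetical.

```python
import csv
import heapq
import itertools
import os
import tempfile


def _first_column(row):
    """Sort key: the first field, compared as a number."""
    return float(row[0])


def external_sort(in_path, out_path, chunk_rows=100_000):
    """Sort a CSV file by its first column, holding only one chunk in memory."""
    chunk_paths = []
    try:
        # Pass 1: sort each chunk in memory and spill it to a temporary file.
        with open(in_path, newline="") as src:
            reader = csv.reader(src)
            while True:
                chunk = list(itertools.islice(reader, chunk_rows))
                if not chunk:
                    break
                chunk.sort(key=_first_column)
                fd, path = tempfile.mkstemp(suffix=".csv")
                chunk_paths.append(path)
                with os.fdopen(fd, "w", newline="") as tmp:
                    csv.writer(tmp, lineterminator="\n").writerows(chunk)

        # Pass 2: lazily merge the sorted chunk files into the output file.
        files = [open(path, newline="") for path in chunk_paths]
        try:
            merged = heapq.merge(*(csv.reader(f) for f in files), key=_first_column)
            with open(out_path, "w", newline="") as dst:
                csv.writer(dst, lineterminator="\n").writerows(merged)
        finally:
            for f in files:
                f.close()
    finally:
        # Remove the temporary chunk files even if something failed.
        for path in chunk_paths:
            os.remove(path)
```

A call such as external_sort("data.csv", "sorted.csv") would then write the sorted rows to a second file. All chunk files are open at the same time during the merge, so chunk_rows should be large enough to keep their number modest.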

Here is the equivalent in pandas. If you want quicker access to the file, check something like http://pythondata.com/working-large-csv-files-python/; the guide will help you build a database from the CSV.

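import io

import pandas as pd

data = '''\
0.0063,0.0121,band -> mcr music
0.0061,0.0123,band -> mcr
0.0062,0.0122,band -> orchestra'''

# io.StringIO stands in for a real file (pd.compat.StringIO is gone from current pandas)
file = io.StringIO(data)  # Replace with path/to/file
# With header=None the columns are labelled 0, 1, 2; sort by the first one, ascending
df = pd.read_csv(file, sep=',', header=None).sort_values(by=0)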

for i in df.values:
    print(i)

#df.to_csv('path/to/outfile', index=False, header=False)

Prints:

[0.0061 0.0123 'band -> mcr']
[0.0062 0.0122 'band -> orchestra']
[0.0063 0.0121 'band -> mcr music']
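
One way to read the database suggestion is to load the rows into SQLite once and let it do the ordering. This is a rough sketch using the standard-library sqlite3 module, not necessarily what the linked guide does. The table name rows, the column names a, b and label, and the helper names are made up for illustration, and the sketch assumes every line has exactly three fields.

```python
import csv
import io
import sqlite3


def load_csv(conn, lines):
    """Insert CSV lines into a three-column table (names are illustrative)."""
    conn.execute("CREATE TABLE IF NOT EXISTS rows (a REAL, b REAL, label TEXT)")
    # Convert the numeric fields explicitly so they are stored as numbers.
    parsed = ((float(a), float(b), label) for a, b, label in csv.reader(lines))
    conn.executemany("INSERT INTO rows VALUES (?, ?, ?)", parsed)
    conn.commit()


def sorted_rows(conn):
    """Return a cursor that yields the rows ordered by the first column."""
    return conn.execute("SELECT a, b, label FROM rows ORDER BY a")


# ":memory:" keeps the demo self-contained; pass a file path to persist the data.
conn = sqlite3.connect(":memory:")
sample = io.StringIO(
    "0.0063,0.0121,band -> mcr music\n"
    "0.0061,0.0123,band -> mcr\n"
    "0.0062,0.0122,band -> orchestra\n"
)
load_csv(conn, sample)

for a, b, label in sorted_rows(conn):
    print(f"{a},{b},{label}")
```

With an on-disk database the CSV only has to be parsed once, and later queries can sort or filter without rereading the original file.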
Anton vBR