9

i always use this commmand line to sort and get uniq lines only and it works as a charm even with large files (over 500,000 lines)

sort filename.txt | uniq | sponge filename.txt

shortest equivalent python code would be

f = open("filename.txt", "r")
lines = [line for line in f]
lines = lines.sort()
lines = set(lines)

but of course this is not scalable because of memory constrains and writing scalable code in python would take time , so i wonder what is the shortest equivalent code (package) in python

evandrix
  • 6,041
  • 4
  • 27
  • 38
Hady Elsahar
  • 2,121
  • 4
  • 29
  • 47
  • 2
    `sort` creates temporary files to hold the intermediate results when you sort huge files, so it's not really that simple. Why do you want to reinvent the wheel? – Blender Nov 04 '13 at 09:22
  • 2
    Apply `set` before `sort` call, that would decrease the `N` in `NlogN`. – Ashwini Chaudhary Nov 04 '13 at 09:23
  • yes , i don't want to reinvent the wheel , `lines` will hold all the file lines in the memory , so i'm asking if there's something that does the same thing as the linux command (scalable , fast) in python – Hady Elsahar Nov 04 '13 at 09:24
  • 1
    @HadyElsahar: I'm confused. If you don't want to reinvent the wheel, why do you want to rewrite `sort | uniq` in Python? – Blender Nov 04 '13 at 09:26
  • @HadyElsahar You can use the `subprocess` module to execute that command from python code, is that what you want? – Ashwini Chaudhary Nov 04 '13 at 09:27
  • i want to implement it inside a python framework and i don't prefer running linux commands from python – Hady Elsahar Nov 04 '13 at 09:28
  • Then you're making it a [XY problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). – Ashwini Chaudhary Nov 04 '13 at 09:29
  • 1
    @HadyElsahar: There's nothing wrong with calling external commands. If you don't want to write the result to a file, read from the stdout of the process (although you may run into buffering issues if it writes faster than you can read). `sort` and `uniq` are portable and will run faster than any pure-Python code. – Blender Nov 04 '13 at 09:32
  • @hcwhsa i don't like mixing python with unix commands inside , but if you said to me that it's common practice to do so , i would take it from you as an answer – Hady Elsahar Nov 04 '13 at 09:32
  • @Blender ok so it's fine then , thanks! – Hady Elsahar Nov 04 '13 at 09:33
  • 1
    @HadyElsahar: There are modules that wrap `subprocess` to make it prettier, if that's deterring you: http://amoffat.github.io/sh/ – Blender Nov 04 '13 at 09:34
  • @HadyElsahar It's ok to use system commands if you are sure this code doesn't need to be portable. It will obviously not run on windows if you use `sort`/`uniq` etc. – Hari Menon Nov 04 '13 at 11:56
  • 1
    `cat file | sort | uniq` is simply `sort -u file`. No need to go through multiple pipes. – DSM Nov 04 '13 at 12:23

4 Answers4

8

You don't need to do a sort in python since set would take care of uniqueness even without sorting.

f = open("filename.txt", "r")
lines = set(f.readlines())

The shell sort command would also load the lines into memory, so using that would not get you any memory savings. If you have really large files or you are adamant on not using additional memory, you can try some crazy tricks like the one shown here: http://neopythonic.blogspot.in/2008/10/sorting-million-32-bit-integers-in-2mb.html

Hari Menon
  • 33,649
  • 14
  • 85
  • 108
  • writing to a file would need loading all the file into the memory first, that's what i want to skip , and also writing extra code to handle this – Hady Elsahar Nov 04 '13 at 09:26
  • Even if you use shell `sort`, it loads the lines into memory. So using shell will not help you in that regard. The above solution would use less memory than `sort` in shell. So, given a choice between calling `sort | uniq` from python and using the above code, you should try to do it within python since the shell doesn't offer any advantage. – Hari Menon Nov 04 '13 at 12:02
  • @HariShankar: `sort` doesn't need to store everything in memory (it may use temporary files) – jfs Nov 04 '13 at 13:23
3

There is an iterator that does what sort does, sorted. Let's make one that mimics uniq, by only yielding lines that aren't equal to the previous line:

def uniq(iterator):
    previous = float("NaN")  # Not equal to anything
    for value in iterator:
        if previous != value:
            yield value
            previous = value

Now you can do the same thing, with:

with open('/path/to/filename') as f:
    for line in uniq(sorted(f)):
        print(line)

BUt sorted (and shell's sort) has to store everything anyway (what if the last line in the file should be output first), so it's worse than just using set(f) instead of uniq(sorted(f)).

RemcoGerlich
  • 30,470
  • 6
  • 61
  • 79
  • You could use `itertools` module to implement `uniq = lambda it: imap(itemgetter(0), groupby(it))`. `sort` doesn't need to store everything in memory (it may use temporary files). To support large input files, see [Sorting text file by using Python](http://stackoverflow.com/a/16954837/4279). – jfs Nov 04 '13 at 13:00
2

Here is a shorter example:

with open("filename.txt", 'r') as f:
    lines = set(f)

Also, one thing, that should be noticed, that in this case, only one line at a time will be loaded into memory. The reason for this is that the above code is equivalent to:

lines = set()
f = open("filename.txt", 'r')
for line in f: # now f works as a generator of lines, reading only one line at a time
     lines.add(line)
Alexander Zhukov
  • 4,357
  • 1
  • 20
  • 31
2

use shell commands from python:

import os
os.system("sort filename.txt | uniq | sponge filename.txt")
evandrix
  • 6,041
  • 4
  • 27
  • 38
Eyal Ch
  • 9,552
  • 5
  • 44
  • 54