
I have a folder with 100k text files. I want to put files with over 20 lines in another folder. How do I do this in python? I used os.listdir, but of course, there isn't enough memory for even loading the filenames into memory. Is there a way to get maybe 100 filenames at a time?

Here's my code:

import os
import shutil

dir = '/somedir/'

def file_len(fname):
    f = open(fname,'r')
    for i, l in enumerate(f):
        pass
    f.close()
    return i + 1

filenames = os.listdir(dir+'labels/')

i = 0
for filename in filenames:
    flen = file_len(dir+'labels/'+filename)
    print flen
    if flen > 15:
        i = i+1
        shutil.copyfile(dir+'originals/'+filename[:-5], dir+'filteredOrigs/'+filename[:-5])
print i

And Output:

Traceback (most recent call last):
  File "filterimage.py", line 13, in <module>
    filenames = os.listdir(dir+'labels/')
OSError: [Errno 12] Cannot allocate memory: '/somedir/'

Here's the modified script:

import os
import shutil
import glob

topdir = '/somedir'

def filelen(fname, many):
    f = open(fname,'r')
    for i, l in enumerate(f):
        if i > many:
            f.close()
            return True
    f.close()
    return False

path = os.path.join(topdir, 'labels', '*')
i=0
for filename in glob.iglob(path):
    print filename
    if filelen(filename,5):
        i += 1
print i

It works on a folder with fewer files, but on the larger folder all it prints is "0"... works on the Linux server, prints 0 on the Mac... oh well...

extraeee
  • "there isn't enough memory for even loading the filenames into memory" Really? 100K file names isn't really all that much memory. What error are you getting? Can you post the snippet of code? – S.Lott Feb 01 '10 at 14:20
  • Why is memory a problem? 100k files with names of, say, 10 characters each, is 10^7 bytes = 10 megabytes, not too big really. – Andrew Jaffe Feb 01 '10 at 14:21
  • I agree that an OOM is strange. What happens if you enter `filenames = os.listdir('/somedir/labels/')` at the REPL? – Charles Stewart Feb 01 '10 at 15:02
  • What OS is this? Windows? Linux? Which Linux? Can you do a "cat" (or something) which will read every single file in the directory? – S.Lott Feb 01 '10 at 15:15
  • @S.Lott this is Linux 2.6.24 server – extraeee Feb 01 '10 at 17:34
  • "Linux 2.6.24"? That's the kernel, which is shared by a large number of distributions. Which Linux distribution? Fedora? Ubuntu? Suse? – S.Lott Feb 01 '10 at 17:57
  • actually, i'm testing on my mac osx snowleopard – extraeee Feb 01 '10 at 22:46
  • @cseric: you might want to add check for the existence of the `labels` directory. Also, `i = i + 1` should rather be `i += 1`. – SilentGhost Feb 02 '10 at 13:55

6 Answers


You might try using glob.iglob, which returns an iterator:

topdir = os.path.join('/somedir', 'labels', '*')
for filename in glob.iglob(topdir):
     if file_len(filename) > 15:
          #do stuff

Also, please don't use dir for a variable name: you're shadowing the built-in.

Another major improvement that you can introduce is to your filelen function. If you replace it with the following, you'll save a lot of time. Trust me, what you have now is the slowest alternative:

def many_line(fname, many=15):
    for i, line in enumerate(open(fname)):
        if i > many:
            return True
    return False
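Putting these pieces together with the question's copy step, a full sketch might look like the following (the paths are the hypothetical ones from the question, and `many_lines` is the same early-exit counter under a slightly different name):

```python
import glob
import os
import shutil

def many_lines(fname, many=20):
    # Early exit: stop reading as soon as more than `many` lines are seen.
    with open(fname) as f:
        for i, line in enumerate(f):
            if i >= many:
                return True
    return False

topdir = '/somedir'  # hypothetical path from the question
pattern = os.path.join(topdir, 'labels', '*')

copied = 0
for filename in glob.iglob(pattern):  # iterator: never materializes the full listing
    if many_lines(filename, 20):
        base = os.path.basename(filename)
        shutil.copyfile(os.path.join(topdir, 'originals', base),
                        os.path.join(topdir, 'filteredOrigs', base))
        copied += 1
print(copied)
```

Because iglob yields names lazily, memory use stays flat no matter how many files the directory holds.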
SilentGhost

A couple of thoughts. First, you might use the glob module to get smaller groups of files. Second, filtering by line count is going to be very time consuming, as you have to open every file and count its lines. If you can partition by byte count instead, you can avoid opening the files by using os.stat. If it's crucial that the split happens at 20 lines, you can at least rule out large swaths of files by figuring out the minimum number of bytes a 20-line file of your type must have, and not opening any file smaller than that.
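A sketch of that pre-filter idea (the 20-byte minimum and the helper names are made up for illustration; pick a threshold based on your shortest plausible 20-line file):

```python
import os

def worth_opening(path, min_bytes=20):
    # os.stat is far cheaper than opening and reading the file:
    # anything smaller than min_bytes cannot hold more than 20 lines.
    return os.stat(path).st_size >= min_bytes

def has_more_than(path, n):
    # Early-exit line counter: stops as soon as line n + 1 is seen.
    with open(path) as f:
        for i, _ in enumerate(f):
            if i >= n:
                return True
    return False

def over_twenty_lines(path):
    # Only open files the cheap size check couldn't rule out.
    return worth_opening(path) and has_more_than(path, 20)
```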

jcdyer
import os,shutil
os.chdir("/mydir/")
numlines=20
destination = os.path.join("/destination","dir1")
for file in os.listdir("."):
    if os.path.isfile(file):
        flag=0
        for n,line in enumerate(open(file)):
            if n > numlines: 
                flag=1
                break
        if flag:
            try:
                shutil.move(file,destination) 
            except Exception,e: print e
            else:
                print "%s moved to %s" %(file,destination)
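For readers on Python 3, the same approach translates roughly to the following (wrapped in a function with hypothetical arguments, and keeping the original `i > numlines` comparison):

```python
import os
import shutil

def move_long_files(source, destination, numlines=20):
    # Move every regular file in `source` whose line count exceeds the
    # threshold (mirroring the original `n > numlines` check) into
    # `destination`, returning the names that were moved.
    moved = []
    for name in os.listdir(source):
        path = os.path.join(source, name)
        if not os.path.isfile(path):
            continue
        with open(path) as f:
            # any() short-circuits, so reading stops at the threshold.
            if any(i > numlines for i, _ in enumerate(f)):
                shutil.move(path, destination)
                moved.append(name)
    return moved
```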
ghostdog74

How about using a shell script? You could process one file at a time:

for f in *; do
  if [ "$(wc -l < "$f")" -gt 20 ]; then
    mv "$f" newfolder/
  fi
done

Please correct me if I'm wrong in any way.

Aadith Ramia

The currently accepted answer just plain doesn't work. This function:

def many_line(fname, many=15):
    for i, line in enumerate(line):
        if i > many:
            return True
    return False

has two problems: Firstly, the fname arg is not used and the file is not opened. Secondly, the call to enumerate(line) will fail because line is not defined.

Changing enumerate(line) to enumerate(open(fname)) will fix it.
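With that change, and using a `with` block so the file handle is closed deterministically (a minor extra tweak, not part of the original answer), the function becomes:

```python
def many_line(fname, many=15):
    # `with` guarantees the handle is closed even on the early return.
    with open(fname) as f:
        for i, line in enumerate(f):
            if i > many:
                return True
    return False
```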

John Machin

You can use os.scandir, which returns an iterator and therefore does not read all file names into memory at once (built in since Python 3.5; on older versions, just `pip install scandir`).

Example:

import os
for entry in os.scandir(path):
    do_something_with_file(entry.path)

scandir documentation: https://pypi.org/project/scandir/
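Combining os.scandir with an early-exit line count, a sketch (the 20-line threshold and the function name are illustrative):

```python
import os

def files_over(path, threshold=20):
    # DirEntry carries the full path, and entry.is_file() can use cached
    # type information, often avoiding an extra stat() call per entry.
    result = []
    for entry in os.scandir(path):
        if not entry.is_file():
            continue
        with open(entry.path) as f:
            # any() short-circuits once line `threshold + 1` is seen.
            if any(i >= threshold for i, _ in enumerate(f)):
                result.append(entry.name)
    return result
```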

Anon