
I have a directory with several hundred thousand files in it.

They all follow this format:

datetime_fileid_metadata_collect.txt

A specific example looks like this:

201405052359559_0002230255_35702088_collect88.txt

I am trying to write a script that pulls out and copies individual files when all I provide it is a list of file ids.

For example, I have a text document, fileids.txt, that contains this:

fileids.txt
0002230255
0001627237
0001023000

This is the example script I have written so far. The file1 result keeps coming back as [].

import os
import re, glob, shutil
base_dir = 'c:/stuff/tub_0_data/'
destination = 'c:/files_goes_here'
os.chdir(base_dir)
text_file = open('c:/stuff/fileids.txt', 'r')
file_ids = text_file.readlines()
#file_ids = [stripped for stripped in (line.strip() for line in text_file.readlines()) if stripped]
for ids in file_ids:
    id1 = ids.rstrip()
    print 'file id = ',str(id1)
    file1 = glob.glob('*' + str(id1) + '*')
    print str(file1)
    if file1:
        # glob.glob returns a list, so copy each match individually.
        for name in file1:
            shutil.copy(base_dir + name, destination)

I know I don't fully understand glob or regular expressions yet. What would I put there if I want to find files based on a specific string in their filename?

EDIT:

glob.glob('*' + stuff + '*')

worked for finding things within the filename. The issue was that I was not stripping the trailing newline from each file ID.

AlienAnarchist
  • Change this line: `file_ids = text.file.readlines()` to `file_ids = text_file.readlines()` and run it again. The typo on that `_` could be a problem. – WGS Sep 22 '14 at 22:40
  • Fixed. It's actually part of a much bigger script, and the typo isn't present in the original code. I just rewrote the core code for my question. The glob.glob call is where I believe my problem is. – AlienAnarchist Sep 22 '14 at 22:44

2 Answers


text_file.readlines() returns each line in full, including the trailing '\n', so every glob pattern built from it contains a newline that no filename has. Try stripping it. The following will strip newlines and remove empty lines:

file_ids = [line.strip() for line in text_file if not line.isspace()]
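
To see the effect in isolation, here is a minimal sketch. It assumes a throwaway temporary directory holding one empty file named like the example in the question, and an in-memory stand-in for fileids.txt; demo_dir, find and the sample IDs are illustrative names only, not part of the original script.

```python
import glob
import io
import os
import tempfile

# Hypothetical demo data: one empty file named like the example in the question.
demo_dir = tempfile.mkdtemp()
name = "201405052359559_0002230255_35702088_collect88.txt"
open(os.path.join(demo_dir, name), "w").close()

# In-memory stand-in for fileids.txt, including a blank line.
text_file = io.StringIO(u"0002230255\n\n0001627237\n")

# readlines() keeps the trailing newline on every entry.
raw_ids = text_file.readlines()

# Re-read from the start, stripping whitespace and dropping blank lines.
text_file.seek(0)
clean_ids = [line.strip() for line in text_file if not line.isspace()]


def find(file_id):
    """Return the files in demo_dir whose name contains file_id."""
    return glob.glob(os.path.join(demo_dir, "*" + file_id + "*"))


raw_matches = find(raw_ids[0])      # the pattern contains "\n", so nothing matches
clean_matches = find(clean_ids[0])  # the pattern is "*0002230255*"

print(repr(raw_ids[0]), len(raw_matches))
print(repr(clean_ids[0]), len(clean_matches))
```

Dropping the blank lines matters as well: a blank line stripped down to an empty string would produce the pattern `**`, which matches every file in the directory.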
tdelaney
  • This isn't working in IDLE. The comma in `for line in text_file,readlines()` is causing an error. – AlienAnarchist Sep 22 '14 at 22:49
  • @AlienAnarchist - yeah, i just noticed. that should be a period. fixed it. – tdelaney Sep 22 '14 at 22:49
  • Still getting [] returned for every entry – AlienAnarchist Sep 22 '14 at 22:50
  • It would probably be better to use `if not line.isspace()` on the inner loop, instead of iterating through it twice. – parchment Sep 22 '14 at 22:51
  • @AlienAnarchist - I assume that glob.glob("*") lists a bunch of files and that the stuff printed by `print 'file id = ',str(ids)` is what you want... that is, if you went to the command line and entered `dir c:\\files_goes_here\\*theidthatprinted*`, you get the files you want? – tdelaney Sep 22 '14 at 22:56
  • Actually, I'm an idiot: it works now, and the trailing newline was the issue. glob.glob('*' + stuff + '*') worked after trying enough things. – AlienAnarchist Sep 22 '14 at 22:58
  • @AlienAnarchist - something else to consider... since this is a huge directory listing, you could read it once with os.listdir() and then just filter that list for each id (maybe even split out the id part of the file name). Glob re-reads the directory each time and that's expensive for you. – tdelaney Sep 22 '14 at 23:00

Your issue may have been the trailing newline, and it may already have been answered, but I think the code could do with some cleaning up. I don't see the need for `import os` and `import re`, unless they are part of your bigger code.

Something like the following works well enough.

Code:

import glob
import shutil

base_dir = "C:/Downloads/TestOne/"
dest_dir = "C:/Downloads/TestTwo/"

with open("blah.txt", "r") as ofile:
    # Strip the trailing newline from each ID and skip blank lines.
    lines = [line.strip() for line in ofile if line.strip()]

for line in lines:
    print "File ID to Process: {}".format(line)
    pattern_ = base_dir + "*" + line + "*"
    print pattern_
    matches = glob.glob(pattern_)
    if not matches:
        # No filename contains this ID, so there is nothing to copy.
        continue
    print matches[0]
    shutil.copy(matches[0], dest_dir)
    print "{} copied.".format(matches[0])

Output:

File ID to Process: 123456
C:/Downloads/TestOne/*123456*
C:/Downloads/TestOne\foobar_123456_spam.txt
C:/Downloads/TestOne\foobar_123456_spam.txt copied.
[Finished in 0.4s]

glob is a rather expensive operation, though, because it re-reads the directory on every call. You're better off listing the files once at the start and matching against that list, copying as you hit a match. Hope this helps.
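
A rough sketch of that list-once approach follows. It assumes the datetime_fileid_metadata_collect.txt naming from the question, so the file ID is the second underscore-separated field; copy_by_id is a hypothetical helper name, and the destination directory is assumed to exist already.

```python
import os
import shutil


def copy_by_id(base_dir, dest_dir, file_ids):
    """Copy every file in base_dir whose file-ID field is in file_ids.

    Returns the names of the files that were copied.
    """
    wanted = set(file_ids)  # set membership tests are cheap
    copied = []
    for name in os.listdir(base_dir):  # the directory is read only once
        parts = name.split("_")
        # Assumed layout: datetime_fileid_metadata_collect.txt
        if len(parts) >= 2 and parts[1] in wanted:
            shutil.copy(os.path.join(base_dir, name), dest_dir)
            copied.append(name)
    return copied
```

Because this compares the ID field exactly, it will not pick up a file that merely contains the same digits in its datetime or metadata field, which a `*id*` glob pattern would. Whether that stricter behaviour is what you want depends on your data.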

WGS
  • I agree that glob is the slow way to go and that's why I stuck with readlines in my answer... having the lines of fileids.txt in a list is nice if OP changes his code to iterate with os.listdir. – tdelaney Sep 22 '14 at 23:07
  • Agree as well. Even if `glob` doesn't use `regex`, I think the Unix feature it's based on cannot possibly be faster than the microseconds it takes to scan a list instead. Also, cleaner code is always a plus. – WGS Sep 22 '14 at 23:11