
This regex is supposed to match a filename in exactly this format:

201308 - (82608) - MAC 2233-007-Methods of Calculus - Lastname, Lee.txt

The only caveat is that two parts can be a variable number of letters: the course name, and the instructor name between the last hyphen and the .txt. Everything else has exactly that number of characters in that format (either digits separated by exactly those spaces and hyphens, or that exact course prefix in all capital letters).

What the regex is actually doing is finding nothing at all. Without trying to escape the parentheses it was catching some files, but now nada. I'm using re.search instead of re.match because obviously the regex is not finished and I'm testing pieces of it.

import re, os, sys, shutil

def readDir(path1):
    return [ f for f in os.listdir(path1) if os.path.isfile(os.path.join(path1,f)) ]

def files(dir1,term,path1):
    match2 = []; stillWrong = []#; term = str(term)
    for f in dir1:
        result = re.search(term + "\s\b\s\(\d{5}\)\s\b\s\w{3}\s\d{4}\b\d{3}[a-z\A-Z]+\s\b\s[A-z\a-z]+\b\s[A-Z\a-z]+ .txt",f)
        if result: match2.append(f)
        else: stillWrong.append(f)
        #print "split --- ",os.path.split(f)
        ##else: os.rename(path1+'\\'+f, path1+'\\'+'@ '+f); stillWrong.append(f)
        print "f ---- ",f
    return match2, stillWrong

term = "201308"; src = "testdir1"; dest = "testdir2"

print files(readDir(dest),term,dest)

This produces the (obviously) wrong:

    >>> 
f ----  @ @ @ @ @ @ 123 abc - a-1 - b-2.txt
f ----  @ @ @ @ @ @ 201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt
f ----  @ @ @ @ @ @ 201308 abc 123.txt
f ----  @ @ @ @ @ @ 201308-(12345) - Abc 2233-007-course Name - last, first.txt
f ----  @ @ @ @ @ @ 45-12 - xyz - mno - 123-pqr-tuv-456.txt
f ----  @ @ @ @ @ @ @ @ @ 201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt
f ----  @ @ @ @ @ @ @ @ @ 201308 abc 123.txt
f ----  @ @ @ @ @ @ @ @ @ 201308-(12345) - Abc 2233-007-course Name - last, first.txt
f ----  @ @ @ @ @ @ @ @ @ @ 123 abc - a-1 - b-2.txt
f ----  @ @ @ @ @ @ @ @ @ @ 45-12 - xyz - mno - 123-pqr-tuv-456.txt
f ----  @ @ @ @ @ @ @ @ @ @ @ xxxxx xxxxx xxxxx 123 abc - a-1 - b-2.txt
f ----  @ @ @ @ @ @ @ @ @ @ @ xxxxx xxxxx xxxxx 45-12 - xyz - mno - 123-pqr-tuv-456.txt
([], ['@ @ @ @ @ @ 123 abc - a-1 - b-2.txt', '@ @ @ @ @ @ 201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt', '@ @ @ @ @ @ 201308 abc 123.txt', '@ @ @ @ @ @ 201308-(12345) - Abc 2233-007-course Name - last, first.txt', '@ @ @ @ @ @ 45-12 - xyz - mno - 123-pqr-tuv-456.txt', '@ @ @ @ @ @ @ @ @ 201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt', '@ @ @ @ @ @ @ @ @ 201308 abc 123.txt', '@ @ @ @ @ @ @ @ @ 201308-(12345) - Abc 2233-007-course Name - last, first.txt', '@ @ @ @ @ @ @ @ @ @ 123 abc - a-1 - b-2.txt', '@ @ @ @ @ @ @ @ @ @ 45-12 - xyz - mno - 123-pqr-tuv-456.txt', '@ @ @ @ @ @ @ @ @ @ @ xxxxx xxxxx xxxxx 123 abc - a-1 - b-2.txt', '@ @ @ @ @ @ @ @ @ @ @ xxxxx xxxxx xxxxx 45-12 - xyz - mno - 123-pqr-tuv-456.txt'])
>>> 

As you can see, the match2 list (the first list in the returned tuple) is empty, even though it should hold the relevant matches; every filename ended up in the second list, stillWrong. I'm teaching myself Python and regex, and it's not going well. I've tried these (and regex tutorials), but they didn't seem helpful in this case:

Escaping regex string in Python

Regex to escape the parentheses

How to implement \p{L} in python regex

All of the @ are from the os.rename that you see commented out, but it didn't work before that was commented anyhow. I'm sure any entry-level programmer could top this off in a few minutes, but if a pro happens on this question and would spare a minute, that's great too.

EDIT: List of filenames used (production list is much longer obviously):

201308-(12345) - Abc 2233-007-course Name - last, first.txt
201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt
@ @ @ @ @ @ 201308 abc 123.txt
@ @ @ @ @ @ 123 abc - a-1 - b-2.txt
@ @ @ @ @ @ 45-12 - xyz - mno - 123-pqr-tuv-456.txt
@ @ @ @ @ @ @ @ @ 201308-(12345) - Abc 2233-007-course Name - last, first.txt
@ @ @ @ @ @ @ @ @ 201308 abc 123.txt
@ @ @ @ @ @ @ @ @ 201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt
@ @ @ @ @ @ @ @ @ @ 123 abc - a-1 - b-2.txt
@ @ @ @ @ @ @ @ @ @ 45-12 - xyz - mno - 123-pqr-tuv-456.txt
@ @ @ @ @ @ @ @ @ @ @ xxxxx xxxxx xxxxx 123 abc - a-1 - b-2.txt
@ @ @ @ @ @ @ @ @ @ @ xxxxx xxxxx xxxxx 45-12 - xyz - mno - 123-pqr-tuv-456.txt
45-12 - xyz - mno - 123-pqr-tuv-456.txt
123 abc - a-1 - b-2.txt
201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt
201308 abc 123.txt
201308-(12345) - Abc 2233-007-course Name - last, first.txt
stackuser

2 Answers


Some things seem very strange to me:

  • \s\b\s is aberrant because \b means "Matches the empty string, but only at the beginning or end of a word", but here it sits between two symbols meaning whitespace, that is to say not at the beginning or end of a word. (In a non-raw Python string, "\b" is moreover the backspace character, not a word boundary.)

  • the antislash (backslash) in [A-z\a-z] is an error: in a non-raw Python string, \a is the BEL character. I wonder what it's supposed to mean here. Do you want an antislash as a possible character of the set? Then write [A-z\\\\a-z]
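The two points above can be checked directly. This is a minimal sketch, assuming the example filename from the question; the variable names are only illustrative.

```python
import re

name = "201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt"

# In a normal (non-raw) string literal, "\b" is the backspace character
# and "\a" is the BEL character, so the regex engine never sees a
# word-boundary token or a backslash.
assert "\b" == "\x08"
assert "\a" == "\x07"

# In a raw string, \b is a word boundary. Between two whitespace
# characters there is no boundary, so \s\b\s cannot match here.
never = re.search(r"\s\b\s", name)

# Spelling the separator literally works: whitespace, hyphen, whitespace.
sep = re.search(r"\s-\s", name)

print(never)               # None
print(repr(sep.group(0)))  # ' - '
```

Writing patterns as raw strings (`r"..."`) avoids the first problem entirely, since Python then passes the backslashes through to the `re` module unchanged.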

This regex matches your example string:

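r = re.compile(term +
               (r"\s-\s"                  # " - " after the term
                r"\(\d{5}\)"              # five digits in parentheses
                r"\s-\s"
                r"\w{3}\s\d{4}-\d{3}-"    # course prefix, number, section
                r"[a-zA-Z ]+"             # course name
                r"\s-\s"
                r"[A-Za-z]+,\s"           # last name and comma
                r"[A-Za-z]+ *\.txt"))     # first name and extension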
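As a quick check, the pattern can be run over a few of the filenames from the question. This is only a sketch: the `names` list is a hand-picked subset of the question's list, and the pattern is rebuilt here with raw strings, `[A-Za-z]` and an escaped dot before `txt`.

```python
import re

term = "201308"
pattern = re.compile(
    term
    + r"\s-\s\(\d{5}\)\s-\s"           # " - (82608) - "
    + r"\w{3}\s\d{4}-\d{3}-"           # "MAC 2233-007-"
    + r"[a-zA-Z ]+\s-\s"               # "Methods of Calculus - "
    + r"[A-Za-z]+,\s[A-Za-z]+ *\.txt"  # "Klingler, Lee.txt"
)

names = [
    "201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt",
    "@ @ @ @ @ @ 201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt",
    "201308-(12345) - Abc 2233-007-course Name - last, first.txt",
    "201308 abc 123.txt",
]

# Split the names the same way the question's files() function does.
matched = [n for n in names if pattern.search(n)]
unmatched = [n for n in names if not pattern.search(n)]

print("matched: %d, unmatched: %d" % (len(matched), len(unmatched)))
```

Because `re.search` scans the whole string, the name prefixed with `@` characters still matches; `re.match` would only accept names that begin with the term.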
eyquem
  • Sorry for any confusion I caused with the antislash; that was my own misunderstanding of what I should have been doing in the regex. Yours is a beautiful regex, it's just that someone answered earlier than you, but if I could accept 2 answers then yours would be it. And I can clearly see a lot of what I was doing wrong by looking at how you corrected it. – stackuser Sep 01 '13 at 18:57

\d{6}\s-\s\(\d{5}\)\s-\s\w{3}\s\d{4}-\d{3}-[^\.]+\.txt matches the string you sent in as an example. If the initial value is unknown, term + '\s-\s\(\d{5}\)\s-\s\w{3}\s\d{4}-\d{3}-[^\.]+\.txt' should do it (provided term plays nice for the regex).

adding a test run sample:

>>> term = '201308'
>>> f = '201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt'
>>> re.search(term + '\s-\s\(\d{5}\)\s-\s\w{3}\s\d{4}-\d{3}-[^\.]+\.txt', f).group(0)
'201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt'

yet another:

>>> f = '/somefolder/somefolder2/201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt'
>>> re.search(term + '\s-\s\(\d{5}\)\s-\s\w{3}\s\d{4}-\d{3}-[^\.]+\.txt', f).group(0)
'201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt'

>>> f = 'c:\\somefolder\\somefolder2\\201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt'
>>> re.search(term + '\s-\s\(\d{5}\)\s-\s\w{3}\s\d{4}-\d{3}-[^\.]+\.txt', f).group(0)
'201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt'
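On the caveat that term has to "play nice" with the regex: one way to guard against that, sketched here under the assumption that term may contain regex metacharacters, is to pass it through `re.escape` first. The helper name `build_pattern` and the dotted term `"2013.8"` are hypothetical and not part of the original code.

```python
import re

def build_pattern(term):
    """Compile the filename pattern with `term` treated as literal text."""
    # re.escape backslash-escapes regex metacharacters in term,
    # so a "." in term matches only a literal dot.
    return re.compile(
        re.escape(term) + r"\s-\s\(\d{5}\)\s-\s\w{3}\s\d{4}-\d{3}-[^.]+\.txt"
    )

f = "201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt"

# A plain numeric term behaves exactly as in the samples above.
hit = build_pattern("201308").search(f)

# Hypothetical term containing a metacharacter. Escaped, "2013.8" no
# longer matches "201308"; unescaped, the "." matches any character.
miss = build_pattern("2013.8").search(f)
loose = re.search("2013.8" + r"\s-\s\(\d{5}\)", f)

print(hit is not None, miss is None, loose is not None)
```

For a purely numeric term such as `"201308"`, escaping changes nothing, so it is a cheap safeguard.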
planestepper
  • Unfortunately `match2[]` is still empty and `stillWrong[]` is totally full. But that clears up some confusion I was having about where to place the hyphens and whether to use `\b` and few other things. But still not working, sorry. – stackuser Sep 01 '13 at 18:33
  • Add a list of file names so that a proper test can be run – planestepper Sep 01 '13 at 18:35
  • Yes, you're right the problem is not in your regex. That's what I need and what I needed to learn as well. – stackuser Sep 01 '13 at 18:53