I'm trying to write a regex in Python to find a directory path. The text I have is shown below:

text = "The public disk is: \\\\diskA\\FolderB\\SubFolderC\\FileD"

I tried to use:

import re
my_regex = re.compile(r'\\(.*?)+\\(.*?)')
result = my_regex.search(text)
print(result)

This is what I got as the result:

<_sre.SRE_Match object; span=(7, 9), match='\\\\'>

So it seems like the regex can recognize \\, but not \... Has anyone run into a similar situation before? Please help. Any advice is welcome! Thanks!!

S.J
  • It seems that you are mixing raw strings and regular strings. Have a look at raw string literals in Python: https://stackoverflow.com/questions/2081640/what-exactly-do-u-and-r-string-flags-do-and-what-are-raw-string-literals Also check out [the backslash plague](https://docs.python.org/3.7/howto/regex.html#the-backslash-plague) in the Python regex docs. – Marc Wagner Oct 03 '18 at 14:02
  • Hi Marc, the backslash def helped. I've changed my_regex = re.compile(r'\\(.*?)+\\(.*?)'), so now it can find \\\\, but still not the double backslash... – S.J Oct 03 '18 at 14:37
  • Try the following statement, it should work: search = r'\\(\\[\w]*)*\\' -- if you quickly want to debug your regex statement, have a look at this site: https://regex101.com/ Another useful resource is this video: https://www.youtube.com/watch?v=bgBWp9EIlMM – Marc Wagner Oct 03 '18 at 14:45
  • Sorry, comments don't play nice with backslashes... I posted the code below. – Marc Wagner Oct 03 '18 at 14:52

1 Answer

It looks like your regex search term does not do what you want it to do.

Try this regex:

import re
text = r"The public disk is: \\diskA\FolderB\SubFolderC\FileD"

searchtext = r'\\(\\\w+)*\\'

my_regex = re.compile(searchtext)
result = my_regex.search(text)
print(result.group())

>>> \\diskA\FolderB\SubFolderC\

OK, so what's going on here? It may help to follow along in an online regex editor such as https://regex101.com/

It looks like your folders are always structured like this:

\\disk\folder\subfolder\sub-subfolder\...etc..\file

So the structure we want to look for is something starting with \\ and ending with \, with one or more disk or directory names (made of word characters) in between.

The query looks for a piece of text that starts and ends with a \ and has zero or more \dir segments between them. So \\, \\disk\, and \\disk\dir\ all match.
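As a quick check of the claim above, here is a minimal sketch that runs the pattern against those three strings, plus one path without a trailing backslash. The names path_regex, candidates and results are only illustrative, and re.fullmatch is used so that the whole string has to match.

```python
import re

# Same pattern as in the answer above.
path_regex = re.compile(r'\\(\\\w+)*\\')

# Written with doubled backslashes (non-raw strings), because a raw
# string literal cannot end in a single backslash.
candidates = [
    "\\\\",             # \\
    "\\\\disk\\",       # \\disk\
    "\\\\disk\\dir\\",  # \\disk\dir\
    "\\\\disk\\dir",    # \\disk\dir  (no trailing backslash)
]

# fullmatch() only succeeds if the entire string fits the pattern.
results = {c: path_regex.fullmatch(c) is not None for c in candidates}

for candidate, matched in results.items():
    print(candidate, matched)
```

The first three strings match in full. The last one does not, because the pattern requires a closing backslash.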

Putting the query together, we get:

\\ # the starting backslash (escaped because backslash is also a special character)
(\\\w+)* # one or more word characters (\w) preceded by an escaped backslash repeated zero or more times
\\ # finally another backslash, escaped
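The same breakdown can be kept inside the code by compiling the pattern with re.VERBOSE, which ignores whitespace and allows # comments inside the pattern. This is just a sketch of one way to write it; the name path_regex is made up for the example.

```python
import re

# re.VERBOSE lets the pattern span several lines with inline comments.
path_regex = re.compile(r"""
    \\          # the starting backslash, escaped
    (\\\w+)*    # zero or more groups of: a backslash, then one or more word characters
    \\          # the closing backslash, escaped
""", re.VERBOSE)

text = r"The public disk is: \\diskA\FolderB\SubFolderC\FileD"

match = path_regex.search(text)
print(match.group())
```

This should print the same \\diskA\FolderB\SubFolderC\ as the one-line version.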

If you want to expand the valid characters in the file path, edit the \w part of the regex. E.g., if you want ( and ) as valid characters as well:

searchtext = r'\\(\\[\w()]+)*\\'

Note that I added square brackets and put the extra characters inside them.
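To see the difference, here is a small sketch that compares the \w-only pattern with the widened one on the path from the follow-up comment. The names narrow and wider are hypothetical, chosen only for this example.

```python
import re

# Sample text taken from the follow-up comment on this answer.
text = r"The public disk is: \\disk(A)\Folder_B\SubFolder(C)\FileD"

narrow = re.compile(r'\\(\\\w+)*\\')      # word characters only
wider = re.compile(r'\\(\\[\w()]+)*\\')   # word characters plus ( and )

narrow_match = narrow.search(text).group()
wider_match = wider.search(text).group()

print(narrow_match)
print(wider_match)
```

With \w alone, the match stops at the first ( and only the leading \\ is returned. The widened class gets through to \\disk(A)\Folder_B\SubFolder(C)\.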

The square brackets define a character class: they mean "any one of these characters". Inside the brackets some characters do not need to be escaped, but others do. E.g., . does not need to be escaped, but [ and ] should be.
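A short sketch of that escaping behaviour, using throwaway names (dot_class, bracket_class) that are not part of the answer's code:

```python
import re

# Inside a character class, '.' is a literal dot, not "any character".
dot_class = re.compile(r'[.]')
dot_matches_dot = dot_class.fullmatch('.') is not None
dot_matches_a = dot_class.fullmatch('a') is not None

# ']' has to be escaped so it does not close the class early;
# '[' is escaped here as well to keep the pattern unambiguous.
bracket_class = re.compile(r'[\[\]]+')
brackets_match = bracket_class.fullmatch('[]') is not None

print(dot_matches_dot, dot_matches_a, brackets_match)
```

The dot class matches "." but not "a", and the escaped brackets match the literal string "[]".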

A semi-complete list would be:

searchtext = r'\\(\\[\w()\[\]\{\}:`!@#_\-]+)*\\'
Marc Wagner
  • Thank you so much Marc, that's very helpful. May I just ask one follow-up question: what if I also have symbols in the path, for example: text = r"The public disk is: \\disk(A)\Folder_B\SubFolder(C)\FileD". I know that (\\\w+) might not work anymore, so what should I change to capture other symbols such as (), _, etc.? Thanks again! – S.J Oct 03 '18 at 16:10
  • Hmmm. Filenames are a bit of a rabbit hole. See [characters in filenames](https://stackoverflow.com/questions/4814040/allowed-characters-in-filename). What I would suggest is whitelisting the characters that you want to allow, e.g. [a-zA-Z0-9\\(\\)!]. I've updated the answer to reflect that. PS: if you like my answer, please mark it as the correct answer. – Marc Wagner Oct 03 '18 at 17:57