How can you open a .txt file in Python and get the exact string as it is in the file?

I have a text file containing a regular expression, e.g.:

\\*(.*?)\\n

When I open and read the file in Python:

open('regEx.txt', 'r').read()

I'm getting:

\\\\*(.*?)\\\\n

Is there a way to open this file and get the string exactly as it is written in the file?

gefei
matetam
  • You do get it exactly as it is written. You don't want double backslashes in the file. They are only used in Python code if you write them as a non-raw string. [See this ongoing question for further info](http://stackoverflow.com/questions/12871066/what-exactly-is-a-raw-string-regex-and-how-can-you-use-it) – Martin Ender Dec 14 '12 at 15:30
  • Thanks a lot for help. I was looking for posts about loading regex from text files and how to tackle this, and I haven't found this one. I used this regex tester http://re-try.appspot.com/ to check if the string that I load from file will work, and it didn't work, so I looked for answers, but now I got it. Thanks a lot everybody. – matetam Dec 14 '12 at 15:39

3 Answers

You are most likely getting the data exactly as it is in the file (except maybe for line endings, but that's not the problem here). The problem is just with how that data is displayed. Are you working in the interactive shell? It echoes the repr of a string, with every backslash escaped, unless you use print explicitly.

Try print(open('regEx.txt').read()) (in Python 2: print open('regEx.txt', 'rb').read()), or even open('regEx2.txt', 'wb').write(open('regEx.txt', 'rb').read()). regEx2.txt will be byte-for-byte the same as regEx.txt.
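A minimal sketch of the difference, in Python 3 syntax. It assumes a scratch copy of regEx.txt written to a temporary directory (the directory is only for illustration) and a file that holds the 9 characters \*(.*?)\n:

```python
import os
import tempfile

# The 9 characters stored in the file: \*(.*?)\n
pattern_text = r"\*(.*?)\n"

# Write them to a scratch copy of regEx.txt (the temporary directory is an assumption).
path = os.path.join(tempfile.mkdtemp(), "regEx.txt")
with open(path, "w") as f:
    f.write(pattern_text)

with open(path) as f:
    data = f.read()

shown_by_shell = repr(data)  # what the interactive shell echoes for a bare expression
print(shown_by_shell)        # '\\*(.*?)\\n'  (backslashes escaped for display)
print(data)                  # \*(.*?)\n      (the actual file content)
```

Only the displayed form differs; the string in memory is identical to the file content.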

jd.

You are slightly mixing up a few string representations here. The actual regular expression (disregarding any language specific oddities) would simply be

\*(.*?)\n

(literally those 9 characters)

However, I suppose you've been using either Java or Python without raw strings. In that case, to create the above string in memory, your code has to double the backslashes:

"\\*(.*?)\\n"

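This is because, if you didn't double them, Python would treat each backslash as the start of an escape sequence when compiling the string literal (\n, for example, would become an actual newline character). With the doubled backslashes, the string is compiled to these 9 characters again: \*(.*?)\n. If you echo it in the interactive shell without print, you will get (as jd. said) a display including the double backslashes. But if you call len(string) it will say 9, not 11.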

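So you only want 9 characters. Then why write 11 in your file? If you write eleven, then upon display the backslashes will be escaped a second time, which gives the 15-character output in your question. But call len(input) on the string you read from the file. It will say 11, not 15 (plus one if the file ends with a line break).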
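The character counts above can be checked with a short sketch. The variable names are made up for illustration, and from_file stands in for the string read from the file in the question:

```python
# 11 characters typed between the quotes, 9 characters in memory.
in_code = "\\*(.*?)\\n"
print(len(in_code))    # 9
print(repr(in_code))   # '\\*(.*?)\\n'

# Stand-in for the question's file, which holds 11 characters: \\*(.*?)\\n
from_file = "\\\\*(.*?)\\\\n"
print(len(from_file))  # 11

# The shell display escapes every backslash again: 15 characters plus 2 quotes.
displayed = repr(from_file)
print(len(displayed) - 2)  # 15
```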

This is also why you should always use raw strings when defining regular expressions within your code. Then you never need any additional escapes (except for quotes):

r"\*(.*?)\n"

which will again leave you with 9 characters (because the backslashes are left untouched upon compilation of the string).
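As a rough illustration, the raw and the escaped literal produce the same pattern, while the 11-character version from the question's file is a different regex. The sample text is invented for this sketch:

```python
import re

raw = r"\*(.*?)\n"        # raw string: backslashes are kept as typed
escaped = "\\*(.*?)\\n"   # normal string: each doubled backslash becomes one
print(raw == escaped)     # True

sample = "header *captured text\nnext line"  # hypothetical input

# 9-character pattern: a literal '*', a lazy capture, then a newline.
match = re.compile(raw).search(sample)
print(match.group(1))     # captured text

# 11-character pattern, as stored in the question's file: optional backslashes,
# a lazy capture, then a literal backslash followed by the letter 'n'.
doubled = r"\\*(.*?)\\n"
print(re.compile(doubled).search(sample))  # None
```

The second search fails because the sample contains no backslash character, which is why the file should hold single backslashes.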

Martin Ender

I don't think that's a problem. Compare the following:

»»» regex # as read from the file
Out[9]: '\\*(.*?)\\n\n'

»»» r=r'\*(.*?)\n'

»»» r
Out[11]: '\\*(.*?)\\n'

Apart from the newline (which is my fault, I put it in the file) they're the same internally.
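If the file does end with a line break, one way to handle it (a sketch, not the only option) is to strip it before compiling. Here file_content stands in for the value read from the file in the session above:

```python
import re

# What read() returned in the session above: the pattern plus a trailing line break.
file_content = "\\*(.*?)\\n\n"

# rstrip("\n") removes only real newline characters at the end,
# not the two-character sequence backslash + 'n' inside the pattern.
pattern = file_content.rstrip("\n")
print(pattern == r"\*(.*?)\n")  # True

regex = re.compile(pattern)
print(regex.pattern)            # \*(.*?)\n
```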

m01