
I'm trying to learn Regex and I am testing out my patterns in the shell.

re.findall(r'\n\t\t\t\t\t(.*)\n\t\t\t\t\t\n\t\t\t\t\t</a>', str(x), re.MULTILINE)

The code is being run against: http://pastebin.com/yaCXPG3W

When I run the pattern in the shell, the output is correct. However, in my program, the list is empty.

I've tried doubling the backslashes on the tabs and newlines (e.g. \\t), but I still get nothing.

Ryan Shocker
  • You can change your regular expression to avoid duplicate information: `re.findall(r'\n\t\t\t\t\t(.*)\n\t\t\t\t\t\n\t\t\t\t\t', str(x), re.MULTILINE)` is the same as `re.findall(r'\n\t{5}(.*)\n\t{5}\n\t{5}', str(x), re.MULTILINE)` – Jeff Mandell Dec 09 '15 at 23:50
  • DON'T learn regex with HTML! And don't try to parse HTML with regex, you will kill a lot of kitties. Check http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Gilles Quénot Dec 09 '15 at 23:51
  • great, thanks for the tip. Unfortunately still no match. – Ryan Shocker Dec 09 '15 at 23:52
  • @GillesQuenot it's not interpreted as html if it's in plain string format though? – Ryan Shocker Dec 09 '15 at 23:53
  • Given that Python's IDLE is not mentioned in the question (or comments and answers), nor relevant to the question that I can see, why is it in the title and tags? – Terry Jan Reedy Dec 10 '15 at 18:59
  • If by 'shell' you mean IDLE's shell window, what happens on the regular console shell? (It would be a bug for them to act differently for re.findall.) Also, posted code should include the example data (shorter than 100s of chars) needed to run and demonstrate the issue (https://stackoverflow.com/help/mcve): `x = ` – Terry Jan Reedy Dec 10 '15 at 19:17
  • The pastebin page has been removed. Now the question is unclear. – Armali Sep 21 '17 at 06:44

1 Answer

This seems to work fine here. The \n and \t are literal characters in the pastebin you provided, so the backslashes need to be escaped.

import re
x = open('data.html').read()
m  = re.findall(
  r'\\n\\t\\t\\t\\t\\t(.*)\\n\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t</a>',
  x,
  re.MULTILINE)
print(m)

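And, as suggested by Jeff Mandell, you can shorten the regex to the following. The groups around \\t are non-capturing, (?:...), so that re.findall still returns only the text captured by (.*) rather than a tuple of every group:

\\n(?:\\t){5}(.*)\\n(?:\\t){5}\\n(?:\\t){5}</a>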
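
A minimal sketch of the shortened pattern in use. The sample string is made up for illustration and assumes the data holds literal backslash sequences, as in the pastebin. It also shows that plain capturing groups around \\t make re.findall return one tuple per match instead of just the captured text.

```python
import re

# Hypothetical sample: the raw string keeps \n and \t as literal
# two-character sequences (backslash + letter), not real control characters.
sample = r'<a href="#">\n\t\t\t\t\tCSE 412 Database Management\n\t\t\t\t\t\n\t\t\t\t\t</a>'

# Non-capturing groups: only (.*) contributes to the result.
non_capturing = r'\\n(?:\\t){5}(.*)\\n(?:\\t){5}\\n(?:\\t){5}</a>'
# Capturing groups around \\t: findall returns a tuple with every group.
capturing = r'\\n(\\t){5}(.*)\\n(\\t){5}\\n(\\t){5}</a>'

titles = re.findall(non_capturing, sample)
tuples = re.findall(capturing, sample)

print(titles)  # list of strings
print(tuples)  # list of 4-tuples
```

With the capturing version, the wanted text is the second element of each tuple.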

So, this means that if you have a file containing actual newlines, a regex r'\n' will match those, while r'\\n' matches only a literal backslash followed by the letter n.

v = '\n'
print(v) # prints a blank line
print(len(v)) # outputs 1
m = re.match(r'\n', v)
print(m) # match
m = re.match(r'\\n', v)
print(m) # no match

v = '\\n' # which would appear as \n in your text editor
print(v) # prints the two characters \ and n
print(len(v)) # outputs 2
m = re.match(r'\n', v)
print(m) # no match
m = re.match(r'\\n', v)
print(m) # match
Takis
  • Now it's returning the tabs and newlines in the list along with the string, rather than just the string, which is what my shell was doing. I.e., I need CSE 412 Database Management – Ryan Shocker Dec 09 '15 at 23:59
  • I think that's because your file actually contains the backslash character '\' followed by 'n', instead of actual newlines '\n'. At least the pastebin contains those characters and no newlines whatsoever. – Takis Dec 10 '15 at 00:01
  • so what if I want to capture the literal string in the pattern above for \n and \t? – Ryan Shocker Dec 10 '15 at 00:02
  • I added some example code to illustrate the difference between '\n' and '\\n'. – Takis Dec 10 '15 at 00:19
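
One possible source of the literal backslashes, offered only as an assumption because the question does not show how x is built: in Python 3, calling str() on a bytes object returns its repr, in which a real newline appears as the two characters \ and n. The sketch below uses hypothetical data to show that decoding the bytes instead keeps the real control characters, so the pattern from the comments matches.

```python
import re

# Hypothetical data: bytes containing real newlines and tabs.
x = b'<a>\n\t\t\t\t\tCSE 412 Database Management\n\t\t\t\t\t\n\t\t\t\t\t</a>'

# Pattern written for real newline and tab characters.
pattern = r'\n\t{5}(.*)\n\t{5}\n\t{5}</a>'

as_repr = str(x)             # repr of the bytes: escapes become literal text
as_text = x.decode('utf-8')  # real newlines and tabs are preserved

from_repr = re.findall(pattern, as_repr)
from_text = re.findall(pattern, as_text)

print(from_repr)  # no match on the repr string
print(from_text)  # match on the decoded text
```

If x is already a str holding literal backslashes, this does not apply and the doubled-backslash pattern from the answer is the one to use.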