I have this simple program that takes in a file from stdin and outputs only the host (for example, returning only HOST).

Except when I run cat sample.html | python program.py, right now it outputs href="google.com

I want it to remove the href=" part and output only google.com, but when I tried removing it from my regex, the result got even worse. Thoughts?

import re
import sys

s = sys.stdin.read()
lines=s.split('\n')

match = re.search(r'href=[\'"]?([^\'" >]+)', s) #here
if match:
    print match.group(0)

Thank you.

user3295674

1 Answer

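That is because you reference group(0), which returns the entire matched text including the href=" prefix. It should be group(1), which returns only what the first capturing group (the parenthesized part of the pattern) matched.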

if match:
    print match.group(1)
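
To make the difference concrete, here is a minimal sketch that runs the regex from the question against a made-up input string. The sample HTML is an assumption, since the asker's sample.html is not shown. The single-argument print(...) form is used so the snippet should run on both Python 2 and Python 3.

```python
import re

# Hypothetical sample input; the asker's sample.html is not shown.
html = '<a href="google.com">Search</a>'

match = re.search(r'href=[\'"]?([^\'" >]+)', html)
if match:
    whole = match.group(0)  # entire matched text, including the href=" prefix
    host = match.group(1)   # only the text captured by the first (...) group
    print(whole)
    print(host)
```

Here group(0) carries the href=" prefix along with the host, while group(1) holds the host alone.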
hwnd
  • Thank you! I noticed my regex doesn't seem to work if it's href='text' with single quotation marks, or with no quotation marks at all. How do I also add that to my regex? – user3295674 Oct 17 '14 at 01:36
  • Actually it does catch those cases. – hwnd Oct 17 '14 at 01:54
  • It's strange, because I tested it and it only returned the case with "" but not the ones with single quotation marks :/ Thanks though. – user3295674 Oct 17 '14 at 02:09
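
The pattern's optional quote class [\'"]? does accept double quotes, single quotes, and no quotes. One possible explanation for the behavior described in the last comment, which the thread does not confirm, is that re.search stops at the first match in the input, so later links are never reported. A minimal sketch using re.findall on a made-up input (the HTML and host names below are assumptions for illustration):

```python
import re

# Hypothetical input covering the three quoting styles from the comments.
html = '''<a href="google.com">a</a>
<a href='example.org'>b</a>
<a href=example.net>c</a>'''

pattern = re.compile(r'href=[\'"]?([^\'" >]+)')

# With exactly one capturing group, findall returns a list holding
# that group's text for every non-overlapping match in the input.
hosts = pattern.findall(html)
for host in hosts:
    print(host)
```

If the real input has one link per line, looping over the lines and calling re.search on each would be another way to pick up every link.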