
I am trying to match the whole word pid in each link, but somehow this code also matches links where pid only appears inside another word (for example icmpid):

    for a in self.soup.find_all(href=True):

        if 'pid' in a['href']:
            href = a['href']
            if not href or len(href) <= 1:
                continue
            elif 'javascript:' in href.lower():
                continue
            else:
                href = href.strip()
            if href[0] == '/':
                href = (domain_link + href).strip()
            elif href[:4] == 'http':
                href = href.strip()
            elif href[0] != '/' and href[:4] != 'http':
                href = ( domain_link + '/' + href ).strip()
            if '#' in href:
                indx = href.index('#')
                href = href[:indx].strip()
            if href in links:
                continue

            links.append(self.re_encode(href))
C8H10N4O2
  • Sorry I mean regular expression – Dhrubo Naskar Sep 05 '15 at 00:40
  • I'm not clear what's wrong here. Can you make it clear which part of the code you're having problems with, and specifically how it is behaving now and how you want it to behave? – larsks Sep 05 '15 at 01:53
  • I think this might be a duplicate of [test string for a substring](http://stackoverflow.com/questions/5473014/test-a-string-for-a-substring) – C8H10N4O2 Sep 05 '15 at 01:55
  • What is some sample input that doesn't work? How do you know that that sample input doesn't work? What would it output if it was working properly? – ArtOfWarfare Sep 05 '15 at 02:01
  • With if 'pid' it picks up every pid, and also sid and id, whereas I just want to match the whole word 'pid' in the search. – Dhrubo Naskar Sep 05 '15 at 02:06

1 Answer


If you mean that you want it to match a string like /pid/0002 but not /rapid.html, then you need to exclude word characters on either side. Something like:

>>> re.search(r'\Wpid\W', '/pid/0002')
<_sre.SRE_Match object; span=(0, 5), match='/pid/'>
>>> print(re.search(r'\Wpid\W', '/rapid/123'))
None

If 'pid' might be at the start or end of the string, you'll need to add extra conditions: check for either the start/end of the string or a non-word character:

>>> re.search(r'(^|\W)pid($|\W)', 'pid/123')
<_sre.SRE_Match object; span=(0, 4), match='pid/'>

See the docs for more information on the special characters.

You could use it like this:

pattern = re.compile(r'(^|\W)pid($|\W)')
if pattern.search(a['href']) is not None:
    ...
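
As a minimal sketch of how that pattern behaves on the kinds of links discussed here (the helper name `has_pid_word` and the sample hrefs are made up for illustration, not taken from the question's code):

```python
import re

# Same pattern as above: 'pid' bounded on each side by the start/end of the
# string or by a non-word character.
PID_WORD = re.compile(r'(^|\W)pid($|\W)')

def has_pid_word(href):
    """Hypothetical helper: True if 'pid' occurs in href as a whole word."""
    return PID_WORD.search(href) is not None

# Illustrative hrefs only; the last one contains 'pid' inside 'icmpid'.
samples = ['/pid/0002', 'pid/123', '/rapid/123', 'page?pid=42', 'page?sid=1&icmpid=7']
matches = [h for h in samples if has_pid_word(h)]
print(matches)
```

For inputs like these, the word-boundary form r'\bpid\b' should give the same results and is a bit shorter to write.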
z0r
  • Actually there are three situations: one is ?pid=, one is where it picks up sid=tyy,4mr&icmpid, and another has only id, like Widget etc. I just want to match the first one, with only ?pid – Dhrubo Naskar Sep 05 '15 at 03:36
  • Thanks, I used this expression and it worked: pattern = re.compile(r'(\?pid\=)') – Dhrubo Naskar Sep 05 '15 at 03:56
  • Cool. But in that case you might like to do proper URL parsing. Python has libraries to help: see [urllib.parse](https://docs.python.org/3/library/urllib.parse.html) (py3) and [urlparse](https://docs.python.org/2/library/urlparse.html) (py2). Makes it easy to handle other cases like where the `pid` argument isn't first (`&pid=...`). – z0r Sep 05 '15 at 12:17
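
A rough sketch of the URL-parsing approach suggested in the last comment, assuming Python 3 and that `pid` is passed as a query-string parameter. The helper name `has_pid_param` and the sample URLs are invented for illustration:

```python
from urllib.parse import urlparse, parse_qs

def has_pid_param(href):
    """Hypothetical helper: True if the query string has a parameter named exactly 'pid'."""
    query = urlparse(href).query
    # keep_blank_values=True so that an empty value ('?pid=') still counts.
    return 'pid' in parse_qs(query, keep_blank_values=True)

# Illustrative URLs only.
first = has_pid_param('/item?pid=42')                 # pid is the first argument
later = has_pid_param('/item?sid=1&pid=42')           # pid is not the first argument
other = has_pid_param('/item?sid=tyy,4mr&icmpid=7')   # only icmpid, no pid
print(first, later, other)
```

Because the check compares whole parameter names, it does not depend on whether `pid` follows `?` or `&`.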