I am parsing strings in an HTML page, and I can get multiple matches for a given pattern. I am trying to identify when a match comes after a specific word (or words) in the text so I can reject it.
For instance, say I am trying to extract a phone number from a page. There may be several, but I don't want any that come after "Copyright". Since the page can be constructed any way, and since the numbers I want will come before that word, I wanted to do something like this (I realize this is a very imperfect phone-number pattern; I am just using it as an example):
((Copyright|©)(*))?([0-9]\d{2,3}(-)[0-9]\d{2,3}(-)[0-9]\d{3,4})
I get that * is not the correct way to do wildcards, but the larger question is: how can I set this up so that, when capturing a phone number, I also capture "Copyright" if it comes anywhere before it? That would include:
Copyright 1972 Acme Corp 555-555-5555
and
Copyright held by Acme Corp
123 West Street
NY, NY 10019
Bla bla
questions call us at 555-555-5555
Ideally, what I want to capture is 'Copyright' and '555-555-5555', without the wildcard text between them. That way I can reject any phone number that is captured together with Copyright.
Somewhat off topic: I understand I could also do something like
(?P<Copyright>(Copyright|Trademark|©))(?P<Wildcard>(*))(?P<NUMBER>([0-9]\d{2,3}(-)[0-9]\d{2,3}(-)[0-9]\d{3,4}))
to make identification easier later on.
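For concreteness, here is a rough sketch of that named-group idea in Python. It assumes that a lazy `.*?` compiled with `re.DOTALL` is an acceptable stand-in for the `(*)` placeholder, so the gap can span line breaks. The variable names and sample strings are only for illustration, and the phone pattern is the same imperfect one as above with the redundant groups dropped.

```python
import re

# Assumption: `.*?` plus re.DOTALL replaces the `(*)` placeholder so that the
# text between the marker and the number may run across several lines.
PATTERN = re.compile(
    r"(?P<Copyright>Copyright|Trademark|©)"
    r"(?P<Wildcard>.*?)"
    r"(?P<NUMBER>[0-9]\d{2,3}-[0-9]\d{2,3}-[0-9]\d{3,4})",
    re.DOTALL,
)

# The two layouts from the examples above.
one_line = "Copyright 1972 Acme Corp 555-555-5555"
multi_line = (
    "Copyright held by Acme Corp\n"
    "123 West Street\n"
    "NY, NY 10019\n"
    "Bla bla\n"
    "questions call us at 555-555-5555"
)

for sample in (one_line, multi_line):
    m = PATTERN.search(sample)
    if m:
        # Only the marker and the number are of interest; Wildcard is ignored.
        print(m.group("Copyright"), m.group("NUMBER"))
```

As far as I can tell, this only pairs each marker with the first number that follows it, so a second number later on the page would not be flagged by this pattern alone.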
In any event, my goal is to find the easiest way to identify, after the fact, a phone number that occurs at any point in the HTML after the term "Copyright" so I can reject it.
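To make the "after the fact" part concrete, this is the kind of check I imagine, sketched in Python without a single combined regex: find the first marker, then sort each phone match by whether it starts before or after that position. The function name `split_numbers`, the case-insensitive marker list, and the simplified phone pattern are my own assumptions for the sketch, not a settled approach.

```python
import re

# Assumed marker list and a simplified version of the rough phone pattern.
MARKER = re.compile(r"Copyright|Trademark|©", re.IGNORECASE)
PHONE = re.compile(r"\d{3,4}-\d{3,4}-\d{4,5}")

def split_numbers(text):
    """Return (kept, rejected): numbers before / after the first marker."""
    first = MARKER.search(text)
    # With no marker present, every number counts as "before" it.
    cutoff = first.start() if first else len(text)
    kept, rejected = [], []
    for m in PHONE.finditer(text):
        (kept if m.start() < cutoff else rejected).append(m.group())
    return kept, rejected

page = (
    "Call sales at 111-222-3333 today.\n"
    "Copyright 1972 Acme Corp\n"
    "questions call us at 555-555-5555"
)
kept, rejected = split_numbers(page)
print(kept, rejected)
```

This treats everything after the first marker as rejected, which matches what I described, but it would also throw away a legitimate number that happens to appear below a copyright line near the top of a page.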