I am parsing strings in an HTML page, and I can get multiple matches for specific strings. I am trying to identify when a match comes after specific words in the text so I can reject it.

For instance, say I am trying to extract a phone number from a page. There may be a few, but I don't want the one that comes after "Copyright". Since the page can be constructed any way, and since the numbers I want will come before that word, I wanted to do something like the following (I realize this is a totally imperfect phone number pattern; I'm just using it as an example):

((Copyright|©)(*))?([0-9]\d{2,3}(-)[0-9]\d{2,3}(-)[0-9]\d{3,4})

I get that * is not the correct way to write a wildcard, but the larger question is: how can I set this up so that when capturing a phone number I also capture "Copyright" if it comes anywhere before it? That would include:

Copyright 1972 Acme Corp 555-555-5555

and

Copyright held by Acme Corp
123 West Street
NY, NY 10019
Bla bla
questions call us at 555-555-5555

Ideally, what I want to capture is 'Copyright' and '555-555-5555' without the wildcard text in between. That way I can reject any phone numbers I capture together with Copyright.

Somewhat off-topic: I understand I could also use named groups, something like

(?P<Copyright>(Copyright|Trademark|©))(?P<Wildcard>(*))(?P<NUMBER>([0-9]\d{2,3}(-)[0-9]\d{2,3}(-)[0-9]\d{3,4}))

to make identification easier later on.

In any event, my goal is the easiest way to identify, after the fact, a phone number that occurs at any point in the HTML after the term "Copyright" so I can reject it.

user3601725

1 Answer

This type of information extraction problem will be extremely difficult (if not impossible) to solve using only regular expressions.

If at all possible, you should pre-process your document before attempting to extract the phone numbers.

Some things to consider:

  • strip all HTML markup (i.e., remove all markup tags and replace each with a space)
  • convert & normalize all white-space

The resulting text could then be matched using a regular expression.
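The two pre-processing steps listed above could be sketched roughly as follows. This is a minimal illustration in Python, not a robust HTML parser: the helper name `html_to_text` and the tag pattern are assumptions made for this example, and a regex-based tag stripper will mishandle things like `<script>` contents or `>` inside attribute values. A real HTML parser would be safer for messy pages.

```python
import html
import re

# Crude tag matcher: anything between '<' and the next '>'.
TAG_RE = re.compile(r"<[^>]+>")


def html_to_text(markup: str) -> str:
    """Strip tags, decode entities, and collapse whitespace to single spaces."""
    # Replace each tag with a space so adjacent words do not get glued together.
    without_tags = TAG_RE.sub(" ", markup)
    # Decode entities such as &nbsp; (becomes U+00A0) and &amp;.
    decoded = html.unescape(without_tags)
    # str.split() with no arguments splits on any Unicode whitespace,
    # including U+00A0, tabs, and newlines.
    return " ".join(decoded.split())


sample = (
    '<p style="x">some <em>arbitrary</em> text&nbsp;here.</p>\n'
    "<div>\n   More   complex\thtml\n</div>"
)
print(html_to_text(sample))
# prints: some arbitrary text here. More complex html
```

Replacing tags with a space rather than an empty string matters here: otherwise the end of one paragraph would run directly into the start of the next.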


Here is an example of what this pre-processing step would do to a document:

 <html>
   <head>

   </head>

   <body>
      <p style="some css style etc">some <em>arbitrary</em> text&nbsp;here.</p>

      <div>
        <div>
             More complex                  html nested
             tags
        </div>
             with arbitrary white space including             tabs and 
             new lines.
      </div>


      <footer class="footer_class">
         <p style="css style">Copyright (c) Acme Corporation</p>
         <p style="css style">123 West Street<br/>NY, NY 10019<br/>Bla bla</p>
         <p style="some other css style">questions call us at 555-555-5555</p>
      </footer>
   </body>
 </html>

After pre-processing:

 some arbitrary text here. More complex html nested tags with arbitrary white
 space including tabs and new lines. Copyright (c) Acme Corporation 123 West 
 Street NY, NY 10019 Bla bla questions call us at 555-555-5555

Notice that this way you get a solid block of text. You may want to design some rules for breaking this single-line text block into multiple lines in order to make it easier to recognize when the information you're searching for is connected with certain keywords.

You could also look at the distance between a keyword and the information you're looking for and use that as a heuristic as well.
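The distance heuristic mentioned above might look something like the sketch below. It is only an illustration in Python: the phone pattern is deliberately simplistic (as in the question), and the names `PHONE_RE`, `KEYWORD_RE`, and `phones_with_keyword_distance` are made up for this example. How small a distance should count as "connected" is left to the caller.

```python
import re

# Simplistic NNN-NNN-NNNN pattern, for illustration only.
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
KEYWORD_RE = re.compile(r"Copyright|Trademark|©", re.IGNORECASE)


def phones_with_keyword_distance(text):
    """Pair each phone number with its distance from the nearest preceding keyword.

    The distance is the number of characters between the end of the keyword
    and the start of the phone number, or None if no keyword precedes it.
    """
    keyword_ends = [m.end() for m in KEYWORD_RE.finditer(text)]
    results = []
    for m in PHONE_RE.finditer(text):
        preceding = [end for end in keyword_ends if end <= m.start()]
        distance = m.start() - max(preceding) if preceding else None
        results.append((m.group(), distance))
    return results


text = "Call 555-111-2222 today. Copyright 1972 Acme Corp 555-555-5555"
for phone, distance in phones_with_keyword_distance(text):
    print(phone, distance)
```

A number with a distance of `None` has no keyword before it at all; a small distance suggests the number belongs to the copyright notice, while a very large one may be unrelated.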

Mike Dinescu
  • We do strip the HTML; however, we don't normalize, and that sounds like a good suggestion. I'm wondering if regex can be used to get the start position of a match? That way I could have a start position for each phone # and a start position for any copyright/trademark or other term I deem a break point for good info. Say I get phone#1 at 500 and phone#2 at 1500, and my simple (Copyright) regex is found at 1000: I can DQ phone2. Is that possible? – user3601725 May 13 '14 at 19:07
  • You haven't specified a language/platform. For instance, in .NET it's very easy to obtain the index where a match occurs relative to the original string. See here: http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.capture.index(v=vs.110).aspx – Mike Dinescu May 13 '14 at 19:14
  • Sorry, we are doing this with Regex PHP – user3601725 May 13 '14 at 19:43
  • See this: http://stackoverflow.com/questions/7465027/how-do-i-find-the-index-of-a-regex-match-in-a-string – Mike Dinescu May 14 '14 at 13:37
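
The position-based idea from the comments (record where each phone number starts, record where the first break term starts, and disqualify anything after it) can be sketched as below. The example is in Python for brevity and is only an assumption about how one might structure it; `BREAK_RE` and `phones_before_break` are hypothetical names. In PHP, `preg_match_all` with the `PREG_OFFSET_CAPTURE` flag reports match offsets in a comparable way.

```python
import re

# Simplistic NNN-NNN-NNNN pattern, for illustration only.
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
# Terms treated as the break point after which phone numbers are rejected.
BREAK_RE = re.compile(r"Copyright|Trademark|©")


def phones_before_break(text):
    """Return (start_offset, phone) for phone numbers found before the first break term."""
    first_break = BREAK_RE.search(text)
    # With no break term present, every phone number is kept.
    cutoff = first_break.start() if first_break else len(text)
    return [
        (m.start(), m.group())
        for m in PHONE_RE.finditer(text)
        if m.start() < cutoff
    ]


text = (
    "Sales: 555-123-4567. Copyright held by Acme Corp, "
    "questions call us at 555-555-5555"
)
print(phones_before_break(text))
# prints: [(7, '555-123-4567')]
```

This avoids trying to capture the keyword and the number in a single pattern: two simple searches plus an offset comparison do the rejecting.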