-2

I am a marketer, and I am writing a regex to scrape phone numbers with an existing tool. The regex below matches phone numbers in XXX-XXX-XXXX format perfectly. The problem is that the page contains phone numbers on more than six different lines, but I only want to scrape a number if its line contains <span itemprop="telephone">.

((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}

I have tried extracting the phone number from between two strings, but because of how that page's source code is structured, that approach does not work properly, so I want to try a different one.

My page source always looks like this:

<a href="/phone/xxx-xxx-xxxx"  data-toggle="tooltip" data-title="Mobile" >
            <span itemprop="telephone">xxx-xxx-xxxx</span>  

How can I achieve this? I would really appreciate your help. The regex must only match a number that comes right after the tag <span itemprop="telephone">.

Liam
nav
  • Say what now? *is the page having numbers in more than 6 different lines* ?? – Liam Nov 27 '18 at 16:30
  • Hey, I edited my question. The page has 6 phone numbers, but I want to scrape one only if its line contains – nav Nov 27 '18 at 16:32
  • You might want to look at using some other method than regex. How about using some xml/html parser like BeautifulSoup. – heap1 Nov 27 '18 at 16:50
  • I am not coding a tool; I am using an existing one. I don't think BeautifulSoup integrates with my tool. It has only a few options. Give me a sec, I will upload the image. https://i.postimg.cc/7ZcCwnY0/Screenshot-from-2018-11-27-22-25-15.png – nav Nov 27 '18 at 16:55
  • 2
    I strongly recommend you visit the [help center](https://stackoverflow.com/help/how-to-ask) so you can reformat your question in a manner that may be answerable. – theMayer Nov 27 '18 at 17:12
  • 1
    Didn't somebody once say something wise about parsing HTML with regexes? https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – pm100 Nov 27 '18 at 18:14

2 Answers

0

You can use the following regex:

@"(?<=<span itemprop=""telephone"">)((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}(?=</span>)"

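The regex starts with a lookbehind that requires the text '<span itemprop="telephone">' immediately before the match.

Then it uses the regex you already have to match a telephone number.

Finally, it uses a lookahead that requires '</span>' immediately after the number. Neither the lookbehind nor the lookahead is included in the matched text, so the match is just the phone number.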
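
As a rough illustration of how the lookbehind and lookahead behave, here is a minimal C# sketch. The sample HTML, the fax line, and the class name `TelephoneSpanDemo` are made up for this example and are only modeled on the page source in the question; the pattern is the one given above.

```csharp
using System;
using System.Text.RegularExpressions;

class TelephoneSpanDemo
{
    static void Main()
    {
        // Hypothetical sample, modeled on the page source shown in the question.
        string html =
            "<a href=\"/phone/123-456-7891\" data-toggle=\"tooltip\" data-title=\"Mobile\">\n" +
            "    <span itemprop=\"telephone\">123-456-7891</span>\n" +
            "<p>Fax: 555-000-1111</p>";

        // Lookbehind for the opening span, the original phone pattern,
        // then a lookahead for the closing span.
        string pattern =
            @"(?<=<span itemprop=""telephone"">)((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}(?=</span>)";

        // Only the number wrapped directly in the telephone span is matched;
        // the number in the href and the fax number are skipped.
        foreach (Match m in Regex.Matches(html, pattern))
        {
            Console.WriteLine(m.Value);
        }
    }
}
```

With this sample input, only the number inside the span is printed. Note that the pattern expects the number to sit directly between the tags, with no whitespace on either side, as in the page source in the question.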

Poul Bak
0

If I've understood correctly, you want to check with a regex whether the number in the <span itemprop="telephone"> is a valid phone number? If that is the case, the code below will print the number, for example 123-456-7891, if it matches your pattern.

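// Requires: using System; using System.Text.RegularExpressions;
// lineContainingNumber holds one line of the page source, e.g.
// <span itemprop="telephone">123-456-7891</span>
string[] segments = lineContainingNumber.Split('>');

foreach (string segment in segments)
{
    // Drop the "</span" left over after splitting on '>'.
    string candidate = segment.Replace("</span", "").Trim();

    // Anchor the pattern so a segment that merely contains a number
    // (such as the <a href="/phone/..."> part) is not printed whole.
    if (Regex.IsMatch(candidate, @"^\d{3}-\d{3}-\d{4}$"))
    {
        Console.WriteLine(candidate);
        break;
    }
}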
scottdf93