I've inherited code for a website, and this particular function is used to get a description from a website when a part number is given. I've never worked with regular expressions before, so this is a little out of my area, and I would like some help figuring out why it's not working properly.

Ideally, when a user of the site enters a part number in the appropriate field and presses a button, the function shows the user the standard part description, which is retrieved from a separate site. I inspected the element on the third-party site that the regex is trying to match, and it's coded as

<span id="ctl00_BodyContentPlaceHolder_lblDescription">Random Description</span>

public static string GetPartHpDescription(string url)
    {

        // Create a request to the url
        HttpWebRequest request = HttpWebRequest.Create(url) as HttpWebRequest;

        // If the request wasn't an HTTP request (like a file), ignore it
        if (request == null) return null;

        // Use the user's credentials
        request.UseDefaultCredentials = true;

        // Obtain a response from the server, if there was an error, return nothing
        HttpWebResponse response = null;
        try { response = request.GetResponse() as HttpWebResponse; }
        catch (WebException) { return null; }

        // Regular expression for an HTML title
        //  string regex = @"(?<=<body.*>)([Description : HP]*)(?=</body>)";
        string regex = "<span [^>]*id=(\"|')ctl00_BodyContentPlaceHolder_lblDescription(\"|')>(.*?)</span>";
        string regex1 = "<span [^>]*id=(\"|')ctl00_BodyContentPlaceHolder_gvGeneral_ctl02_lblpartdesc1(\"|')>(.*?)</span>";
        // Regex re = new Regex(@"<span\s+id=""ctl00_BodyContentPlaceHolder_lblDescription");
        // string regex =  @"<span\s+id=""ctl00_BodyContentPlaceHolder_lblDescription"
        // If the correct HTML header exists for HTML text, continue
        if (new List<string>(response.Headers.AllKeys).Contains("Content-Type"))
            if (response.Headers["Content-Type"].StartsWith("text/html"))
            {
                // Download the page
                WebClient web = new WebClient();
                web.UseDefaultCredentials = true;
                string page = web.DownloadString(url);
                // string title = Regex.Match(page, @"<span\s+id=""ctl00_BodyContentPlaceHolder_lblDescription"">.*?</span>", RegexOptions.IgnoreCase).Groups["Title"].Value;
                // Extract the title
                Regex ex = new Regex(regex, RegexOptions.IgnoreCase);
                String data = ex.Match(page).Value.Trim();
                if (data == "")
                {
                    Regex ex1 = new Regex(regex1, RegexOptions.IgnoreCase);
                    data = ex1.Match(page).Value.Trim();
                }
                return data;
                //   return title;
            }

        // Not a valid HTML page
        return null;
    }

What's currently happening is that if the part number is not already in the system database (SQL backend), the function doesn't get the part description properly.

  • In order to help you, we would need an example of the HTML you want to parse. As a general rule, always add the @ in front of your regex; I'm not seeing you account for the escape characters. The @ makes the string a verbatim literal, so you don't need to write, for example, "\\" – nalnpir Jun 10 '19 at 15:38
  • Don't parse HTML using regular expressions. Use an HTML parser. – Daniel Mann Jun 10 '19 at 15:42
  • Edited so you can see the code I'm parsing for in the third-party website. As I said, I inherited the code and am new to the concept of regex, so I'm just trying to figure out what it's doing and why it's not working properly – kirsten.madina Jun 10 '19 at 16:06
  • So ideally, once this function runs and matches the span to the correct span on the third-party website, the output should be the content of that span. In the example above, it would output the string "Random Description" – kirsten.madina Jun 10 '19 at 16:11
  • How does it work? Badly. [Regex isn't an HTML parser](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). Really. – spender Jun 10 '19 at 16:12

1 Answer

My guess is that we have some IDs whose textContent we wish to extract. If we have to do so with regular expressions, we would start with a simple expression and add more constraints only if necessary. In the expression below, the first capturing group holds the ID and the second holds the text between the tags:

<span id=["'](ctl00_.+|other_ids)["']>(.+?)<\/span>

Demo

using System;
using System.Text.RegularExpressions;

public class Example
{
    public static void Main()
    {
        string pattern = @"<span id=[""'](ctl00_.+|other_ids)[""']>(.+?)<\/span>";
        string input = @"<span id=""ctl00_BodyContentPlaceHolder_lblDescription"">Random Description</span>
<span id='ctl00_BodyContentPlaceHolder_lblDescription'>Random Description</span>
";
        RegexOptions options = RegexOptions.Multiline;

        foreach (Match m in Regex.Matches(input, pattern, options))
        {
            // Group 1 is the id; group 2 is the text between the tags.
            Console.WriteLine("'{0}' found at index {1}.", m.Groups[2].Value, m.Groups[2].Index);
        }
    }
}
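
One thing that may explain unexpected output from the function in the question: `Match.Value` returns the entire matched text, tags included, whereas the description itself sits in a capturing group. The following is only a minimal sketch of reading that group for the two span IDs from the question, tried in order. The names `DescriptionExtractor`, `ExtractDescription` and `SpanIds` are made up for illustration, and the sample `page` string is an assumption based on the single span shown in the question, not the real third-party page.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DescriptionExtractor
{
    // The two span IDs used in the question, tried in this order.
    private static readonly string[] SpanIds =
    {
        "ctl00_BodyContentPlaceHolder_lblDescription",
        "ctl00_BodyContentPlaceHolder_gvGeneral_ctl02_lblpartdesc1"
    };

    // Hypothetical helper: returns the trimmed text inside the first
    // non-empty matching span, or null if no span yields any text.
    public static string ExtractDescription(string page)
    {
        foreach (string id in SpanIds)
        {
            // Verbatim strings (@"...") keep the backslashes readable;
            // Regex.Escape protects against metacharacters in the ID.
            string pattern = @"<span\b[^>]*\sid=[""']" + Regex.Escape(id)
                           + @"[""'][^>]*>(?<text>.*?)</span>";

            // Singleline lets '.' match newlines inside the span.
            Match m = Regex.Match(page, pattern,
                RegexOptions.IgnoreCase | RegexOptions.Singleline);

            if (m.Success)
            {
                // Read the named group, not m.Value, to drop the tags.
                string text = m.Groups["text"].Value.Trim();
                if (text.Length > 0) return text;
            }
        }
        return null;
    }

    public static void Main()
    {
        // Assumed sample input, modeled on the span in the question.
        string page = "<div><span id=\"ctl00_BodyContentPlaceHolder_lblDescription\">Random Description</span></div>";
        Console.WriteLine(ExtractDescription(page));
    }
}
```

If the page can contain nested tags inside the span, or the markup changes, a pattern like this will break, which is why the comments on the question recommend an HTML parser instead.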

Emma