-2

This is the link to the HTML file I downloaded:

https://drive.google.com/open?id=1z7A9U0qZSVtLMQDbsVtPyZVz9Zm73-ZQ

Near the end of this file there is some data like this:

<div data-react-class="packs/v9/phone/containers/AreaCodeListing" data-react-props="{"areaCodes":[{"phone_prefix":"(202) 200","details":["Sprint"],"location":"Washington, DC","href":"/202-200"},{"phone_prefix":"(202) 201","details":["Verizon"],"location":"Washington, DC","href":"/202-201"},{"phone_prefix":"(202) 202","details":["General Service Carrier"],"location":"Washington, DC","href":"/202-202"},{"phone_prefix":"(202) 203","details":["T-Mobile"],"location":"Washington, DC","href":"/202-203"},{"phone_prefix":"(202) 204","details":["XO Communications"],"location":"Washington, DC","href":"/202-204"}

How can I extract the href values from this page? I think parsing the JSON would do the job, but I am stuck on how to get hold of that JSON in the first place.

Or is there a better way to get the href values from the HTML page I downloaded?

RAJA SAHAB

3 Answers

0

You can use a library such as HtmlAgilityPack to parse the HTML document, read the data-react-props attribute, and then parse its JSON as required.
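A rough sketch of that idea, not a tested solution: it assumes the HtmlAgilityPack NuGet package and System.Text.Json are available, and that the downloaded file stores the props with HTML-escaped quotes (`&quot;`). The class and method names are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Text.Json;
using HtmlAgilityPack;

public static class ReactPropsReader   // hypothetical helper name
{
    public static List<string> ExtractHrefs(string html)
    {
        var hrefs = new List<string>();

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches the XPath.
        var nodes = doc.DocumentNode.SelectNodes("//*[@data-react-props]");
        if (nodes == null) return hrefs;

        foreach (HtmlNode node in nodes)
        {
            // Assumption: the attribute holds HTML-escaped JSON, so decode entities first.
            string json = HtmlEntity.DeEntitize(node.GetAttributeValue("data-react-props", ""));
            if (string.IsNullOrWhiteSpace(json)) continue;

            using (JsonDocument parsed = JsonDocument.Parse(json))
            {
                JsonElement root = parsed.RootElement;
                // Only the component whose props contain "areaCodes" is of interest.
                if (root.ValueKind != JsonValueKind.Object ||
                    !root.TryGetProperty("areaCodes", out JsonElement areaCodes))
                {
                    continue;
                }

                foreach (JsonElement entry in areaCodes.EnumerateArray())
                {
                    hrefs.Add(entry.GetProperty("href").GetString());
                }
            }
        }
        return hrefs;
    }
}
```

Calling `ReactPropsReader.ExtractHrefs(File.ReadAllText(path))` would then return the href strings. `JsonDocument.Parse` throws if an attribute does not contain valid JSON, so real code may want a try/catch around it.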

Shinva
0

The file you downloaded is not valid HTML, because it is a React view. Therefore, tools like HTMLAgilityPack will not be very helpful for you.

You could try to see if you have any luck using headless browsers such as WebKit.NET. You might be able to interject somewhere in the process of building the final HTML.

Apart from that, the only option I can think of is to use regular expressions to get the data you want from the file. For example:

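// Requires: using System; using System.Text.RegularExpressions;
// Capture the JSON between data-react-props=" and the closing quote of the attribute.
var regex = new Regex(@"(?<=data-react-props="")\{.*?\}(?="")", RegexOptions.Singleline);
// Matches (rather than Match) returns every occurrence, not only the first one.
foreach (Match match in regex.Matches(pageContents))
{
    Console.WriteLine(match.Value);
}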
Bart van der Drift
  • It's nice, but instead of matching on data-react-props=, why don't we match on **areaCodes** or **href**? Can you make a regex for that, please? – RAJA SAHAB Apr 25 '19 at 09:57
  • The one above shows only the first successful match. – RAJA SAHAB Apr 25 '19 at 09:57
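Matching on href directly, as the comments ask, could look roughly like the sketch below. It uses only the base class library. The helper name is hypothetical, and it assumes every `"href"` key in the page is one you want, since the pattern cannot tell the areaCodes entries apart from other JSON on the page. It also does not undo JSON escapes such as `\/`, so parsing the JSON properly is still the more robust route.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class HrefRegexSketch   // hypothetical helper name
{
    public static List<string> ExtractHrefs(string pageContents)
    {
        // Decode entities first so the pattern works whether the file
        // contains &quot; or literal double quotes.
        string decoded = WebUtility.HtmlDecode(pageContents);

        var hrefs = new List<string>();

        // Matches "href":"<value>" and captures the value in the named group.
        var regex = new Regex(@"""href""\s*:\s*""(?<href>[^""]*)""");

        // Matches returns all occurrences, not only the first.
        foreach (Match match in regex.Matches(decoded))
        {
            hrefs.Add(match.Groups["href"].Value);
        }
        return hrefs;
    }
}
```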
0

First Approach

If you want the whole AreaCode objects, try the first approach.

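// Requires a reference to System.Web.Extensions and:
// using System.Collections.Generic;
// using System.Text.RegularExpressions;
// using System.Web.Script.Serialization;
public List<AreaCode> GetAllAreaCodes(string htmlString)
{
    List<AreaCode> areaCodes = new List<AreaCode>();

    // rgxAttr finds each data-react-props="{...}" attribute; rgxValue isolates the quoted JSON.
    Regex rgxAttr = new Regex(@"data-react-props=""{(.*?)}""");
    Regex rgxValue = new Regex(@"""{(.*?)}""");

    var attrResult = rgxAttr.Matches(htmlString);
    List<string> attrValues = new List<string>();

    foreach (Match match in attrResult)
    {
        var val = rgxValue.Match(match.Value);
        // Strip the surrounding quotes so only the JSON object remains.
        attrValues.Add(val.Value.Replace("\"{", "{").Replace("}\"", "}"));
    }

    JavaScriptSerializer js = new JavaScriptSerializer();
    foreach (var item in attrValues)
    {
        var dn = js.Deserialize<dynamic>(item) as Dictionary<string, object>;

        if (dn != null && dn.ContainsKey("areaCodes"))
        {
            // Drop the outer braces and the "areaCodes": key, leaving only the JSON array.
            var arrayJson = item.Remove(item.Length - 1, 1).Remove(0, 1).Replace(@"""areaCodes"":", "");
            areaCodes = js.Deserialize<List<AreaCode>>(arrayJson);
        }
    }
    return areaCodes;
}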
public class AreaCode
{
    public string phone_prefix { get; set; }
    public string location { get; set; }
    public string href { get; set; }
    public string[] details { get; set; }

}

Second Approach

If you need only the href values, use the second approach.

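// Requires a reference to System.Web.Extensions and:
// using System; using System.Collections.Generic;
// using System.Text.RegularExpressions;
// using System.Web.Script.Serialization;
public List<string> GetAllHref(string htmlString)
{
    List<string> hrefList = new List<string>();

    // rgxAttr finds each data-react-props="{...}" attribute; rgxValue isolates the quoted JSON.
    Regex rgxAttr = new Regex(@"data-react-props=""{(.*?)}""");
    Regex rgxValue = new Regex(@"""{(.*?)}""");

    var attrResult = rgxAttr.Matches(htmlString);
    List<string> attrValues = new List<string>();

    foreach (Match match in attrResult)
    {
        var val = rgxValue.Match(match.Value);
        // Strip the surrounding quotes so only the JSON object remains.
        attrValues.Add(val.Value.Replace("\"{", "{").Replace("}\"", "}"));
    }

    dynamic ob = null;
    JavaScriptSerializer js = new JavaScriptSerializer();
    foreach (var item in attrValues)
    {
        var dn = js.Deserialize<dynamic>(item) as Dictionary<string, object>;
        if (dn != null && dn.ContainsKey("areaCodes"))
            ob = dn["areaCodes"];
    }

    // ob stays null when no areaCodes entry was found, so guard before iterating.
    var s = ob as Array;
    if (s != null)
    {
        foreach (Dictionary<string, object> item in s)
            hrefList.Add(item["href"].ToString());
    }

    return hrefList;
}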
Umair Anwaar