For example, I have this link:

<a href="/Forums2008/forumPage.aspx?forumId=393" title="מזג האוויר">מזג האוויר</a>

What I want to parse is first the forumId=393, then only the 393, then the link, and last the name. In this case the name is in Hebrew (it means "the weather"), so it's a bit messy. The name should be:

מזג האוויר

I can use either IndexOf and Substring or HtmlAgilityPack. I prefer HtmlAgilityPack to get all the values; maybe regex is also a good way.

In the end I should get these four strings:

  1. forumId=393

  2. 393

  3. מזג האוויר

  4. /Forums2008/forumPage.aspx?forumId=393

What I have tried so far is not even close to my goal. One attempt uses HtmlAgilityPack; the other downloads the HTML, saves it as a file, and then parses it. I thought of using IndexOf and Substring, but I'm not sure how:

HtmlAgilityPack.HtmlDocument doc =
                        Qhw.Load("http://www.tapuz.co.il/forums/forumslistnew.asp");
parseIds(doc);

WebClient webclient = new WebClient();
webclient.DownloadFile("http://www.tapuz.co.il/forums/forumslistnew.asp",
                        @"c:\testhtml\mainforums.html");
webclient.Dispose();

string[] lines = File.ReadAllLines(@"c:\testhtml\mainforums.html");
foreach(string line in lines)
{
    if (line.Contains("href") && line.Contains("forumId=") && !wholeids.Contains(line))
    {
        string tg1 = "href=\"";
        wholeids.Add(line);
    }
}
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{   
    idsnumbers.Add(link.InnerText);
}

idsnumbers is a global List variable.

Jamiec
Daniel van wolf
  • You should mention what about your pattern is constant. e.g. is there always `forumPage.aspx?forumId=`? – Rotem Oct 06 '15 at 09:29
  • @Rotem Yes, each line like this has the same format; the only things that change are the name and the ID number. – Daniel van wolf Oct 06 '15 at 09:30
  • Good suggestion for HtmlAgilityPack. Don't use Regex to parse HTML. I should include the obligatory link: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Russ Clarke Oct 06 '15 at 09:31
  • Do you have access to the `HttpContext`? – Jamie Rees Oct 06 '15 at 09:31
  • @Russ Every time that link is posted I read it all over again :) – Rotem Oct 06 '15 at 09:34
  • @TimBiegeleisen aren't you [a day late](https://www.google.co.uk/search?q=Simchas+Torah&ie=utf-8&oe=utf-8&gws_rd=cr&ei=QZYTVr7MEMXxasnnrtAJ) – Jamiec Oct 06 '15 at 09:37
  • @Jamiec Here in Singapore (in the diaspora) today is actually Simchat-Torah. We are now in the final hours of this chag. Holidays in Israel are only 1 day :-) – Tim Biegeleisen Oct 06 '15 at 09:38
  • What on earth does this question have to do with "escape quotes from a string"? – kjbartel Oct 06 '15 at 09:39

1 Answer

I would use HtmlAgilityPack, Uri.TryCreate and ParseQueryString:

string html = @"<a href=""/Forums2008/forumPage.aspx?forumId=393"" title=""מזג האוויר"">מזג האוויר</a>";
var htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.LoadHtml(html);
// FirstOrDefault needs "using System.Linq;"
var anchor = htmlDoc.DocumentNode.Descendants("a").FirstOrDefault();
if(anchor != null)
{
    string name = anchor.InnerText;                 // the link text
    string href = anchor.Attributes["href"].Value;  // the relative link
    Uri uri;
    int queryStart = href.IndexOf('?');             // -1 if the link has no query string
    if(queryStart >= 0 && Uri.TryCreate(href, UriKind.RelativeOrAbsolute, out uri))
    {
        // Uri.Query is not available on a relative uri, so cut the query out by hand
        var queryString = href.Substring(queryStart).Split('#')[0];
        var queryKeyValues = System.Web.HttpUtility.ParseQueryString(queryString);
        string forumId = queryKeyValues["forumId"]; // "393"
    }
}

You could also create a fake absolute URI to avoid the string methods (Uri.Query throws an InvalidOperationException on a relative Uri, so it needs an absolute one):

if(Uri.TryCreate(href, UriKind.RelativeOrAbsolute, out uri))
{
    if(!uri.IsAbsoluteUri)
        uri = new Uri(new Uri("http://www.google.com/"), uri);
    var queryKeyValues = System.Web.HttpUtility.ParseQueryString(uri.Query);
    string forumId = queryKeyValues["forumId"];
}
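
To get the four strings listed in the question in one place, a minimal sketch might look like the following. It reuses the Substring plus ParseQueryString approach from the first snippet. The helper name SplitForumLink and its array return shape are hypothetical (they are not part of HtmlAgilityPack or .NET), and it assumes the href and the anchor text have already been read from the HtmlNode as shown above.

```csharp
using System;
using System.Web; // HttpUtility.ParseQueryString

// Hypothetical helper (not a library API): given an anchor's href and inner text,
// returns { "forumId=<id>", "<id>", name, href }, or null when there is no forumId.
string[] SplitForumLink(string href, string name)
{
    int queryStart = href.IndexOf('?');
    if (queryStart < 0)
        return null; // the link has no query string at all

    // Drop the leading '?' and any '#fragment' before parsing.
    string query = href.Substring(queryStart + 1).Split('#')[0];
    string forumId = HttpUtility.ParseQueryString(query)["forumId"];
    if (forumId == null)
        return null; // a query string without a forumId key

    return new[] { "forumId=" + forumId, forumId, name, href };
}

// Usage with the link from the question.
string[] parts = SplitForumLink("/Forums2008/forumPage.aspx?forumId=393", "מזג האוויר");
Console.WriteLine(string.Join(" | ", parts));
```

This is written with top-level statements (C# 9 or later); on an older compiler the method would go inside a class. Returning null for links without a forumId lets a loop over all anchors simply skip the ones that are not forum links.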
Tim Schmelter