C# regex output string is not according to my expectations

Question

I am using the following code to retrieve the shipping cost from amazon.com by scanning the HTML source of a product page, but the output is not what I want. Below is the code.

string regexString = "<span class=\"plusShippingText\">(.*)</span>";
Match match = Regex.Match(htmlSource, regexString);
string shipCost = match.Groups[1].Value;
MessageBox.Show(shipCost);

It shows a message box with the returned shipping cost as

&nbsp;+&nbsp;Free Shipping</span>

But actually I need the following clean text only.

Free Shipping

Please help me to solve this problem.

@NitinSawant http://www.amazon.com/Genuine-Aprilaire-213-Replacement-Filter/dp/B0039QL0JC When I retrieve the HTML source of the product page, the HTML tags for the title, price, and shipping cost change. You will see that in the actual source the HTML tags are different, while in the source retrieved with C# the tags are the same as the ones in my regular expression. I don't know why the tags are changing. – Muhammad Sohail, Apr 26 '14 at 06:10

score 1 · Answer 1 · answered Apr 26 '14 at 05:52

You just need to decode the HTML entities and strip the plus sign. You can use the following:

shipCost = System.Net.WebUtility.HtmlDecode(shipCost).Replace("+","").Trim();

Nitin Sawant

This solution almost works, but there is still a small problem. It shows the output as Free Shipping</span>; the closing span tag at the end should not appear. – Muhammad Sohail Apr 26 '14 at 06:05

Hi Muhammad, change the regex pattern to `(.*)<\\/span>` – Nitin Sawant Apr 26 '14 at 06:15

Ulugbek Umirov · Accepted Answer · 2014-04-26T07:25:11.477

Can you try the following code (though it's a bad idea to use regex for HTML parsing):

string shipCostHtml = Regex.Match(htmlSource, "(?<=<span class=\"plusShippingText\">).*?(?=</span>)").Value;
string shipCost = System.Net.WebUtility.HtmlDecode(shipCostHtml);
shipCost = shipCost.Trim(' ', '+', '\xa0');

Your regex is almost fine; you just need to replace the greedy (.*) with the lazy (.*?), so that the match stops at the first </span> instead of the last one on the line.
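To see the difference between the greedy and the lazy pattern, here is a minimal sketch. The sample HTML string (including the outer `price` span) and the `ShippingTextDemo` class with its method names are made up for illustration; the real Amazon markup will differ.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class ShippingTextDemo
{
    // Hypothetical one-line sample, not actual Amazon markup.
    public const string SampleHtml =
        "<span class=\"price\"><span class=\"plusShippingText\">&nbsp;+&nbsp;Free Shipping</span></span>";

    // Greedy: (.*) consumes as much as possible, so the match ends at the LAST </span> on the line.
    public static string Greedy(string html) =>
        Regex.Match(html, "<span class=\"plusShippingText\">(.*)</span>").Groups[1].Value;

    // Lazy: (.*?) consumes as little as possible, so the match ends at the FIRST </span>.
    public static string Lazy(string html) =>
        Regex.Match(html, "<span class=\"plusShippingText\">(.*?)</span>").Groups[1].Value;

    // Decode entities (&nbsp; becomes U+00A0), then trim spaces, '+' and no-break spaces.
    public static string Clean(string fragment) =>
        WebUtility.HtmlDecode(fragment).Trim(' ', '+', '\u00A0');
}
```

With this sample, `Greedy(SampleHtml)` returns the fragment with a trailing `</span>`, as in the question, while `Clean(Lazy(SampleHtml))` gives `Free Shipping`.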

Here is how it could be solved using HtmlAgilityPack:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlSource);
string shipCostHtml = doc.DocumentNode.SelectSingleNode("//span[@class='plusShippingText']").InnerText;
string shipCost = System.Net.WebUtility.HtmlDecode(shipCostHtml);
shipCost = shipCost.Trim(' ', '+', '\xa0');

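Now you're protected against the case where Amazon decides to add other attributes to the <span>, e.g. <span style='{color:blue}' class='plusShippingText'>. Note that the XPath above compares the whole class attribute, so it would not match a span carrying an extra class such as <span class='plusShippingText newClass'>; a contains(@class, 'plusShippingText') predicate would cover that case.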
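If the span might carry more than one class, a slightly more tolerant XPath can be used. This is only a sketch: it assumes the HtmlAgilityPack NuGet package is referenced, and the `ShippingNodeDemo` class and `ExtractShipping` method are hypothetical names chosen for the example.

```csharp
using System.Net;
using HtmlAgilityPack; // NuGet package, not part of the .NET base library

public static class ShippingNodeDemo
{
    // Returns the cleaned shipping text, or null when no matching <span> is found.
    public static string ExtractShipping(string htmlSource)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(htmlSource);

        // Pad @class with spaces so 'plusShippingText' is matched as a whole class name,
        // even when other classes or attributes are present.
        HtmlNode node = doc.DocumentNode.SelectSingleNode(
            "//span[contains(concat(' ', normalize-space(@class), ' '), ' plusShippingText ')]");
        if (node == null) return null;

        // InnerText still contains entities such as &nbsp;, so decode before trimming.
        return WebUtility.HtmlDecode(node.InnerText).Trim(' ', '+', '\u00A0');
    }
}
```

The null check matters because `SelectSingleNode` returns null when nothing matches, which would otherwise surface as a `NullReferenceException` on `InnerText`.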

edited Apr 26 '14 at 07:25

answered Apr 26 '14 at 05:53

Ulugbek Umirov

What is HttpUtility? When I paste your code into VS.NET, it shows me a compiler error: The name HttpUtility does not exist in the current context. – Muhammad Sohail Apr 26 '14 at 06:12

@MuhammadSohail You can replace it with `System.Net.WebUtility`. `HttpUtility` is in `System.Web.dll` assembly. – Ulugbek Umirov Apr 26 '14 at 06:12
Wow, it works now. Why is it a bad idea to use regex for HTML parsing? Is there a better alternative? – Muhammad Sohail Apr 26 '14 at 06:24
@MuhammadSohail You can use [HtmlAgilityPack](http://htmlagilitypack.codeplex.com/) for HTML parsing (easy node selection, inner text extraction, etc.). For the reasons why regex is bad for HTML parsing I'd better give a link: http://stackoverflow.com/a/590789/1803777 – Ulugbek Umirov Apr 26 '14 at 06:31

@MuhammadSohail I added a solution using `HtmlAgilityPack`. – Ulugbek Umirov Apr 26 '14 at 07:25
That's great. I appreciate your efforts. I am currently having a look at the HtmlAgilityPack library. – Muhammad Sohail Apr 26 '14 at 07:32
I created an HtmlDocument object and tried to call the LoadHtml method, but in my case IntelliSense doesn't show such a method. Is there something wrong in this code? – Muhammad Sohail Apr 26 '14 at 10:41
@MuhammadSohail How did you add the reference to HtmlAgilityPack (which version)? I use the one from NuGet. – Ulugbek Umirov Apr 26 '14 at 12:45
That problem is now solved; I had not actually added the reference. After adding it, I am facing another error: "URI formats are not supported". The URI is: http://www.amazon.com/Genuine-Aprilaire-213-Replacement-Filter/dp/B0039QL0JC/ref=sr_1_2/190-3168241-3508731?s=home-garden&ie=UTF8&qid=1392972705&sr=1-2 Can you help me solve this issue? – Muhammad Sohail Apr 26 '14 at 13:09
Ahh, I was actually making a mistake. I was using Load instead of LoadHtml. But I am confused: in the statement doc.LoadHtml(htmlSource); what is htmlSource? Is it the HTML code, or the URL we are getting the HTML source from? – Muhammad Sohail Apr 26 '14 at 13:18
OK, and is there a way to quickly get the source code of any URL using HtmlAgilityPack? – Muhammad Sohail Apr 26 '14 at 16:39
@MuhammadSohail Yes, you can use `HtmlDocument doc = new HtmlWeb().Load(url);` for it. – Ulugbek Umirov Apr 27 '14 at 10:06
I think it will load the full HTML. Is there any way to retrieve only the tags we need? Retrieving the full HTML is too slow for large web pages. I am retrieving the HTML source of Amazon links, and the source of these pages is large, about 12,000 to 20,000 lines, so getting the full HTML takes some time. I need a method that gets only the required tags, not the full HTML. Is it possible? – Muhammad Sohail Apr 27 '14 at 10:23
@MuhammadSohail I'm afraid it is not possible. You can use `HttpClient` to take advantage of gzip/deflate compression, but it will still fetch the full HTML. – Ulugbek Umirov Apr 27 '14 at 10:27
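
A rough sketch of the `HttpClient` suggestion from the last comment, for reference. The `PageFetcher` class and its method names are hypothetical; the only thing it relies on is the `AutomaticDecompression` property of `HttpClientHandler`. The whole page is still downloaded, only in compressed form when the server supports it.

```csharp
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class PageFetcher
{
    // Handler that requests gzip/deflate and transparently decompresses the response.
    public static HttpClientHandler CreateHandler() =>
        new HttpClientHandler
        {
            AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
        };

    // Downloads the full page as a string; the result can be passed to doc.LoadHtml(...).
    public static async Task<string> DownloadHtmlAsync(string url)
    {
        using (var client = new HttpClient(CreateHandler()))
        {
            return await client.GetStringAsync(url);
        }
    }
}
```

The returned string is the HTML itself, which is what `LoadHtml` expects; passing a URL to `Load` is what led to the "URI formats are not supported" error mentioned above.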
