I'm trying to pull pricing information from a website. I use a regex to find the first "$", then use Substring to grab the 7 characters starting there, e.g. $42,945. I then remove all the text up to and including that price and repeat the process in a for loop for each of the other $ amounts on the page.

The problem is that after I trim the string to move on to the next $, the original string comes back.

Here is the code that I am using:

WebClient client = new WebClient();
string allcontent = client.DownloadString("example.com");

string body = allcontent.Substring(140480,200000);

Regex rx = new Regex("[$]");

var numberCount = rx.Matches(body).Count;

string price = String.Empty;
string price2 = String.Empty;
int match = Int32.MaxValue;
string trimmed = String.Empty;

List<string> priceList = new List<string>();

for (int i = 0; i < numberCount; i++)
{

    trimmed = body;

    match = rx.Match(trimmed).Index;

    price = trimmed.Substring(match, 7);

    priceList.Add(price);

    trimmed = trimmed.Remove(0, match + 7);

}

Console.WriteLine(priceList[0]);
Console.WriteLine(priceList[1]);

Console.ReadKey();

Suppose the string is: ABC $300 DEF $600 GHI $120 JKF $980

After the first loop iteration I should get $300, on the second $600, and so on. Instead I am getting $300 every time.

How can I fix this to get the correct values?

Joel Coehoorn
javierma14
  • In your `for` loop, you assign `trimmed = body` at the top of each loop. Also, why not just use `Regex.Matches` and pull the prices with the proper regex? – NetMage Jun 25 '19 at 18:58
  • Well you could simply split the text and parse each beginning of the string like [this](https://dotnetfiddle.net/gCKxaB) – Franck Jun 25 '19 at 18:59
  • `foreach(Match match in Regex.Matches("ABC $300 DEF $600 GHI $120 JKF $980", @"\$\d+")) Console.WriteLine(match.Value);` – JohnyL Jun 25 '19 at 19:06
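
NetMage's point about `trimmed = body` can be sketched as a minimal fix to the original loop: initialise the working string once, before the loop, so that each trim carries over to the next pass. This is only an illustration under some assumptions: `LoopFix`, `ExtractPrices` and the `width` parameter are hypothetical names, and `IndexOf('$')` stands in for the `[$]` regex from the question.

```csharp
using System;
using System.Collections.Generic;

public static class LoopFix
{
    // Hypothetical helper: returns each "$" plus the characters after it,
    // up to `width` characters in total.
    public static List<string> ExtractPrices(string body, int width)
    {
        var prices = new List<string>();
        string remaining = body; // assigned once, outside the loop

        while (true)
        {
            int index = remaining.IndexOf('$');
            if (index < 0) break; // no more "$" left

            // Clamp the length so a "$" near the end of the text cannot throw.
            int length = Math.Min(width, remaining.Length - index);
            prices.Add(remaining.Substring(index, length));

            // Drop everything up to and including this price.
            remaining = remaining.Substring(index + length);
        }

        return prices;
    }

    public static void Main()
    {
        var prices = ExtractPrices("ABC $300 DEF $600 GHI $120 JKF $980", 4);
        Console.WriteLine(string.Join(", ", prices)); // $300, $600, $120, $980
    }
}
```

A fixed width only works when every price has the same length, which is one reason the comments suggest matching the whole price with a regex.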

1 Answer

The existing code resets `trimmed = body` at the top of every iteration, so each pass searches the original string again and finds the first `$`. Rather than patching the loop, we can simplify it by relying on the data the regex match already provides:

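using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

var priceList = new List<string>();
// "$", an optional thousands group such as "42,", then exactly three digits
var rx = new Regex("[$]([0-9]{1,2},)?[0-9]{3}");

using (var client = new WebClient())
{
    string body = client.DownloadString("example.com").Substring(140480,200000);
    var matches = rx.Matches(body);

    // Type the loop variable as Match so that .Value is available
    foreach (Match match in matches)
    {
        priceList.Add(match.Value);
    }
}

Console.WriteLine(priceList[0]);
Console.WriteLine(priceList[1]);

Console.ReadKey(true);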

The expression is modified so that it matches the whole price value rather than just the `$`. You can see it work here:

https://dotnetfiddle.net/1DltMh
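
For a version that runs without a network call, here is a small sketch that applies the same pattern to the sample string from the question. `PriceDemo` and `FindPrices` are hypothetical names used only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PriceDemo
{
    // Same pattern as in the answer: "$", an optional "NN," thousands group,
    // then exactly three digits.
    private static readonly Regex PricePattern =
        new Regex("[$]([0-9]{1,2},)?[0-9]{3}");

    // Hypothetical helper: returns the text of every match, in order.
    public static List<string> FindPrices(string text)
    {
        var prices = new List<string>();
        foreach (Match m in PricePattern.Matches(text))
        {
            prices.Add(m.Value);
        }
        return prices;
    }

    public static void Main()
    {
        string sample = "ABC $300 DEF $600 GHI $120 JKF $980";
        Console.WriteLine(string.Join(", ", FindPrices(sample)));    // $300, $600, $120, $980
        Console.WriteLine(string.Join(", ", FindPrices("$42,945"))); // $42,945
    }
}
```

Note that this pattern expects exactly three digits after the optional group, so a price such as $50 is not matched at all and $1,234,567 is matched only as $1,234. Adjust the expression if the site shows prices outside that range.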

But even this code seems fragile. Using regex to parse HTML is generally frowned upon. Any small changes to the format of the web site you're scraping can seriously break this. You might do much better looking at a real HTML parser.

Joel Coehoorn