My code is:

Uri htmltoextract = new Uri("http://test");
string f;

// Download the page; using disposes the client even if the download throws.
using (WebClient client = new WebClient())
{
    f = client.DownloadString(htmltoextract);
}
string pattern = @"(\d{12})";
Regex ex = new Regex(pattern, RegexOptions.Singleline);

MatchCollection matches = ex.Matches(f);
IFormatProvider provider = CultureInfo.InvariantCulture;
List<DateTime> dateTime = new List<DateTime>();
foreach (Match match in matches)
{
    dateTime.Add(DateTime.ParseExact(match.Value, "yyyyMMddHHmm", provider));
}

Somewhere inside f there is this line:

var imageUrls = ["/image2.ashx?region=is&time=201501102145&ir=false","/image2.ashx?region=is&time=201501102130&ir=false","/image2.ashx?region=is&time=201501102115&ir=false","/image2.ashx?region=is&time=201501102100&ir=false","/image2.ashx?region=is&time=201501102045&ir=false","/image2.ashx?region=is&time=201501102030&ir=false","/image2.ashx?region=is&time=201501102015&ir=false","/image2.ashx?region=is&time=201501102000&ir=false","/image2.ashx?region=is&time=201501101945&ir=false"];

I need to extract data from it into two lists:

The first list is dateTime (a List<DateTime>).

The second list should be a List<string> holding the URLs:

/image2.ashx?region=is&time=201501102145&ir=false
/image2.ashx?region=is&time=201501102130&ir=false
/image2.ashx?region=is&time=201501102115&ir=false
/image2.ashx?region=is&time=201501102100&ir=false
/image2.ashx?region=is&time=201501102045&ir=false
/image2.ashx?region=is&time=201501102030&ir=false
/image2.ashx?region=is&time=201501102015&ir=false
/image2.ashx?region=is&time=201501102000&ir=false
/image2.ashx?region=is&time=201501101945&ir=false

I have two problems:

How do I extract both the times and the URL strings such as /image2.ashx?region=is&time=201501101945&ir=false?

How do I extract them only from the part that starts with var imageUrls = [" and ends with "];?

f contains these times in other places as well, so the extraction has to be restricted to that part.

Aram Tchekrekjian

3 Answers

Steps:

  • Use HtmlAgilityPack to load the HTML and extract the particular <script> tag.
  • The script block can likely be matched with just a regex, or even a basic String.IndexOf, to cut out the list of URLs.
  • With just the list of URLs, use String.Split to split it into the individual URLs.
  • For each URL, use the Uri class to extract the Uri.Query portion and then get the individual query parameters from it.

Note: If the JavaScript is too complicated, you may need a real JavaScript parser.
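
The steps above can be sketched roughly as follows. This is only a minimal sketch: it skips HtmlAgilityPack and runs String.IndexOf directly on the downloaded string, it assumes the URLs themselves contain no commas, and the names ImageUrlSketch, ExtractUrls and GetTime are made up for illustration. Because the URLs in the page are relative, GetTime takes a base address (the site's address, supplied by the caller) before reading Uri.Query.

```csharp
using System;
using System.Collections.Generic;
using System.Web;

public static class ImageUrlSketch
{
    // Hypothetical helper: returns the URLs listed in "var imageUrls = [...];".
    public static List<string> ExtractUrls(string html)
    {
        const string startMarker = "var imageUrls = [";
        var urls = new List<string>();

        // Locate the start of the array literal.
        int start = html.IndexOf(startMarker, StringComparison.Ordinal);
        if (start < 0) return urls;
        start += startMarker.Length;

        // Locate the closing "];" that follows it.
        int end = html.IndexOf("];", start, StringComparison.Ordinal);
        if (end < 0) return urls;

        // Split the comma-separated items and strip the surrounding quotes.
        string list = html.Substring(start, end - start);
        foreach (string item in list.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries))
        {
            urls.Add(item.Trim().Trim('"'));
        }
        return urls;
    }

    // Hypothetical helper: reads the "time" query parameter of a relative URL.
    // Uri.Query is only available on absolute URIs, hence the base address.
    public static string GetTime(Uri baseUri, string relativeUrl)
    {
        var absolute = new Uri(baseUri, relativeUrl);
        return HttpUtility.ParseQueryString(absolute.Query).Get("time");
    }
}
```

Each value returned by GetTime can then be passed to DateTime.ParseExact with the "yyyyMMddHHmm" format already used in the question.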

Community
Alexei Levenkov

This is what I would do. It's not a purist solution, but it works.

(The steps below assume that the data format stays exactly the same for a reasonable period of time. If the people managing the source change it, this code will break!)

  1. Do a regex match for the pattern "var imageUrls = [ ... ];" and move the match to a separate string.
  2. From this string, chop off the leading var imageUrls = [ and the trailing ];.

Path A:

  1. Using string.Split(), create an array of the URL strings.
  2. Run a for-loop through the strings and create a Uri from each one (e.g. myUri). The URLs are relative, so combine each with the site's base address first, because Uri.Query is not available on a relative Uri. You can then get the value of each query string variable through HttpUtility.ParseQueryString(myUri.Query).Get("time");

Path B:

  1. Also chop off the "/image2.ashx?region=is&time=" and "&ir=false" from each URL, leaving only the time value you actually want.
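
A rough sketch of steps 1 and 2 combined with a variant of Path B, assuming the page keeps the exact format shown in the question. Instead of chopping the prefix and suffix off each URL, this version captures the 12-digit time value with a regex group, which amounts to the same thing. The class name ImageUrlRegexSketch and both patterns are illustrative assumptions, not tested against the real page.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class ImageUrlRegexSketch
{
    // The whole "var imageUrls = [ ... ];" statement; group 1 is the list body.
    private static readonly Regex Block =
        new Regex(@"var\s+imageUrls\s*=\s*\[(.*?)\];", RegexOptions.Singleline);

    // One quoted URL; group 1 is the URL, group 2 is the 12-digit time value.
    private static readonly Regex Item =
        new Regex("\"([^\"]*?[?&]time=(\\d{12})[^\"]*)\"");

    // Fills both lists from the imageUrls array only; other 12-digit
    // numbers elsewhere in the page are ignored.
    public static void Extract(string html, List<string> urls, List<DateTime> times)
    {
        Match block = Block.Match(html);
        if (!block.Success) return;

        foreach (Match m in Item.Matches(block.Groups[1].Value))
        {
            urls.Add(m.Groups[1].Value);
            times.Add(DateTime.ParseExact(m.Groups[2].Value, "yyyyMMddHHmm",
                                          CultureInfo.InvariantCulture));
        }
    }
}
```

Called as ImageUrlRegexSketch.Extract(f, urlList, dateTime), it would fill the string list and the DateTime list from the question in one pass.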

To match the time use:

(?<=/image2\.ashx\?region=is&time=)\d+(?=&ir=false)

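
A short sketch of how a lookaround pattern of this kind could be used from C#. The pattern below is a simplified assumption: a lookbehind for &time= and a lookahead for &ir=false around exactly 12 digits. TimeRegexSketch and ExtractTimes are made-up names. Since the regex matches anywhere in its input, it should be run only on the var imageUrls = [...] part if the rest of the page must be ignored.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class TimeRegexSketch
{
    // Returns every 12-digit value that sits between "&time=" and "&ir=false".
    public static List<DateTime> ExtractTimes(string text)
    {
        // The lookbehind and lookahead assert the context without consuming it,
        // so m.Value is just the digits.
        var regex = new Regex(@"(?<=&time=)\d{12}(?=&ir=false)");
        var result = new List<DateTime>();

        foreach (Match m in regex.Matches(text))
        {
            result.Add(DateTime.ParseExact(m.Value, "yyyyMMddHHmm",
                                           CultureInfo.InvariantCulture));
        }
        return result;
    }
}
```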

Andie2302