0

I'm new to regex and don't use it much. I'm trying to get every competition_id in this code:

    <script type="text/javascript" charset="utf-8">

      (function() {
          var block = new MatchesBlock('page_team_1_block_team_matches_summary_7', 'block_team_matches_summary', {"page":0,"bookmaker_urls":[],"block_service_id":"team_summary_block_teammatchessummary","team_id":1242,"competition_id":0,"filter":"all","new_design":false});
          block.registerForCallbacks();
          block.addCallbackObserver('page_team_1_block_team_matches_summary_7_1_1', 'changeCompetition', {"competition_id":0});
          block.addCallbackObserver('page_team_1_block_team_matches_summary_7_1_2', 'changeCompetition', {"competition_id":13});
          block.addCallbackObserver('page_team_1_block_team_matches_summary_7_1_3', 'changeCompetition', {"competition_id":135});
          block.addCallbackObserver('page_team_1_block_team_matches_summary_7_1_4', 'changeCompetition', {"competition_id":171});
          block.addCallbackObserver('page_team_1_block_team_matches_summary_7_1_5', 'changeCompetition', {"competition_id":1148});
          block.addCallbackObserver('page_team_1_block_team_matches_summary_7_1_6', 'changeCompetition', {"competition_id":732});
          block.addCallbackObserver('page_team_1_block_team_matches_summary_7_1_7', 'changeCompetition', {"competition_id":10});
          block.addCallbackObserver('page_team_1_block_team_matches_summary_7_2_1', 'filterMatches', {"filter":"all"});
          block.addCallbackObserver('page_team_1_block_team_matches_summary_7_2_2', 'filterMatches', {"filter":"home"});
          block.addCallbackObserver('page_team_1_block_team_matches_summary_7_2_3', 'filterMatches', {"filter":"away"});

          block.setAttribute('colspan_left', 4);
          block.setAttribute('colspan_right', 3); 
          block.setAttribute('has_previous_page', true);
          block.setAttribute('has_next_page', true);
          TimestampFormatter.format('page_team_1_block_team_matches_summary_7');
                })();
    </script>

the link is this: view-source:http://it.soccerway.com/teams/italy/juventus-fc/1242/

what I did so far is this:

var c = System.Text.RegularExpressions.Regex.Match(data, "'block_team_matches_summary', (\\{.*?\\})\\);\\n", 
            System.Text.RegularExpressions.RegexOptions.Singleline).Groups[1].Value;

This regex should return all the available blocks, but it returns only the first one:

{"page":0,"bookmaker_urls":[],"block_service_id":"team_summary_block_teammatchessummary","team_id":1244,"competition_id":0,"filter":"all","new_design":false}

I need to get all the blocks. What can I do?

jode
  • What do you mean by "all blocks"? There is just one `{...}` substring after `'block_team_matches_summary',` and you extract it successfully. – Wiktor Stribiżew Oct 27 '17 at 13:28
  • @WiktorStribiżew if you see the content of the function it contains a list of blocks, for example: `page_team_1_block_team_matches_summary_7`, `page_team_1_block_team_matches_summary_7_1_1`, `page_team_1_block_team_matches_summary_7_1_2`, each block contains a different id – jode Oct 27 '17 at 13:32

4 Answers

2

First, I would recommend Expresso. It is free, but you do have to register it. I find it very valuable both for working with regular expressions and for learning to use them better. One warning: parsing strings with regex (especially web page content, which is what yours appears to be) is brittle. A regular expression that works right now can easily start failing if the text changes even slightly.

With that out of the way, on to your specific question. I am assuming the result set you are looking for is 0,13,135,171,1148,732,10 (all of the competition ids).

We will start by opening Expresso and pasting all of the text into the Sample Text area (bottom left); make sure you are on the Test Mode tab. Now we will write a regular expression to find the text we are looking for. Put competition_id": into the Regular Expression area (top left). If you expand the tree in the Regex Analyzer (top right), it will show each of the individual characters. This indicates that all of these characters will be matched literally.

If you click the Run Match button, you will see a list of matches displayed in the Search Results (bottom right). Perfect, it found all 8 places where that text appears. You can click on each of the Search Results and Expresso will highlight the corresponding area in the Sample Text.

Now we need to expand this to match the number after it. If you click on the Design Mode tab, you will see an area at the bottom that lists all of the regular expression symbols and what they mean. I find this area helpful for looking up various matching patterns. Change the regular expression to competition_id":\d+

The \d means match any digit (0-9) and the + means match one or more of them. If you click Run Match, you will see that each of the matches now contains the text competition_id":<number>

If we use this regular expression in C#, it will return all of the matched text, and in this case we just want the number. One final change: make the regex competition_id":(\d+). Note that the Regex Analyzer now indicates that we have a numbered capture group. All this means is that the portion of the match inside the parentheses will be put into its own group that we can easily extract. Click Run Match, and you will notice that the matches still contain the full text match, but now there is a subgroup under each that contains the individual value.

Now back in C#, I will assume that you have that large script block in a string variable named data.

string data = ...;
//Get all of the matches (Regex and MatchCollection live in System.Text.RegularExpressions)
MatchCollection matches = Regex.Matches(data, "competition_id\":(\\d+)");
foreach (Match match in matches)
{
    //This is the group number that we saw in Expresso. Groups[0] is the full match.
    Group group = match.Groups[1];
    //Get the value out of the group. We can use int.Parse since we know it will only contain digits
    int competition_id = int.Parse(group.Value);
    //TODO: Do something with competition_id
}

Note: The pattern has to be escaped when it is written as a C# string literal: the quote becomes \" and the backslash becomes \\.
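As a small sketch of that escaping note, the same pattern can also be written as a verbatim string literal, where backslashes stay as they are and a quote is doubled. The sample input below is a made-up stand-in for the page source, and the LINQ projection is just one way to collect the captured numbers.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical sample input; in the question this would be the page source held in `data`.
string data = "x({\"competition_id\":0}); x({\"competition_id\":13}); x({\"competition_id\":135});";

// Regular string literal: the quote is written as \" and the backslash as \\.
string escaped = "competition_id\":(\\d+)";
// Verbatim string literal: backslashes are literal and a quote is written as "".
string verbatim = @"competition_id"":(\d+)";

// Both literals produce the same pattern text.
bool samePattern = escaped == verbatim;

// Collect the captured number (group 1) from every match.
int[] ids = Regex.Matches(data, verbatim)
    .Cast<Match>()
    .Select(m => int.Parse(m.Groups[1].Value))
    .ToArray();

Console.WriteLine(samePattern);            // prints "True"
Console.WriteLine(string.Join(",", ids));  // prints "0,13,135"
```

Which form to use is mostly a matter of taste; the verbatim form tends to be easier to compare against what you typed into Expresso.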

This is only a small introduction to regular expressions. I would encourage you to play around with Expresso and poke around online. There are lots of good resources out there. The most important thing to do is practice.

Kevin B
  • Hi, thanks for the amazing answer. Unfortunately I get matches.Count = 0; the content of data is this: https://pastebin.com/AS7vMBYq – jode Oct 28 '17 at 08:06
  • Sorry, I had a small typo when I moved the regular expression pattern into the C#. I accidentally put a space between the `\d` and the `+`. This causes the regular expression to fail because the `+` only affects the thing directly before it (which was the space and not the `\d` as intended). The updated code should work. – Kevin B Oct 28 '17 at 17:32
  • Another option is using [HTMLAgilityPack](http://html-agility-pack.net) and ([Jurassic library](https://github.com/paulbartrum/jurassic) or XPath) to get what you want. – wp78de Nov 26 '17 at 05:55
-1

"competition_id":(\d+) should do the trick

Check here

Dawnkeeper
-1

The following will remove the extra characters such as ", { and }:

            string input = "{\"page\":0,\"bookmaker_urls\":[],\"block_service_id\":\"team_summary_block_teammatchessummary\",\"team_id\":1244,\"competition_id\":0,\"filter\":\"all\",\"new_design\":false}";
            // Strings are immutable, so the result of Replace must be assigned back
            input = input.Replace("{", "");
            input = input.Replace("}", "");
            string[] groups = input.Split(new char[] { ',' });

            // Named groups: the quoted key before the colon, and everything after it
            string pattern = "\"(?'name'[^:]+)\":(?'value'.*)";
            foreach (string group in groups)
            {
                string data = group.Replace("\\","");
                Match regData = Regex.Match(data, pattern);
                Console.WriteLine("name : '{0}', value : '{1}'", regData.Groups["name"].Value, regData.Groups["value"].Value.Replace("\"",""));
            }
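Since each block is a JSON object, a possible alternative to splitting on commas is to hand the block to a JSON parser, which copes with commas or colons inside values. The sketch below assumes a runtime where System.Text.Json is available (.NET Core 3.0 or later, or the NuGet package); it is not part of the code above, just an illustration on the same input string.

```csharp
using System;
using System.Text.Json;

// Same sample block as above.
string input = "{\"page\":0,\"bookmaker_urls\":[],\"block_service_id\":\"team_summary_block_teammatchessummary\",\"team_id\":1244,\"competition_id\":0,\"filter\":\"all\",\"new_design\":false}";

// Parse the whole block as one JSON document.
using JsonDocument doc = JsonDocument.Parse(input);

// Walk the top-level properties, printing each name with its raw JSON value.
foreach (JsonProperty property in doc.RootElement.EnumerateObject())
{
    Console.WriteLine("name : '{0}', value : '{1}'", property.Name, property.Value.GetRawText());
}

// Individual values can be read with their proper type.
int competitionId = doc.RootElement.GetProperty("competition_id").GetInt32();
int teamId = doc.RootElement.GetProperty("team_id").GetInt32();
Console.WriteLine("competition_id = {0}, team_id = {1}", competitionId, teamId);
```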
jdweng
  • but this doesn't get all blocks, or I missed something? – jode Oct 27 '17 at 13:44
  • The for loop gets all the blocks. The string split creates an array groups[]. This posting is more complicated than it looks, with the curly brackets, double quotes, and some values being strings and others numbers. My code works under all the conditions. – jdweng Oct 27 '17 at 15:43
-2

You need to use Regex.Matches instead of Regex.Match.

Please refer to the following discussion: How to find multiple occurrences with regex groups?

To do this in JavaScript, you can refer to the following discussion: Javascript regex get an array of all matches, not just the first occurance
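As a rough sketch of what that looks like in C# for this question: the input below is a shortened, made-up stand-in for the page source, and the pattern is an assumption on my part (it takes every brace-delimited object that sits directly before ");" ), not the asker's original pattern.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Shortened, hypothetical stand-in for the page source from the question.
string data = string.Join("\n",
    "(function() {",
    "  var block = new MatchesBlock('a', 'block_team_matches_summary', {\"team_id\":1242,\"competition_id\":0});",
    "  block.addCallbackObserver('a_1_1', 'changeCompetition', {\"competition_id\":0});",
    "  block.addCallbackObserver('a_1_2', 'changeCompetition', {\"competition_id\":13});",
    "  block.addCallbackObserver('a_2_1', 'filterMatches', {\"filter\":\"all\"});",
    "})();");

// Match every brace-delimited object that directly precedes ");".
// [^{}]* stops a match from starting at the outer "function() {" brace.
string pattern = @"(\{[^{}]*\})\);";

// Regex.Matches returns every match; Regex.Match stops after the first one.
string[] blocks = Regex.Matches(data, pattern)
    .Cast<Match>()
    .Select(m => m.Groups[1].Value)
    .ToArray();

foreach (string block in blocks)
{
    Console.WriteLine(block);
}
```

On this sample the loop prints four objects, one per line. The pattern would need adjusting if a block ever contained nested braces.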

Happy