Ruby: extract a string between two strings

Question

I have a string like this:

str1='"{\"@Network\":{\"command\":\"Connect\",\"data\":
{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"'

I want to extract the somename string from the above string. The values of xx:xx:xx:xx:xx:xx, somename and 123456789 can change, but the syntax will remain the same as above.

I saw similar posts on this site but don't know how to use regex in the above case. Any ideas on how to extract the above string?

As you are new to development, I would recommend you to use data structures to put your information into. I mean, arrays, hashes, etc. Then, retrieving it, should not be that hard ;) — Ramon Araujo, Aug 14 '13 at 08:31
As Ramon Araujo says, try to use the commonly used data structures available in Ruby, such as Array or Hash. If your string is data that you retrieve from the outside world, then look at JSON for parsing it into a Hash — epsilon, Aug 14 '13 at 08:35
Thanks for the suggestion Ramon. I am getting the above data from some other application and trying to use the values in my code. — neo, Aug 14 '13 at 08:37
I tried JSON to parse the above string but it's giving an invalid JSON syntax error. — neo, Aug 14 '13 at 08:39
possible duplicate of [Parsing a JSON string in ruby](http://stackoverflow.com/questions/5410682/parsing-a-json-string-in-ruby) — Jonas Elfström, Aug 14 '13 at 09:16

score 5 · Answer 1 · answered Aug 14 '13 at 08:56

Parse the string as JSON and get the values that way.

require 'json'
str = "{\"@Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"
json = JSON.parse(str.strip)
name = json["@Network"]["data"]["Name"]
pwd = json["@Network"]["data"]["Pwd"]
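
If JSON.parse still rejects the input, one possible reason (an assumption, since the question does not show the failing call) is that the string was kept exactly as written in the question, in single quotes, so it still contains the outer quotes and the literal backslashes. A minimal sketch of cleaning that raw form first; the names raw and cleaned are only illustrative:

```ruby
require 'json'

# Raw form as in the question: single quotes keep every backslash literally.
raw = '"{\"@Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"'

cleaned = raw[1..-4]               # drop the opening " and the trailing \0"
cleaned = cleaned.gsub('\"', '"')  # turn every backslash-quote into a plain quote

data = JSON.parse(cleaned)
name = data["@Network"]["data"]["Name"]
pwd  = data["@Network"]["data"]["Pwd"]
puts name
puts pwd
```

The slice bounds assume the wrapper is always one quote in front and backslash, zero, quote at the end; if the sender changes that wrapper, the bounds need to change too.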

Jonas Elfström

Thanks Jonas for the quick reply. – neo Aug 14 '13 at 09:06

score 1 · Accepted Answer · answered Aug 14 '13 at 09:31

Since you don't know regex, let's leave them out for now and try manual parsing, which is a bit easier to understand.

Your original input, without the outer apostrophes and the variable name, is:

"{\"@Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"

You say that you need to get the 'somename' value and that the 'grammar will not change'. Cool!

First, look at what delimits that value: it is wrapped in quotes, with a colon to the left and a comma to the right. However, the same layout is also used near the command and near the pwd, so colon-quote-data-quote-comma is not enough. Looking further to the sides, there's a \"Name\". It occurs nowhere else in the input data. This is just great! It means we can quickly find the whereabouts of the data just by searching for the \"Name\" text:

inputdata = .....
estposition = inputdata.index('\"Name\"')
raise "well-known marker was not found in the input" unless estposition

Now we know:

where the part starts
that after the "Name" text there's always a colon, a quote, and then the interesting data
that there's always a quote after the interesting data

Let's find all of them:

colonquote = inputdata.index(':\"', estposition)
datastart = colonquote+3
lastquote = inputdata.index('\"', datastart)
dataend = lastquote-1

index returns the start position of the match, so the two calls return the position of the : and the position of the \ respectively. Since we want the text between them, we must add or subtract a few positions: add 3 to move past the :\" at the beginning, and subtract 1 to step back from the \" at the end.

Then, fetch the data from between them:

value = inputdata[datastart..dataend]

And that's it.
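
Put together, the marker search can be wrapped in a small helper. This is only a sketch: the method name extract_value and its key parameter are made up for illustration, and it assumes the value is a quoted string that follows the key directly, as in the sample input.

```ruby
# Hypothetical helper: find \"key\" and return the text between the following
# :\" and the next \" (only meaningful for keys whose value is a quoted string).
def extract_value(input, key)
  marker = "\\\"#{key}\\\""              # backslash, quote, key, backslash, quote
  estposition = input.index(marker)
  return nil unless estposition

  colonquote = input.index(':\"', estposition)
  return nil unless colonquote

  datastart = colonquote + 3             # skip the three characters :\"
  lastquote = input.index('\"', datastart)
  return nil unless lastquote

  input[datastart...lastquote]           # exclusive range stops before the \
end

inputdata = '"{\"@Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"'

puts extract_value(inputdata, "Name")
puts extract_value(inputdata, "Pwd")
```

Returning nil instead of raising is a design choice of this sketch; raising, as in the snippet above, is just as reasonable when a missing marker means the input is broken.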

Now, step back and look at the input data once again. You say that grammar is always the same. The various bits are obviously separated by colons and commas. Let's try using it directly:

parts = inputdata.split(/[:,]/)
=> ["\"{\\\"@Network\\\"",
 "{\\\"command\\\"",
 "\\\"Connect\\\"",
 "\\\"data\\\"",
 "\n{\\\"Id\\\"",
 "\\\"xx",
 "xx",
 "xx",
 "xx",
 "xx",
 "xx\\\"",
 "\\\"Name\\\"",
 "\\\"somename\\\"",
 "\\\"Pwd\\\"",
 "\\\"123456789\\\"}}}\\0\""]

Please ignore the regex for now. Just assume it says "a colon or a comma". Now, in parts you will get all the, well, parts that were produced by cutting the inputdata into pieces at every colon or comma.

If the layout never changes and is always the same, then your interesting data will always be the 13th element (index 12, since arrays are zero-based):

almostvalue = parts[12]
=> "\\\"somename\\\""

Now, just strip the spurious characters. Since the grammar is constant, there are 2 chars to cut from each side:

value = almostvalue[2..-3]
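
A slightly less position-dependent variation, as a sketch: instead of hard-coding index 12, look up the \"Name\" piece and take the piece right after it. It still assumes the value itself contains no colon or comma; the name key_at is only illustrative.

```ruby
inputdata = '"{\"@Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"'

parts = inputdata.split(/[:,]/)        # cut at every colon or comma

key_at = parts.index('\"Name\"')       # position of the piece holding the key
raise "Name key not found" unless key_at

almostvalue = parts[key_at + 1]        # the piece right after the key
value = almostvalue[2..-3]             # cut \" from the front and \" from the back
puts value
```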

OK, another way. Since regex has already shown up, let's try with them. We know:

data is prefixed with \"Name\", then a colon and a backslash-quote
data consists of some text without quotes inside (well, at least I guess so)
data ends with a backslash-quote

The parts in regex syntax (where a literal backslash has to be written as \\) would be, respectively:

\\"Name\\":\\"
[^\"]*
\\"

together:

inputdata =~ /\\"Name\\":\\"([^\"]*)\\"/
value = $1

Note that I surrounded the interesting part with (), hence after a successful match that part is available in the $1 special variable.
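
The same match can be written without the $1 global, as a sketch. String#[] accepts a regex plus a capture index, and String#match supports named groups. The character class here also excludes the backslash, which is a small tightening and not part of the pattern above; the group name name is only illustrative.

```ruby
inputdata = '"{\"@Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"'

# String#[] with a regex and a capture index returns that group, or nil on no match.
value = inputdata[/\\"Name\\":\\"([^"\\]*)\\"/, 1]

# The same pattern with a named group.
m = inputdata.match(/\\"Name\\":\\"(?<name>[^"\\]*)\\"/)
named = m && m[:name]

puts value
puts named
```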

Yet another way:

If you look at the grammar carefully, it really resembles a set of nested hashes:

"
{ \"@Network\" :
  { \"command\" : \"Connect\",
    \"data\" :
    { \"Id\" : \"xx:xx:xx:xx:xx:xx\",
      \"Name\" : \"somename\",
      \"Pwd\" : \"123456789\"
    }
  }
}
\0"

If we wrote something similar as Ruby hashes:

{ "@Network" =>
  { "command" => "Connect",
    "data" =>
    { "Id" => "xx:xx:xx:xx:xx:xx",
      "Name" => "somename",
      "Pwd" => "123456789"
    }
  }
}

What's the difference? The colons were replaced with =>, and the backslashes before the quotes are gone. Oh, and the opening and closing " are gone, and so is the \0 at the end. Let's play:

tmp = inputdata[1..-4]   # remove the opening " and the closing \0"
tmp.gsub!('\"', '"')     # replace every \" with just "

Now, what about the colons? We cannot just replace : with =>, because that would damage the internal colons of the xx:xx:xx:xx:xx:xx part. But look: all the other colons always have a quote before them!

tmp.gsub!('":', '"=>')     # replace every quote-colon with quote-arrow

Now our tmp is:

{"@Network"=>{"command"=>"Connect","data"=>{"Id"=>"xx:xx:xx:xx:xx:xx","Name"=>"somename","Pwd"=>"123456789"}}}

formatted a little:

{ "@Network"=>
   { "command"=>"Connect",
     "data"=>
     {  "Id"=>"xx:xx:xx:xx:xx:xx","Name"=>"somename","Pwd"=>"123456789" }
   }
}

So, it looks just like a Ruby hash. Let's try 'destringizing' it:

packeddata = eval(tmp)
value = packeddata['@Network']['data']['Name']
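
The whole conversion, end to end, as a runnable sketch. It assumes the input starts with a single " and ends with \0" (hence the 1..-4 slice), and it uses the non-mutating gsub so the original string stays untouched. Keep in mind that eval executes whatever Ruby code the string contains, so this is only reasonable for input you fully trust.

```ruby
inputdata = '"{\"@Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"'

tmp = inputdata[1..-4]        # drop the opening " and the closing \0"
tmp = tmp.gsub('\"', '"')     # every backslash-quote becomes a plain quote
tmp = tmp.gsub('":', '"=>')   # every quote-colon becomes quote-arrow

# tmp is now Ruby hash-literal source text; eval turns it into a real Hash.
packeddata = eval(tmp)
value = packeddata['@Network']['data']['Name']
puts value
```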

Done.

Well, this has grown a bit and Jonas was obviously faster, so I'll leave the JSON part to him since he wrote it already ;) The data looked so similar to a Ruby hash because it was obviously formatted as JSON, which is a hash-like structure too.

Using the proper format-reading tools is usually the best idea. But mind that the JSON library, when asked to read the data, will read all of it; only then can you ask it "what was inside at the key xx/yy/zz", just like I showed you with the read-it-as-a-Hash attempt. Sometimes, when your program is very short on time, you cannot afford to read it all. Then scanning with regex or scanning manually for "known markers" may (not must) be much faster and thus preferable, though still much less convenient. Have fun.
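
Whether scanning really is faster depends on the input and the Ruby version, so it is worth measuring rather than assuming. A rough sketch using the standard Benchmark module on an already cleaned-up copy of the data; the run count and variable names are arbitrary choices for illustration:

```ruby
require 'json'
require 'benchmark'

# Cleaned-up JSON text (no outer quotes, no backslashes, no trailing \0).
clean = '{"@Network":{"command":"Connect","data":{"Id":"xx:xx:xx:xx:xx:xx","Name":"somename","Pwd":"123456789"}}}'

runs = 10_000

# Full parse, then look the value up in the resulting Hash.
json_time = Benchmark.realtime do
  runs.times { JSON.parse(clean)["@Network"]["data"]["Name"] }
end

# Regex scan that only looks for the one value.
regex_time = Benchmark.realtime do
  runs.times { clean[/"Name":"([^"]*)"/, 1] }
end

via_json  = JSON.parse(clean)["@Network"]["data"]["Name"]
via_regex = clean[/"Name":"([^"]*)"/, 1]

puts format("JSON.parse: %.4fs, regex: %.4fs", json_time, regex_time)
```

For an input this small the difference is unlikely to matter in practice; the trade-off only becomes interesting with large payloads or very hot code paths.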

Thanks for the great explanation quetzalcoatl. – neo Aug 14 '13 at 09:55
