I have a collection (of about 61000) strings that look like

"(((((((((.(((((.&.)))))))))))))) 11,26 : 6,20 (-9.37 = -16.05 + 6.56 + 0.13) GCCAACUGACGUUGUU&AAUAAUUCAGUUGGU"

There are a variable number of spaces (1-3) between each part of the string.

Ultimately what I want is to convert this string to a javascript object:

{
    parens: "(((((((((.(((((.&.))))))))))))))",
    sRNAstart: 11,
    sRNAend: 26,
    mRNAstart: 6,
    mRNAend: 20,
    netEnergy: -9.37,
    bindingEnergy: -16.05,
    sRNAOpenEnergy: 6.56,
    mRNAOpenEnergy: 0.13,
    sequences: "GCCAACUGACGUUGUU&AAUAAUUCAGUUGGU"
}

This sounds like a job for RegEx man, but sadly I am not him. Can anyone help me figure out a way to accomplish this?

elsherbini
  • This sounds like a job for a parser, not regex. – Robert Harvey Jul 10 '13 at 19:51
  • The way SO works is that you have to give it a try, and we'll tell you how to fix it, you can't just ask us to do it. Robert Harvey is right, RegEx cannot solve this problem easily without extra parsing code. You need to write your own parser. Anytime you need to do bracket/parentheses matching, that's a sign that RegEx is not the tool for the job – Ruan Mendes Jul 10 '13 at 19:55
  • I appreciate that this question didn't capture the spirit of SO. I just didn't have any idea where to start. Thank you @RobertHarvey for your answer below, I didn't realize `split()` could do this. – elsherbini Jul 10 '13 at 20:37

4 Answers

4

here is a way to use regexp to parse the string, with one internal work-around for those pesky parens:

var s="(((((((((.(((((.&.)))))))))))))) 11,26 : 6,20 (-9.37 = -16.05 + 6.56 + 0.13) GCCAACUGACGUUGUU&AAUAAUUCAGUUGGU";

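// Split on whitespace runs and punctuation. The capturing group keeps the
// delimiters in the result; the filter below throws them away again.
var ob=s.split( /([\s]{1,4}|[,=+:()])/ )
     // keep only chunks that contain a word character (the numbers and the sequence)
     .filter( /./.test, /\w/ )
     // `this` is the single {} passed as map's second argument
     .map(function(chunk, i){
        // the split shreds the bracket string, so read it from the raw input
        if(i===0) this.parens= s.split(" ")[0];
        this[[  "sRNAstart","sRNAend","mRNAstart","mRNAend","netEnergy",
                "bindingEnergy","sRNAOpenEnergy","mRNAOpenEnergy","sequences"
        ][i]]=  +chunk || (chunk==="0"? 0 : chunk);
       return this;
     },{})[0] ; //end ob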


alert(
  JSON.stringify(
    ob,
    null,
    "\t"
  )
);

result:

 {
    "parens": "(((((((((.(((((.&.))))))))))))))",
    "sRNAstart": 11,
    "sRNAend": 26,
    "mRNAstart": 6,
    "mRNAend": 20,
    "netEnergy": -9.37,
    "bindingEnergy": -16.05,
    "sRNAOpenEnergy": 6.56,
    "mRNAOpenEnergy": 0.13,
    "sequences": "GCCAACUGACGUUGUU&AAUAAUUCAGUUGGU"
}

EDIT: removed use of non-capturing parens for more x-browser compat with OLD browsers. EDIT: adjustments: make "0" into 0, avoid setting this.parens each time, formatting, and argument cleanup.
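
For anyone puzzled by the two tricks in the chain above, here is a small sketch on a shortened input. The variable names and the shortened delimiter pattern are made up for illustration; only `split`, `filter` and `RegExp.prototype.test` are standard.

```javascript
// A capturing group in the split pattern keeps the delimiters in the array
// (plus an empty string wherever two delimiters touch).
var pieces = "6,20 (-9.37".split(/(\s+|[,(])/);

// filter(/./.test, /\w/) borrows RegExp.prototype.test and runs it with
// `this` set to /\w/, so it behaves like the explicit callback below.
var keptShort = pieces.filter(/./.test, /\w/);
var keptLong = pieces.filter(function (chunk) {
    return /\w/.test(chunk);
});

console.log(pieces);
console.log(keptShort);
console.log(keptLong);
```

Both filters keep only the chunks that contain a word character, which is how the delimiters and the fragments of the bracket string drop out.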

dandavis
  • It is worth noting that this requires ECMAScript 5's [`Array.filter`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/filter) and [`Array.map`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/map) which can be shimmed. – Xotic750 Jul 10 '13 at 21:32
  • judging from the task described, i'm guessing this is not for a mass-market website but rather a personal utility app running on a browser made within the last 5 years... but thanks for improving the robustness and bringing up a decent point. – dandavis Jul 10 '13 at 21:38
  • Sure, it was just a note to be aware of :). Another note is that in some older browsers [`String.split` does not handle non-participating capture groups](http://blog.stevenlevithan.com/archives/cross-browser-split) correctly either. – Xotic750 Jul 10 '13 at 21:41
  • @Xotic750: more good info. i removed the non-capturing parens just in case, but i'll let folks shim the 1.6/es5 array methods themselves if need be. – dandavis Jul 10 '13 at 21:45
  • All looks good, +1 ;) – Xotic750 Jul 10 '13 at 21:51
3

A Javascript split() with multiple delimiters should yield an array of all of the values you need.

From there, it's simple string concatenation.
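
A minimal sketch of that idea, assuming the field order shown in the question. The delimiter pattern and the names `FIELD_NAMES`, `splitFields` and `toRecord` are assumptions for illustration, not part of the original answer. The bracket string is cut off first, because it contains the same `(` and `)` characters that act as delimiters further along the line.

```javascript
// Field names are taken from the question; the helper names are hypothetical.
var FIELD_NAMES = ["parens", "sRNAstart", "sRNAend", "mRNAstart", "mRNAend",
                   "netEnergy", "bindingEnergy", "sRNAOpenEnergy",
                   "mRNAOpenEnergy", "sequences"];

function splitFields(line) {
    var trimmed = line.replace(/^\s+|\s+$/g, "");
    var gap = trimmed.search(/\s/);          // end of the bracket string
    if (gap === -1) {
        return [trimmed];                    // no whitespace at all
    }
    var fields = [trimmed.slice(0, gap)];
    // Delimiters: runs of whitespace, comma, colon, equals, plus and parens.
    var rest = trimmed.slice(gap).split(/[\s,:=+()]+/);
    for (var i = 0; i < rest.length; i += 1) {
        if (rest[i] !== "") {                // skip the empty leading piece
            fields.push(rest[i]);
        }
    }
    return fields;
}

function toRecord(line) {
    var fields = splitFields(line);
    if (fields.length !== FIELD_NAMES.length) {
        return null;                         // not the expected layout
    }
    var record = {};
    for (var i = 0; i < fields.length; i += 1) {
        // First and last fields stay text; everything between is numeric.
        var isText = (i === 0 || i === fields.length - 1);
        record[FIELD_NAMES[i]] = isText ? fields[i] : Number(fields[i]);
    }
    return record;
}

var line = "(((((((((.(((((.&.)))))))))))))) 11,26 : 6,20 (-9.37 = -16.05 + 6.56 + 0.13) GCCAACUGACGUUGUU&AAUAAUUCAGUUGGU";
console.log(toRecord(line));
```

Because the delimiter is a character class followed by `+`, one, two or three spaces between parts all collapse into a single split point.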

Robert Harvey
1

This expression will not ensure that the parentheses are matched, but it should break out everything in your pattern.

([(.&)]+)\s*(\d+),(\d+)\s*:\s*(\d+),(\d+)\s*\(([-.\d]+)\s*=\s*([-.\d]+)\s*\+\s*([-.\d]+)\s*\+\s*([-.\d]+)\)\s*([GCAU&]+)
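
One way this expression might be applied in JavaScript is sketched below. The pattern is the one above, unchanged; the names `PATTERN`, `KEYS` and `parseLine` are hypothetical, and the property names are the ones from the question.

```javascript
var PATTERN = /([(.&)]+)\s*(\d+),(\d+)\s*:\s*(\d+),(\d+)\s*\(([-.\d]+)\s*=\s*([-.\d]+)\s*\+\s*([-.\d]+)\s*\+\s*([-.\d]+)\)\s*([GCAU&]+)/;

// Property names from the question, in capture-group order.
var KEYS = ["parens", "sRNAstart", "sRNAend", "mRNAstart", "mRNAend",
            "netEnergy", "bindingEnergy", "sRNAOpenEnergy", "mRNAOpenEnergy",
            "sequences"];

function parseLine(line) {
    var match = PATTERN.exec(line);
    if (!match) {
        return null;                 // line does not follow the pattern
    }
    var record = {};
    for (var i = 0; i < KEYS.length; i += 1) {
        var text = match[i + 1];     // match[0] is the whole matched text
        // The first and last groups are text; the eight between are numbers.
        var isText = (i === 0 || i === KEYS.length - 1);
        record[KEYS[i]] = isText ? text : parseFloat(text);
    }
    return record;
}

var sample = "(((((((((.(((((.&.)))))))))))))) 11,26 : 6,20 (-9.37 = -16.05 + 6.56 + 0.13) GCCAACUGACGUUGUU&AAUAAUUCAGUUGGU";
console.log(parseLine(sample));
```

Note that the pattern is not anchored, so a sequence containing letters other than G, C, A and U would be cut short rather than rejected; adding `^` and `$` would make it strict.
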
Brigham
1

Here is an alternative that should also work for you and is cross-browser.

Javascript

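function parse(string) {
    if (typeof string !== "string") {
        throw new TypeError("Argument must be a string.");
    }

    var props = ["parens", "sRNAstart", "sRNAend", "mRNAstart", "mRNAend", "netEnergy", "bindingEnergy", "sRNAOpenEnergy", "mRNAOpenEnergy", "sequences"],
        trimmed = string.replace(/^\s+|\s+$/g, ""),
        gap = trimmed.search(/\s/),
        // Cut the bracket string off first: splitting the whole line would let
        // the optional ")" in the delimiter pattern swallow its last character.
        array = [trimmed.slice(0, gap)].concat(
            trimmed.slice(gap).replace(/^\s+/, "").split(/[)]?\s+[(:=+]?\s*|,/)
        ),
        object = {},
        value;

    if (array.length !== props.length) {
        throw new Error("String could not be converted.");
    }

    do {
        value = array.shift();
        // Convert numeric fields (including "0") and keep everything else as text.
        object[props.shift()] = isNaN(value) ? value : +value;
    } while (props.length);

    return object;
}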

var ref = "(((((((((.(((((.&.)))))))))))))) 11,26 : 6,20 (-9.37 = -16.05 + 6.56 + 0.13) GCCAACUGACGUUGUU&AAUAAUUCAGUUGGU";

for(var i = 0; i < 3; i += 1) {
    console.log(ref, parse(ref));
    ref = ref.replace(/(\s+)/g, function (all, whitespace) {
        return whitespace + " ";
    });
}


Xotic750