2

I have a regex that splits my string into an array.

Everything works fine, except that I would like to keep part of the delimiter.

Here is my regex:

(&#?[a-zA-Z0-9]+;)[\s]

In JavaScript, I am doing:

var test = paragraph.split(/(&#?[a-zA-Z0-9]+;)[\s]/g);

My paragraph is as follows:

Current addresses:  &dagger;    Biopharmaceutical Research and Development<br />
&Dagger;    Clovis Oncology<br />
&sect;  Pisces Molecular <br />
||  School of Biological Sciences    
&para;  Department of Chemistry<br />

The problem is that I am getting 10 elements in my array and not 5 as I should. In fact, I am also getting my delimiter as an element, and my goal is to keep the delimiter with the split element rather than create a new one.

Thank you very much for your help.

EDIT:

I would like to get this as a result:

1. &dagger; Biopharmaceutical Research and Development<br />
2. &Dagger; Clovis Oncology<br />
3. &sect;  Pisces Molecular <br />
||  School of Biological Sciences  
4.  &para;  Department of Chemistry<br />
Ωmega
Milos Cuculovic
  • [This has been asked before.](http://stackoverflow.com/questions/7310725/javascript-split-include-delimiters) – Elliot Bonneville Sep 07 '12 at 11:48
  • @ElliotBonneville: Where ? I can't get the solution. – Milos Cuculovic Sep 07 '12 at 11:49
  • Oh sorry, I haven't seen that it's a link. – Milos Cuculovic Sep 07 '12 at 11:51
  • I already saw that post, but I can't find the answer to my problem there. – Milos Cuculovic Sep 07 '12 at 11:51
  • are you trying to create a [template](http://stackoverflow.com/a/378001/36866)? – some Sep 07 '12 at 11:51
  • @some, not at all, I am only trying to parse some info from the paragraph – Milos Cuculovic Sep 07 '12 at 11:52
  • The top answer provides a RegEx that should solve your problem, and the other posters say that it's a Bad Idea (with capitals) to use RegEx on HTML. – Elliot Bonneville Sep 07 '12 at 11:52
  • @ElliotBonneville, I can't make it work. My problem is that I am getting the delimiter as a new element and I don't know why. As for regex and HTML, this is not really HTML; I am only handling `<br />`, nothing else. – Milos Cuculovic Sep 07 '12 at 11:55
  • If you're really just managing `<br />` elements then I might have a solution for you. If it ever gets more complicated though, my solution would fall apart. – Elliot Bonneville Sep 07 '12 at 11:56
  • Only `<br />` and special characters can be in the HTML. My goal is to split the string into elements starting with special characters (&xxxx;). I can do it with my regex, but the problem is that I am also catching the delimiter as a new element. – Milos Cuculovic Sep 07 '12 at 11:58
  • The reason you get 10 instead of five is: [ECMAScript262:5](http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf) 15.5.4.14 "If separator is a regular expression that contains capturing parentheses, then each time separator is matched the results (including any undefined results) of the capturing parentheses are spliced into the output array." – some Sep 07 '12 at 12:09
  • Thank you @some, do you maybe have an idea on how to avoid that ? – Milos Cuculovic Sep 07 '12 at 12:12
  • Could you give an example of what result you want? I'm thinking of using `match` instead. – some Sep 07 '12 at 12:13
  • a.match(/&#?[a-zA-Z0-9]+;[^&]*/g); – some Sep 07 '12 at 12:19
  • Answered the same answer. Take a look: http://stackoverflow.com/a/18582159/721704 – Berezh Sep 02 '13 at 23:52
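
The behaviour quoted from the ECMAScript specification in the comments above can be reproduced with a short sketch. The `sample` string is a shortened stand-in for the paragraph in the question, and the variable names are only illustrative. The last call uses a lookahead, which is not proposed anywhere in this thread; it is shown as one possible alternative that keeps each delimiter attached to the text that follows it.

```javascript
// Shortened, hypothetical stand-in for the paragraph in the question.
var sample = "&dagger; Research<br />\n&Dagger; Clovis Oncology<br />\n&para; Chemistry<br />";

// Capturing group: every matched delimiter is spliced into the result array.
var withCapture = sample.split(/(&#?[a-zA-Z0-9]+;)\s/);

// Non-capturing group (?:...): the delimiter is consumed and dropped.
var withoutCapture = sample.split(/(?:&#?[a-zA-Z0-9]+;)\s/);

// Lookahead (?=...): splits at the position just before each delimiter,
// so the delimiter stays at the start of the following element.
var withLookahead = sample.split(/(?=&#?[a-zA-Z0-9]+;\s)/);

console.log(withCapture.length);    // 7 (leading "" plus 3 delimiters plus 3 texts)
console.log(withoutCapture.length); // 4 (leading "" plus 3 texts)
console.log(withLookahead.length);  // 3
console.log(withLookahead);
```

The leading empty string in the first two results appears because the sample starts with a delimiter. The lookahead version avoids it because a zero-width match at the very start of the string does not produce a split.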

3 Answers

1

As I said in the comment, this solution (untested, by the way) will only work if you're just managing <br /> elements. Here:

var text = paragraph.split("<br />"); // now text contains just the text on each line

for(var i = 0; i<text.length-1; i++) { // don't want to add a line break to our last line
    text[i] += " <br />"; // re-append the <br /> that split() removed from each line
}

The variable text is now an array, where each element of the array is a line of the original paragraph. The linebreaks (<br />) have been added back on the end of each line. You just mentioned that you want to split on the special characters, but from what I see, each line ends in a line break, so this should hopefully have the same effect. Unfortunately I don't have the time to write up a more complete answer at the moment.

Elliot Bonneville
  • Thank you, but I think you have misunderstood my question. I know how to split my string with a special character as the delimiter. We can forget the `<br />`. I only need to split the string into elements starting with special characters and keep those special characters in the element. – Milos Cuculovic Sep 07 '12 at 12:09
  • I was afraid of that, as I didn't have much time to answer the question. – Elliot Bonneville Sep 07 '12 at 16:37
1

Try to use match instead:

var test = paragraph.match(/&#?[a-zA-Z0-9]+;\s[^&]*/g);

Updated: Added a required white-space \s match.

Explanation:

  • &#? Match & and an optional # (the question mark matches the previous token zero or one times).

  • [a-zA-Z0-9] is a character class matching all upper- and lower-case ASCII letters and digits. If you also accept an underscore you could replace this with \w.

  • The + sign means that it should match the last pattern one or more times, so it matches one or more characters a-z, A-Z and digits 0-9.

  • The ; matches the character ;.

  • The \s matches the class white-space. That includes space, tab and other white-space characters.

  • [^&]* Once again a character class, but since ^ is the first character the class is negated, so instead of matching & it matches any character except &. The star matches the pattern zero or more times.

  • g at the end, after the last / means global, and makes the match continue after the first match and get an array of all matches.

So: match & and an optional #, followed by one or more letters or digits, followed by ;, followed by a white-space character, followed by zero or more characters that aren't &.
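
As a rough check of the explanation above, here is a minimal sketch that runs the same `match` call against the paragraph from the question. The string is retyped from the question, so the exact whitespace is an assumption, and the names `parts` and `trimmed` are only illustrative.

```javascript
// Paragraph retyped from the question; exact whitespace is an assumption.
var paragraph =
  "Current addresses:  &dagger;    Biopharmaceutical Research and Development<br />\n" +
  "&Dagger;    Clovis Oncology<br />\n" +
  "&sect;  Pisces Molecular <br />\n" +
  "||  School of Biological Sciences    \n" +
  "&para;  Department of Chemistry<br />";

// match() returns null when nothing matches, so fall back to an empty array.
var parts = paragraph.match(/&#?[a-zA-Z0-9]+;\s[^&]*/g) || [];

// Optional: strip the trailing newline and padding from each element.
var trimmed = parts.map(function (s) { return s.trim(); });

console.log(parts.length); // 4
console.log(trimmed);
```

The text before the first entity ("Current addresses:") is not part of any match, so it is dropped. The "||" line has no entity of its own and therefore stays inside the `&sect;` element.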

some
  • Thank you very much. If possible, can I also check that there is a space after the special character as a separator? – Milos Cuculovic Sep 07 '12 at 12:28
  • 1
    @Milos Do you want space (0x20) or any white-space (space, tab, form feed, line feed and other unicode spaces)? – some Sep 07 '12 at 12:34
  • Great, thank you very much, that's exactly what I need. If possible, could you please give me a short explanation of how the regex you gave me works? I know I am taking up your time, sorry, but I would like to understand it, not only copy and paste. :) – Milos Cuculovic Sep 07 '12 at 12:46
  • 1
    @Milos Excellent that you want to understand it! Tell me if something isn't clear. – some Sep 07 '12 at 13:09
  • Thank you very much, that's a clear and efficient answer. I will try to get better at RegEx. Thank you once again. – Milos Cuculovic Sep 07 '12 at 13:14
  • 1
    @Milos No problem! By the way, you can use [regexpal](http://www.regexpal.com/) to play around with regexps and see in realtime what it matches. – some Sep 07 '12 at 13:18
1

Using regex it is pretty simple:

var result = input.match(/&#?[^\W_]+;\s[^&]*/g);

Test it here.
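
The class `[^\W_]` in the regex above reads as "not a non-word character and not an underscore", which for ASCII input should be the same set as `[a-zA-Z0-9]`. A small sketch to check that claim, using illustrative variable names and testing only code points 0 to 127:

```javascript
// Two ways of writing "one ASCII letter or digit".
var explicitClass = /^[a-zA-Z0-9]$/;
var negatedClass = /^[^\W_]$/;

// Compare both classes on every ASCII character.
var allAgree = true;
for (var code = 0; code < 128; code++) {
  var ch = String.fromCharCode(code);
  if (explicitClass.test(ch) !== negatedClass.test(ch)) {
    allAgree = false;
  }
}

console.log(allAgree);               // true
console.log(negatedClass.test("_")); // false: the underscore is excluded
```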

Ωmega