3

I am wondering how to extract words (substrings) from a string when they sit between two specific characters. In my case, the start character is a whitespace and the final character is a comma, like so:

var str = "Hit that thing man! and a one, two, three, four, five, six, seven or eight";

Result:

var result = ["one", "two", "three", "four", "five", "six", "seven", "eight"];

I am wondering if a regex is possible, or perhaps good old JavaScript will be the straightforward solution.

I have tried the following so far:

var result = str.split(/[,\s]+/);

But to no avail, since it behaves incorrectly in two ways:

  1. Grabs the entire string before one.
  2. Grabs the space before the desired letter.
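
A quick check of what that call returns on the sample string (a minimal sketch; `parts` is an illustrative name, not part of the attempt above). Every word comes back, including the ones before one, and so does the or:

```javascript
var str = "Hit that thing man! and a one, two, three, four, five, six, seven or eight";

// Splitting on runs of commas/whitespace returns every word in the sentence,
// not only the ones that sit between a space and a comma.
var parts = str.split(/[,\s]+/);

console.log(parts.length);      // 15
console.log(parts.slice(0, 7)); // the six leading words, then "one"
```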

Bonus round: Can I include the last letter eight in the result by adding to the desired regex/javascript solution?

Any help is very appreciated!

AGE
  • 1
    A regular expression is definitely possible. – Pointy Oct 16 '15 at 14:31
  • Do you use only Latin characters in your text? – VisioN Oct 16 '15 at 14:31
  • Yes and I expect all these characters to be in the exact same format, meaning space followed by string followed by comma, which interestingly could mean a range of languages – AGE Oct 16 '15 at 14:31
  • 1
    And how, according to your definition of what you want, did `eight` end up in the result array? – Tomáš Zato Oct 16 '15 at 14:36
  • that is correct for the *bonus round* since it would be awesome to know how to do this with regex all in one shot, however I am most interested in the regex solution to the original question. – AGE Oct 16 '15 at 14:38
  • 1
    eight is not a letter ! it's a word ! – Onilol Oct 16 '15 at 14:39
  • 1
    Well, I created a regex that works with your sentence and, unlike some others, doesn't fail at the end of the string (e.g. `"one, two, three"` matches all three). – Tomáš Zato Oct 16 '15 at 14:44
  • @Onilol ... plop so true – AGE Oct 16 '15 at 14:56

3 Answers

2

TLDR: regex101.com

Why not just get all matches? It seems simpler than splitting the string.

var re = /(?:^|\s)([^,\s]+)(?:,|$| or)/g,
    s = "Hit that thing man! and a one, two, three, four, five, six, seven or eight",
    m,
    matches = [];

// Matches once and then as long as there are some matches
do {
    m = re.exec(s);
    if (m) {
        matches.push(m[1]);
    }
} while (m);

// Log the collected words; m itself is null once the loop has finished
console.log(matches);

This produces:

["one", "two", "three", "four", "five", "six", "seven", "eight"]

If you don't want to match on or, just remove it:

/(?:^|[\s])([^,\s]+)(?:,|$)/g

And you can also add "and", which often appears instead of "or" in such lists:

/(?:^|[\s])([^,\s]+)(?:,|$| and| or)/g

The ^ and $ anchors allow matches at the beginning and end of the string.
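
If you need this in more than one place, the loop can be wrapped in a function. This is only a minimal sketch: the name `extractListWords` is made up for illustration, and the regex is the first one from this answer. Creating the regex inside the function keeps the `lastIndex` state of the `g` flag from leaking between calls.

```javascript
// Hypothetical helper name; the pattern is the first regex shown in this answer.
function extractListWords(text) {
    // A fresh regex per call, so lastIndex always starts at 0
    var re = /(?:^|\s)([^,\s]+)(?:,|$| or)/g,
        matches = [],
        m;

    // exec() returns null when there are no more matches
    while ((m = re.exec(text)) !== null) {
        matches.push(m[1]); // m[1] is the captured word without space or comma
    }
    return matches;
}

var fromSentence = extractListWords("Hit that thing man! and a one, two, three, four, five, six, seven or eight");
var fromShortList = extractListWords("one, two, three");

console.log(fromSentence);
console.log(fromShortList);
```

The second call shows that a list running from the very start to the very end of the string still matches all of its items.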

Tomáš Zato
  • Your regex includes commas, edit it not to include them on the result, nice answer FYI – AGE Oct 16 '15 at 14:46
  • My regex does not include commas in matches. Just run the code and don't call me "*Edit it ..." as if I was your employee. – Tomáš Zato Oct 16 '15 at 14:47
  • If you use `Hit that or TEST thing man! and a one, two, three, four, five, six, seven or eight` then `that` is selected too ...? https://regex101.com/r/cT2pQ9/2 – davidkonrad Oct 16 '15 at 14:49
  • Of course, in my solution `or` has the same role as comma. – Tomáš Zato Oct 16 '15 at 14:49
  • @TomášZato, OK - just wondered, I did not exactly understand the question as such - why is TEST not selected but the string prior to `or`? Anyway, I'm not good at regex - if it solves OP's question, then it is correct. – davidkonrad Oct 16 '15 at 14:52
  • @davidkonrad this one and the previous answers all solve the problem, Tomas simply explained his in a versatile way which is nice in more than a regex way and that really does address the question as a whole – AGE Oct 16 '15 at 15:01
  • 1
    You don't need to put `\s` in brackets, it already is a character class by itself, so `[\s]` is the same as `\s` – Aaron Oct 16 '15 at 15:02
  • @AaronGOUZIT I originally intended to put some more characters there, like `(`, which may be attached to a word without a space. – Tomáš Zato Oct 16 '15 at 15:08
  • @AGE, yes - I just wondered why `that` was selected, not `TEST`, that was all. Now I realize that `,` and `or` are equivalent in the regex - `Hit that thing man! and a one or two or three or four or five or six or seven , eight` produces the same result as the original `str`... https://regex101.com/r/cT2pQ9/3 – davidkonrad Oct 16 '15 at 15:10
  • 1
    @TomášZato sure, I just think you should edit your answer so nobody thinks \s can only be used in brackets. – Aaron Oct 16 '15 at 15:10
1
str.match(/\b[A-Za-z]+(?=(, )|( or )|$)/g)

It matches a word from its start if this word is followed by a comma, the word "or" or the end of the text.

You can try it here.
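
A quick check against the question's string, as a minimal sketch (`words` is an illustrative name). It uses `[A-Za-z]` for the word part; a range written as `[A-z]` would also accept the ASCII characters that sit between `Z` and `a`, such as the underscore, which the last two lines demonstrate.

```javascript
var str = "Hit that thing man! and a one, two, three, four, five, six, seven or eight";

// A word, starting at a word boundary, that is followed by ", ", " or " or the end of the text
var words = str.match(/\b[A-Za-z]+(?=(, )|( or )|$)/g);
console.log(words);

// The range A-z spans ASCII 65..122, which includes [ \ ] ^ _ ` between Z and a
console.log(/[A-z]/.test("_"));    // true
console.log(/[A-Za-z]/.test("_")); // false
```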

Aaron
  • I noted how on your regex101 link, it captured the eight, but it did not include it on the console.log when I tested it myself, care to explain why? – AGE Oct 16 '15 at 14:44
  • @AGE That is strange, it works when I test it in my console under Chrome. Did you make sure the eight is at the end of the string? That's the criterion to match it in my regex – Aaron Oct 16 '15 at 14:46
  • I did actually, a few times, to be really sure, since you were the first one to completely cover the answer to the question. I granted someone else the correct answer since they also got the bonus round 100% right. Feel free to look here and let me know if I messed up, because otherwise you deserve it: http://jsfiddle.net/AGE/7usjzk3w/ – AGE Oct 16 '15 at 14:53
  • @AGE your str variable does not contain the "eight", so no wonder the pattern doesn't match it ;) It works when I add " or eight" at the end of the string. – Aaron Oct 16 '15 at 14:56
  • @AGE although it won't work with your question variable since it ends with a question mark rather than the last word you want to match. You can fix this by replacing the "$" in my regex with a "\?" : `str.match(/\b[A-Za-z]+(?=(, )|( or )|\?)/g)` – Aaron Oct 16 '15 at 14:58
  • as expected it was my bad completely I forgot to include 'and eight' in my test, everything happened so fast, now to determine who answered it first :) – AGE Oct 16 '15 at 14:59
  • 1
    @AGE No problem as long as you have been answered, it is the important part ;) – Aaron Oct 16 '15 at 15:00
1

The final or is the only actual problem, because JavaScript did not support lookbehinds when this was written (they were added later, in ES2018). Without them you cannot use a single regex to match only the words "between two specific characters" - you always end up with at least the left one in your result.

I came up with this: mangle the string into form by replacing or with a comma and adding one to the end. Then it's a straightforward regex:

var result = str.concat(',').replace(' or ',',').match(/\w+(?=,)/g);

It cannot work with split because that would assign the entire first part of the sentence to one.
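
Two variations, as a minimal sketch under stated assumptions (the variable names are illustrative). The first keeps the same approach but passes a global regex to `replace`, since a string pattern only replaces the first " or ". The second assumes an engine that implements ES2018 lookbehind assertions, in which case no mangling is needed.

```javascript
var str = "Hit that thing man! and a one, two, three, four, five, six, seven or eight";

// Same idea as above, but / or /g turns every " or " into a comma,
// not just the first occurrence
var mangled = str.concat(',').replace(/ or /g, ',').match(/\w+(?=,)/g);

// Assumes ES2018 lookbehind support: a run of non-comma, non-space characters
// preceded by whitespace and followed by a comma, the end of the string or " or "
var viaLookbehind = str.match(/(?<=\s)[^,\s]+(?=,|$| or )/g);

console.log(mangled);
console.log(viaLookbehind);
```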

Jongware
  • @AGE: according to my test it should also extract `eight` from your original test string. Note the `concat` that adds a comma to the end, exactly for that purpose, so it fulfils the condition `\w+(?=,)`. – Jongware Oct 16 '15 at 14:57