Sorry if this has been asked before, but I'm trying to get an array of words from a string like this:

"Exclamation! Question? \"Quotes.\" 'Apostrophe'. Wasn't. 'Couldn't'. \"Didn't\"."

The array is supposed to look like this:

[
  "exclamation",
  "question",
  "quotes",
  "apostrophe",
  "wasn't",
  "couldn't",
  "didn't"
]

Currently I'm using this expression:

sentence.toLowerCase().replace(/[^\w\s]/gi, "").split(" ");

The problem is, it removes apostrophes from words like "wasn't", turning it into "wasnt".

I can't figure out how to keep the apostrophes in words such as that.

Any help would be greatly appreciated!

var sentence = "Exclamation! Question? \"Quotes.\" 'Apostrophe'. Wasn't. 'Couldn't'. \"Didn't\".";
console.log(sentence.toLowerCase().replace(/[^\w\s]/gi, "").split(" "));
MysteryPancake
  • What should happen to `"\"Couldn't do\""` or `"'Couldn't do'"`? – Bergi Apr 08 '18 at 13:21
  • Try splitting on whitespace and then removing punctuation at the start and end of each individual word. – Bergi Apr 08 '18 at 13:22
  • @Bergi I'm trying to only get the words, so in both of those cases it would be "couldn't" and "do" – MysteryPancake Apr 08 '18 at 13:22
  • @DarrenSweeney I'm not replacing spaces with no spaces, only the characters I don't want. The current expression works, it just removes the apostrophes as well. – MysteryPancake Apr 08 '18 at 13:23
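
Bergi's suggestion in the comments above (split on whitespace, then strip punctuation from the start and end of each word) could be sketched roughly as follows. The helper name `wordsByTrimming` is made up for illustration and is not from the thread; this is one possible reading of the suggestion, not necessarily what the commenter had in mind.

```javascript
// Hypothetical helper (name not from the thread): split on whitespace,
// then trim non-word characters from both ends of each token.
function wordsByTrimming(text) {
  return text
    .toLowerCase()
    .split(/\s+/)
    // ^\W+ strips leading punctuation, \W+$ strips trailing punctuation;
    // apostrophes inside a word are left alone.
    .map((token) => token.replace(/^\W+|\W+$/g, ""))
    // Drop tokens that consisted only of punctuation.
    .filter((token) => token.length > 0);
}

const sample = "Exclamation! Question? \"Quotes.\" 'Apostrophe'. Wasn't. 'Couldn't'. \"Didn't\".";
console.log(wordsByTrimming(sample));
```

Note that this keeps any punctuation inside a token, so something like `well-known` would stay as one word, which may or may not be what you want.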

2 Answers

Adapting your replace-based approach would be tricky. Instead, you could match the words directly and treat an apostrophe as part of a word only when it sits between word characters:

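const sentence = `"Exclamation! Question? \"Quotes.\" 'Apostrophe'. Wasn't. 'Couldn't'. \"Didn't\"."`;
console.log(
    // Lowercase first, then match word characters optionally joined by apostrophes.
    sentence.toLowerCase().match(/\w+(?:'\w+)*/g)
);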

Note: the quantifier on the group was changed from ? to * so that words containing more than one apostrophe are also matched as a single word.
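
To make the effect of that quantifier change concrete, here is a small sketch comparing the two versions of the pattern. The sample sentence and the variable names are invented for illustration.

```javascript
// The same pattern with the two quantifiers discussed in the note above.
const optionalOnce = /\w+(?:'\w+)?/g; // at most one apostrophe group per word
const repeated = /\w+(?:'\w+)*/g;     // any number of apostrophe groups per word

// Hypothetical sample containing a double contraction.
const doubleContraction = "She wouldn't've gone.".toLowerCase();

console.log(doubleContraction.match(optionalOnce)); // [ 'she', "wouldn't", 've', 'gone' ]
console.log(doubleContraction.match(repeated));     // [ 'she', "wouldn't've", 'gone' ]
```

With `?` the second contraction is split off as its own match; with `*` the whole word stays together.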

revo
  • You don't have to use my horrible solution. The expression was stolen from another post anyway. Whatever makes it easier for you. Thanks a lot for this snippet! – MysteryPancake Apr 08 '18 at 13:25
  • By that I wanted to say your solution can be improved but I didn't go with it since it wouldn't be a good solution. – revo Apr 08 '18 at 13:36
  • Thanks a lot, this is a very good answer, but I felt I had to accept Jeto's since it allows for multiple apostrophes. I want to allow that just in case. https://en.wiktionary.org/wiki/Category:English_double_contractions – MysteryPancake Apr 09 '18 at 00:21
  • Also, I escaped all the quotes with \", so you probably don't need to add backticks around the string. – MysteryPancake Apr 09 '18 at 00:22
  • @MysteryPancake Please see update. You only had to change `?` to `*`. Additionally, this way you are sure substrings like `I''''''''''''ve` which have invalid sequence of apostrophes are not matched. – revo Apr 09 '18 at 07:17

@revo's answer looks good; here's another option that should work too:

const input = "Exclamation! Question? \"Quotes.\" 'Apostrophe'. Wasn't. 'Couldn't'. \"Didn't\".";
console.log(input.toLowerCase().match(/\b[\w']+\b/g));

Explanation:

  • \b matches at a word boundary (the beginning or end of a word),
  • [\w']+ matches one or more letters, digits, underscores or apostrophes (to omit underscores, you can use [a-zA-Z0-9'] instead),
  • the g flag makes match() return every occurrence of the pattern (not just the first one).
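
One caveat that may be worth checking before switching to the `[a-zA-Z0-9']` variant: `\b` still treats an underscore as a word character, so a token containing an underscore is not split at the underscore. As far as I can tell it is skipped entirely, because no match can both start and end on a word boundary. A small sketch, with a made-up sample string and variable names:

```javascript
const withUnderscore = /\b[\w']+\b/g;           // pattern from the answer
const lettersDigitsOnly = /\b[a-zA-Z0-9']+\b/g; // variant that excludes underscores

// Hypothetical input containing an underscore.
const mixed = "snake_case isn't a word";

console.log(mixed.match(withUnderscore));    // [ 'snake_case', "isn't", 'a', 'word' ]
console.log(mixed.match(lettersDigitsOnly)); // [ "isn't", 'a', 'word' ]
```

If the input never contains underscores, both patterns behave the same.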
Jeto
  • thanks a lot! Sorry, I'm horrible at regex, could you maybe explain the difference between yours and his? I'm not sure which one to use – MysteryPancake Apr 08 '18 at 13:32
  • @MysteryPancake Sure, I'll start by adding some comments explaining how this one works. – Jeto Apr 08 '18 at 13:33
  • @MysteryPancake revo’s regex matches words (like `couldn` or `quotes`) followed by an optional `'` and another word (like `'t`). Jeto’s regex matches any word character (e.g. letters) and apostrophes between two word boundaries, i.e. any combination of letters and apostrophes from where a word starts to where it ends. This would also allow `couldn''t`, revo’s solution wouldn’t. – Sebastian Simon Apr 08 '18 at 13:36
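
The difference described in the last comment can be checked with a short sketch that runs both patterns from the answers on the same input. The function name `compareOn` and the test string are invented for illustration.

```javascript
const revoPattern = /\w+(?:'\w+)*/g; // apostrophes only between word characters
const jetoPattern = /\b[\w']+\b/g;   // any run of word characters and apostrophes

// Hypothetical helper: run both patterns on the same lowercased input.
function compareOn(text) {
  const lower = text.toLowerCase();
  return {
    revo: lower.match(revoPattern),
    jeto: lower.match(jetoPattern),
  };
}

console.log(compareOn("Couldn''t"));
// revo: [ 'couldn', 't' ]   (the doubled apostrophe splits the word)
// jeto: [ "couldn''t" ]     (the doubled apostrophe is kept)
```

On the sentence from the question, both patterns return the same seven words; they only differ on inputs with consecutive apostrophes like this one.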