
I'm trying to capture dialogue from a novel -- any text that appears within quotation marks.

My problem is that when a quotation spans paragraphs, it's traditional to have a new quotation mark begin each paragraph, even though the previous set wasn't closed. For example:

The letter was to this effect:

"My dear Lizzy,

"I wish you joy. If you love Mr. Darcy half as well as I do my dear Wickham, you must be very happy. It is a great comfort to have you so rich, and when you have nothing else to do, I hope you will think of us. I am sure Wickham would like a place at court very much, and I do not think we shall have quite money enough to live upon without some help. Any place would do, of about three or four hundred a year; but however, do not speak to Mr. Darcy about it, if you had rather not.

"Yours, etc."

The regex I've been using (JS style) is

(?=["'])(?:"[^"\\]*(?:\\[\s\S][^"\\]*)*"|'[^'\\]*(?:\\[\s\S][^'\\]*)*')

and it doesn't account for this. I'm not sure what I can do to handle this problem and would love a tip. (And it's not important that a single quotation be a single group, just that all quotations get captured -- the letter example above could be three groups.)

It may help that in my text, each line is a paragraph, and a paragraph never contains newlines. So if a line ends with a quotation still open, and the next line begins with a quotation mark, that could work? But that's beyond my ability to express in regex; I'm very new to it.

GreenTriangle
@Mandy8055 Unfortunately that doesn't seem to grab separate quotes on one line (like https://regex101.com/r/FA7oW1/3), but it helps to see, so thank you! – GreenTriangle May 22 '20 at 05:07

1 Answer


You can use the regex below:

(?=["'])"([^"\\\n]*(?:\\[\s\S][^"\\\n]*)*)[",.!]|'([^'\\\n]*(?:\\[\s\S][^'\\\n]*)*)[',.!]

Explanation of above regex:

(?=["']) - A positive lookahead that asserts the next character is " or '.

"([^"\\\n]*(?:\\[\s\S][^"\\\n]*)*) - Matches the opening " and then captures everything after it up to, but not including, a closing " or a newline. A backslash followed by any character is consumed as an escaped pair.

[",.!] - Matches the terminating character: ", ,, . or !. The punctuation alternatives let a paragraph that never closes its quotation still match. You can add other terminating symbols to this class if you like.

| - Represents alternation.

'([^'\\\n]*(?:\\[\s\S][^'\\\n]*)*)[',.!] - Same as above, except that it matches single-quoted dialogue.
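
To make the pieces concrete, here is a minimal sketch that runs only the double-quote branch against two paragraphs, one left open and one closed. The names `doubleQuoted`, `sample` and `captured` are illustrative and not part of the original answer.

```javascript
// Double-quote branch only: opening ", captured body, then a terminator.
const doubleQuoted = /"([^"\\\n]*(?:\\[\s\S][^"\\\n]*)*)[",.!]/g;

// Two paragraphs: the first never closes its quotation, the second does.
const sample = '"My dear Lizzy,\n"Yours, etc."';

// matchAll yields one match per quotation; group 1 is the text inside it.
const captured = [...sample.matchAll(doubleQuoted)].map((m) => m[1]);
console.log(captured); // → [ 'My dear Lizzy', 'Yours, etc.' ]
```

Note that the terminator is not part of the capture, so an unclosed paragraph is cut at its last `,`, `.` or `!` and loses that final character.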

You can see the demo of the above regex here.

IMPLEMENTATION IN JAVASCRIPT:

// Same pattern as above: group 1 holds double-quoted text, group 2 single-quoted text.
const myRegexp = /(?=["'])"([^"\\\n]*(?:\\[\s\S][^"\\\n]*)*)[",.!]|'([^'\\\n]*(?:\\[\s\S][^'\\\n]*)*)[',.!]/gm;
const myString = `"Then", he said, "we go home."

"My dear Lizzy,

"I wish you joy. If you love Mr. Darcy half as well as I do my dear Wickham, you must be very happy. It is a great comfort to have you so rich, and when you have nothing else to do, I hope you will think of us. I am sure Wickham would like a place at court very much, and I do not think we shall have quite money enough to live upon without some help. Any place would do, of about three or four hundred a year; but however, do not speak to Mr. Darcy about it, if you had rather not.

"Yours, etc."

"Hello World"
"Hey!Theererererzffzfzfzfz zfzfbcnhdzxghxhxhxhx"

Hello World!I am "Some random text"

"Hey There! This is Some Text!!!! which does not contain quotes.

"thefbjbbssbjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjcbsjcb cbcslablaffblafblbflabflbfabfalafbalbabfaflaflzbzbhavsjdjdbeblbvsbvskbv"

'Hello!!! This is single quoted example.'

'Hello!!!' This is single quoted example.'Here is the one.
`;
let tempString = "";

// Walk every match; only one of the two groups is set for any given match.
let match = myRegexp.exec(myString);
while (match !== null) {
  const groupMatch = match[1] != null ? match[1] : match[2];
  tempString += groupMatch + "\n";
  match = myRegexp.exec(myString);
}
console.log(tempString);
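
The question also observes that each line of the text is one paragraph. Under that assumption, a non-regex alternative is to split each line on the quotation mark, which is a different technique from the pattern above. The following is only a sketch: `extractQuotes` is a hypothetical helper, it handles double quotes only (single quotes collide with apostrophes), and it ignores backslash escapes.

```javascript
// Hypothetical helper: collect double-quoted spans paragraph by paragraph.
// Assumes each line is one paragraph and that only " delimits dialogue.
function extractQuotes(text) {
  const quotes = [];
  for (const line of text.split("\n")) {
    // Splitting on " alternates outside text (even index) and quoted text (odd index).
    const parts = line.split('"');
    for (let i = 1; i < parts.length; i += 2) {
      // The last odd-indexed part may be unclosed, in which case it
      // simply runs to the end of the paragraph.
      if (parts[i] !== "") quotes.push(parts[i]);
    }
  }
  return quotes;
}

const letter = '"Then", he said, "we go home."\n\n"My dear Lizzy,\n\n"Yours, etc."';
console.log(extractQuotes(letter));
// → [ 'Then', 'we go home.', 'My dear Lizzy,', 'Yours, etc.' ]
```

Because the open/closed state resets on every line, a paragraph that opens a quotation without closing it does not affect the next paragraph.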

Note

If this answer helped, please read this wonderful answer, which is the basis of mine and is one of the most efficient in terms of performance.
