I am trying to construct a regex to be used in a Python program with the following constraint:

Check if there is any substring within quotes with at least 3 words (separated by spaces). Below are some examples.

"Hello word \"Foo bat baz kay \" exit"

This should return true, since the quoted substring "Foo bat baz kay" contains at least 3 words.

" Hello hello \" world \" exit"

should return false.

Based on some investigation I was able to divide the problem into two separate parts

  1. Find a regex to get all substrings within quotes, so something like

    re.findall(r'"(.*?)"', s)

  2. Find a regex to get all strings with more than one word

    ^\s*[A-Za-z0-9]+(?:\s+[A-Za-z0-9]+)*\s*$

I tried to put them together, but it doesn't yield the expected result. Sorry, I am new to regex, so I am probably not doing this right. These ideas are compiled from the posts "RegEx: Grabbing values between quotation marks" and "Regex for one or more words separated by spaces". Here is the partial code:

    import re

    s = "Some string with quotes : \"Hello world example\" and another quote is \" hello world\" done"
    print(re.findall(r'"(^\s*[A-Za-z0-9]+(?:\s+[A-Za-z0-9]+)*\s*$)"', s))

Please advise. Appreciate your help!
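
A note on why the combined pattern above likely returns an empty list: without `re.MULTILINE`, `^` and `$` only match at the start and end of the whole string, so they cannot match directly after or before a `"` in the middle of the text. The sketch below puts the pattern from the question next to a variant with the anchors dropped. The variable names are only illustrative, and the unanchored variant still does not check how many words each match contains.

```python
import re

s = 'Some string with quotes : "Hello world example" and another quote is " hello world" done'

# Pattern from the question: ^ and $ sit inside the quotes, so nothing can match.
anchored = r'"(^\s*[A-Za-z0-9]+(?:\s+[A-Za-z0-9]+)*\s*$)"'
# Same pattern with the anchors removed (an illustrative variant, not a full solution).
unanchored = r'"(\s*[A-Za-z0-9]+(?:\s+[A-Za-z0-9]+)*\s*)"'

anchored_matches = re.findall(anchored, s)
unanchored_matches = re.findall(unanchored, s)

print(anchored_matches)    # []
print(unanchored_matches)  # ['Hello world example', ' hello world']
```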

Rahul Patwa
  • What did you try so far? What were the results? – running.t Dec 16 '20 at 22:12
  • Thanks for your response. I added my naive attempts with the regexes and partial code to the question. – Rahul Patwa Dec 16 '20 at 22:39
  • Try this: `r'"(?:\s*\b\w+\b\s*){3,}"'`. – ekhumoro Dec 16 '20 at 22:52
  • Thanks for your response! This seems to work; however, it fails on this test case: s = "Some string with quotes : \"Hello \" and another quote is \" hello world \" done". It returns ['" and another quote is "'] when I do a findall, but it shouldn't return anything. I'm a little confused, since the first set of quotes ends with Hello and the second set begins with "hello world". – Rahul Patwa Dec 16 '20 at 23:04
  • @RahulPatwa You specifically asked to "check if there is **any** substring within quotes with at least 3 words (separated by spaces)". The substring `" and another quote is "` matches those requirements, so the output is correct. This shows your problem description is ambiguous and/or underspecified. To give a simpler example: how many valid substrings are there in `' " a b c " x y z " '`? It is equally correct to say any of 1, 2, 3 or 0 (depending on how you look at it). – ekhumoro Dec 17 '20 at 00:22
  • That's a good point. Yes, I definitely think the problem statement is ambiguous and doesn't include the case you mentioned. From your example the expectation is 2, so that only contiguous quoted substrings are considered legitimate. In the real world, consider a paragraph containing an article which contains some quotes from prominent personalities. The goal is to detect if there is any such quote present in the specified paragraph. Thanks again for your response! – Rahul Patwa Dec 17 '20 at 00:49
  • @RahulPatwa But that example doesn't contain any contiguous quoted substrings - so I don't understand why you'd expect two. Real-world text doesn't allow overlapping quotes. It can contain *embedded* quotes, but not using the same character pairs. So, for example, `She said, "I think he said 'No'".`; but not, `He said, "I think she said "Yes"".` - because of course the latter example is ambiguous. – ekhumoro Dec 17 '20 at 17:54
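
The ambiguity discussed in these comments can be made concrete with a short sketch. It runs the pattern suggested above on the asker's test case and compares it with one possible alternative reading, in which quote characters are paired strictly left to right before any words are counted. That pairing rule and the variable names are assumptions made for illustration, not something the thread settles on.

```python
import re

s = 'Some string with quotes : "Hello " and another quote is " hello world " done'

# Pattern suggested in the comments: any run of 3+ words between two quote characters.
any_two_quotes = re.findall(r'"(?:\s*\b\w+\b\s*){3,}"', s)

# Alternative reading (an assumption): pair quotes left to right, then count words per pair.
paired = re.findall(r'"([^"]*)"', s)
paired_with_3_words = [q for q in paired if len(q.split()) >= 3]

print(any_two_quotes)       # ['" and another quote is "']
print(paired)               # ['Hello ', ' hello world ']
print(paired_with_3_words)  # []
```

Under the first reading the text between the closing quote of one pair and the opening quote of the next counts as quoted; under the second it does not, which is the result the asker expected.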

1 Answer

Something like this should work:

import re


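def has_quote(text, word_count_threshold=3):
    """Return True if any double-quoted substring has at least `word_count_threshold` words."""
    # The non-greedy match pairs each opening quote with the next quote character.
    quoted_string_pattern = re.compile(r'\"(.*?)\"')
    # A whitespace-separated token counts as a word only if it contains an ASCII letter.
    word_pattern = re.compile(r'[a-zA-Z]+')
    for quoted_string in quoted_string_pattern.findall(text):
        word_count = sum(bool(word_pattern.search(word)) for word in quoted_string.split())
        if word_count >= word_count_threshold:
            return True
    return False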


examples = [
    "Hello word \"Foo bat baz kay \" exit",
    " Hello hello \" world \" exit",
    "Some string with quotes : \"Hello world\" and second quote \" hello world\" done",
    "Some string with quotes : Hello world and \"second, quote, hello, world\" done",
    "Some string with quotes : Hello world and \"second _ 1233\" done",
    " Hello hello \" world bar's\" exit"
]


for text in examples:
    print('')
    print(f'text: {text!r}')
    print(f'has quote: {has_quote(text)}')
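
For the use case mentioned in the comments (dropping summarized texts that contain a quote), the same check could be wrapped in a small filter. The following is a minimal sketch, not a definitive implementation: `drop_quoted_summaries` and the sample `summaries` list are hypothetical names introduced here, and `has_quote` is restated in condensed form only so the block runs on its own.

```python
import re

QUOTED = re.compile(r'"(.*?)"')
HAS_LETTER = re.compile(r'[a-zA-Z]')


def has_quote(text, word_count_threshold=3):
    """Condensed restatement: True if any quoted span has enough letter-bearing tokens."""
    return any(
        sum(bool(HAS_LETTER.search(tok)) for tok in quoted.split()) >= word_count_threshold
        for quoted in QUOTED.findall(text)
    )


def drop_quoted_summaries(summaries):
    """Hypothetical helper: keep only the summaries without a qualifying quote."""
    return [s for s in summaries if not has_quote(s)]


# Sample inputs adapted from the examples above (illustrative only).
summaries = [
    'Hello word "Foo bat baz kay " exit',
    ' Hello hello " world " exit',
    'Hello world and "second, quote, hello, world" done',
    'Hello world and "second _ 1233" done',
]
kept = drop_quoted_summaries(summaries)
print(kept)  # [' Hello hello " world " exit', 'Hello world and "second _ 1233" done']
```

Tokens such as `_` or `1233` are not counted as words here because they contain no letters, which matches the behavior of the function above.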

Kapocsi
  • Thanks for your response. This doesn't work for the input "Some string with quotes : \"Hello world\" and second quote \" hello world\" done". Is it counting the original string within the quotes as well? – Rahul Patwa Dec 16 '20 at 22:41
  • What about a quoted substring with non-word characters? What would you like the regex to do for: `"Hello word \"get money$ race\" to exit"` – Kapocsi Dec 17 '20 at 00:00
  • Or `"Test example \"foo, bar, baz\" x y z\""` – Kapocsi Dec 17 '20 at 00:03
  • Good point, the behavior needs to stay the same with any special characters – Rahul Patwa Dec 17 '20 at 00:08
  • @RahulPatwa What do you consider to be "special characters"? Does this have a real-world application? What actual problem are you trying to solve with this regexp? – ekhumoro Dec 17 '20 at 00:30
  • @ekhumoro I'm trying to filter a summarized text obtained from some web articles and drop the ones containing quotes. So any character such as $, comma, @, -, :, symbol etc, I consider as special characters. (anything you would expect from a real world web article) Hope that clarifies. And thanks for your response! – Rahul Patwa Dec 17 '20 at 00:45
  • @RahulPatwa Try out the revised version I just posted. – Kapocsi Dec 17 '20 at 05:07
  • @Kapocsi yeah this works. Thank you very much for your help on this! – Rahul Patwa Dec 17 '20 at 21:18