Okay, I realise there are a lot of regex questions out there, but thank you for taking the time.

Edit: the code below has been updated to the solved version.

The answer at https://stackoverflow.com/a/25791942/8926366 held the solution.

I have a text file with quotes in it that I want to put into an ArrayList<String>. To do this I am using the Scanner and File classes, and I wanted to familiarise myself with regex because it seems like a really efficient way of doing it. Except that I can't seem to get it to work, of course!

I managed to cobble together the following regex, courtesy of guides and other people's solutions, of which I understand about 85%:

(?<=(["']\b))(?:(?=(\\?))\2.)*?(?=\1)

Here is how I understand it:

(?<=       # positive lookbehind group1
  (        # for this new group group2
   ["']    # the characters I am looking for
   \b      # word boundary anchor
  )        # end group2
)          # end group1
(?:        # non-capturing group3
  (?=      # lookahead group4
    (\\?)  # I still have no idea what this means exactly
  )        # end group 4
  \2.      # matching the contents of the 2nd group in the expression, then any one character
)          # end group3
*?         # lazy 
(?=\1)     # look ahead for group 1
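For what it's worth, here is a minimal sketch of what the `(?=(\\?))\N.` piece appears to do, at the regex level rather than the Java string level. The lookahead captures an optional backslash, the backreference consumes it, and `.` consumes the next character, so an escaped quote such as `\"` is stepped over as a pair. The class name `EscapedQuoteDemo`, the helper `firstQuoted`, and the sample input are assumptions made up for illustration; the first pattern is the double-quote-only one from this post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapedQuoteDemo {
    // Regex-level pattern: (["])((?:(?=(\\?))\3.)*?)\1
    // Group 1 is the opening quote, group 2 the quoted text, group 3 an optional backslash.
    static final Pattern ESCAPE_AWARE =
            Pattern.compile("([\"])((?:(?=(\\\\?))\\3.)*?)\\1");

    // Plain lazy match between two double quotes, for comparison.
    static final Pattern NAIVE = Pattern.compile("\"(.*?)\"");

    // Returns the requested group of the first match, or null if nothing matches.
    static String firstQuoted(Pattern pattern, int group, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        // The actual text is: "say \"hi\" now" end
        String input = "\"say \\\"hi\\\" now\" end";

        // Escaped quotes are skipped, so the match runs to the real closing quote.
        System.out.println(firstQuoted(ESCAPE_AWARE, 2, input)); // say \"hi\" now

        // The naive pattern stops at the first quote it sees, escaped or not.
        System.out.println(firstQuoted(NAIVE, 1, input));        // say \
    }
}
```

If the input never contains backslash-escaped quotes, the two patterns should capture the same text, which fits the comment below that the construct is roughly equivalent to `.*?`.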

I will now confirm it does not work haha

This, however, works (sort of: I removed ' from ["'] because of my French keyboard. It would take too long to separate commas from French quotation marks, and it's not that big a deal in this case):

([\"])((?:(?=(\\?))\3.)*?)\1

with input:

"Two things are infinite: the universe and human stupidity; and I'm not sure about the universe.”

"He who thinks great thoughts, often makes great errors” – Martin Heidegger

it gives:

Two things are infinite: the universe and human stupidity; and I'm not sure about the universe.

He who thinks great thoughts, often makes great errors

For anyone confused about why their regex isn't working on a .txt file: try using Notepad++ or a similar editor to replace all the various possible quote characters (make sure to check both the opening and the closing characters!) with one kind of quote.
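The same normalisation could also be done in code before matching. The sketch below is only an illustration: the class name `QuoteNormalizer` and the list of characters it replaces (curly double quotes, the low-9 quote, and French guillemets) are assumptions, not an exhaustive set.

```java
import java.util.regex.Pattern;

public class QuoteNormalizer {
    // Assumed set of "fancy" double-quote characters:
    // U+201C, U+201D (curly quotes), U+201E (low-9 quote), U+00AB, U+00BB (guillemets).
    private static final Pattern FANCY_QUOTES =
            Pattern.compile("[\u201C\u201D\u201E\u00AB\u00BB]");

    // Replaces every fancy double quote with a plain ASCII double quote.
    static String normalize(String line) {
        return FANCY_QUOTES.matcher(line).replaceAll("\"");
    }

    public static void main(String[] args) {
        // Opening quote is a straight quote, closing quote is a curly one.
        String line = "\"He who thinks great thoughts, often makes great errors\u201D";
        System.out.println(normalize(line));
    }
}
```

Calling `normalize` on each line before handing it to the matcher means a pattern that only looks for `"` can find both ends of the quote. Apostrophes are left untouched.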

Here is the method (which works wonderfully now):


  import java.io.File;
  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.Scanner;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class WitticismFileParser {

      ArrayList<String> witticisms;
      Scanner scan;
      // Group 1: opening quote; group 2: the quoted text; \1: the matching closing quote.
      String regex = "([\"])((?:(?=(\\\\?))\\3.)*?)\\1";
      // Earlier attempt: "(?s)([\"])((?<quotedText>(?=(\\\\?))\\3.)*?)(?<[\"])"

      public ArrayList<String> parse(String FILE_PATH) {

          witticisms = new ArrayList<>();
          Pattern pattern = Pattern.compile(regex);

          try {
              File txt = new File(FILE_PATH);
              scan = new Scanner(txt);
              String line = "";
              Matcher matcher = pattern.matcher(line);

              while (scan.hasNextLine()) {
                  line = scan.nextLine();
                  matcher = matcher.reset(line); // reuse one Matcher for every line

                  // Only the first quoted span on each line is collected.
                  if (matcher.find()) {
                      line = matcher.group(2);
                      witticisms.add(line);
                      System.out.println(line);
                  }
              }

          } catch (IOException e) {
              System.err.println("IO Exception- " + e.getMessage());
              e.printStackTrace();

          } catch (Exception e) {
              System.err.println("Exception- " + e.getMessage());
              e.printStackTrace();
          } finally {
              if (scan != null)
                  scan.close();
          }

          return witticisms;
      }

  }
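A possible variation, sketched under the assumption that a line may hold more than one quote: loop on `find()` instead of testing it once, and read the file with `Files.readAllLines` so no Scanner has to be closed by hand. The names `AllQuotesPerLine`, `extractAll`, and `parseFile` are hypothetical, and UTF-8 is an assumed encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AllQuotesPerLine {
    // Same pattern as in the method above.
    private static final Pattern QUOTED =
            Pattern.compile("([\"])((?:(?=(\\\\?))\\3.)*?)\\1");

    // Collects every quoted span on a single line, not just the first one.
    static List<String> extractAll(String line) {
        List<String> found = new ArrayList<>();
        Matcher m = QUOTED.matcher(line);
        while (m.find()) {          // each call resumes after the previous match
            found.add(m.group(2));
        }
        return found;
    }

    // Reads the whole file (assumed to be UTF-8) and gathers the quotes from every line.
    static List<String> parseFile(String filePath) throws IOException {
        List<String> all = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get(filePath), StandardCharsets.UTF_8)) {
            all.addAll(extractAll(line));
        }
        return all;
    }

    public static void main(String[] args) {
        System.out.println(extractAll("\"one\" and then \"two\"")); // [one, two]
    }
}
```

Reading all lines at once is fine for a small quotes file. For very large files the Scanner loop above avoids holding everything in memory.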

Leaving my troubleshooting notes here:

When I just make it print each line directly as the scanner gets it, I see the input text is as expected. I made sure to reformat the .txt so that all the quotation marks were the same too.

Anyway, thank you for any help with this. I am getting a horrible headache from reading regex documentation.

Thanks to anyone who answered!!

K_wrecks
  • `(\?)` matches a single question mark as a captured group. `\` is used to escape `?`, as `?` is a regex token – Matt.G Apr 04 '19 at 20:47
  • I'm not sure I fully understand your problem, but let's try one solution: the find() method advances the matcher past each matched part of the string. In my experience, you should check whether you have called find() in your watched variables. If so, the watched variable moves the cursor forward and you would then hit the exception. – Amin Heydari Alashti Apr 04 '19 at 20:52
  • As @Matt.G indicated, (\\?) matches and captures a single literal question mark. The \\ is there to change the literal meaning of \; otherwise it would capture a literal backslash along with the literal question mark. And yes, when you're trying to learn regex, it is liable to make your head spin, but it is indeed a powerful tool. – SanV Apr 04 '19 at 21:29
  • FYI: be cautious about browser incompatibilities. e.g., positive lookbehind does not work for Java/JavaScript on Microsoft Edge but works on Google Chrome. – SanV Apr 04 '19 at 22:30
  • Ok, some info is here. This `(?:(?=(\\?))\1.)*?`, sandwiched between anything, matches everything. It's roughly equivalent to `.*?`. What is your question? –  Apr 04 '19 at 23:08

1 Answer

Why not simply use the regex below?

"(?<textBetweenQuotes>[\s\S]*?)"

" matches the character " literally.
(?<textBetweenQuotes> is the start of a named capture group.
[\s\S]*? matches any character, including newlines, zero or more times, but lazily (so it stops as soon as possible).
) is the end of the named capture group.
" matches the character " literally.

If you cannot use named capture groups in your program, you can always use the regex below without one and strip the surrounding quotes from each match.

"[\s\S]*?"
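In Java, the named-group version could be used roughly as in the sketch below. The class name `NamedGroupDemo`, the method `extract`, and the sample text are made up for illustration. Note that, unlike the pattern in the question, this one does not treat `\"` as an escaped quote.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedGroupDemo {
    // Regex-level pattern: "(?<textBetweenQuotes>[\s\S]*?)"
    private static final Pattern QUOTED =
            Pattern.compile("\"(?<textBetweenQuotes>[\\s\\S]*?)\"");

    // Returns the text between each pair of double quotes, in order of appearance.
    static List<String> extract(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = QUOTED.matcher(text);
        while (m.find()) {
            result.add(m.group("textBetweenQuotes"));
        }
        return result;
    }

    public static void main(String[] args) {
        // [\s\S] also matches line breaks, so a quote may span several lines.
        String text = "\"first line\nsecond line\" and \"another\"";
        for (String quote : extract(text)) {
            System.out.println("[" + quote + "]");
        }
    }
}
```

Because `[\s\S]` matches newlines, this pattern is best applied to the whole file content as one string rather than line by line, if quotes are expected to wrap across lines.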
Vqf5mG96cSTT