I am new to Perl and regex, and I need to extract all the strings from a text file. A string is anything wrapped in double quotes.

Examples of strings:

"This is string"
"1!=2"
"This is \"string\""
"string1"."string2"
"S
t
r
i
n
g"

The code:

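open(my $fh, '<', 'text.txt') or die "Cannot open text.txt: $!";

local $/;            # slurp mode: read the whole file in one go
my $text = <$fh>;
close($fh);

my @strings  = $text =~ m/".*"/g;     # greedy: returns the outermost "..." in example 4
my @strings2 = $text =~ m/"[^"]*"/g;  # fixes the above, but stops at the escaped quotes in example 3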

Edit: I want to get (1) a double quote, followed by (2) zero or more occurrences of either a non-double-quote-non-backslash or a backslash followed by any character, followed by (3) a double quote. In other words, (2) can contain anything except an unescaped ".

I tried the regex provided below, m/"(?:\.|[^"])*"/g. However, when there is a line with "string1".string2."string3", it returns "string1" string2 "string3".

Is there any way to skip the previously matched word?

Can anyone please help?

user2763829
  • Note that what you want is (1) a double quote, followed by (2) zero or more occurrences of either a non-double-quote-non-backslash or a backslash followed by any character, followed by (3) a double quote. – Jonathan Leffler Mar 31 '14 at 06:41
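
The three-part description in the comment above can be written out as a pattern almost word for word. This is only a sketch: the variable names and the sample line are made up for illustration, and the /x and /s modifiers are my own choice (for readability, and so that "any character" after a backslash includes a newline).

```perl
use strict;
use warnings;

# Pattern built from the three parts of the description.
my $string_re = qr/
    "                 # (1) opening double quote
    (?:
        [^"\\]        #     a character that is neither a quote nor a backslash
      | \\.           #     or a backslash followed by any character
    )*                # (2) repeated zero or more times
    "                 # (3) closing double quote
/xs;

# Hypothetical input line; single quotes keep the backslashes literal.
my $line    = 'print "This is \"string\"" . "\n";';
my @strings = $line =~ /$string_re/g;

print "$_\n" for @strings;
```

Because the character class excludes the backslash, a backslash can only ever be consumed together with the character that follows it.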

1 Answer

One possible approach:

/"(?:\\.|[^"])*"/

... that reads as:

  • match double quotation mark,
  • followed by any number of...

    --- either any escaped character (any character preceded by \)

    --- or any character that's not a double quotation mark

  • followed by double quotation mark

The key trick here is the alternation: the \\. branch is tried first, so an escaped double quotation mark is consumed together with its backslash and is never mistaken for the closing quote.
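
To see the pattern at work, here is a small sketch that runs it over text modelled on the examples in the question. The here-document stands in for the contents of text.txt and is hypothetical sample data, not the asker's real file.

```perl
use strict;
use warnings;

# Sample input standing in for the slurped file (hypothetical data).
# The quoted 'END' keeps backslashes literal.
my $text = <<'END';
x = "This is string"
y = "1!=2"
z = "This is \"string\""
w = "string1"."string2"
v = "S
t
r"
END

# The pattern from the answer. [^"] also matches newlines, so the
# multi-line string is found without the /s modifier.
my @strings = $text =~ /"(?:\\.|[^"])*"/g;

printf "%d strings found\n", scalar @strings;
print "[$_]\n" for @strings;
```

In list context with /g and no capturing groups, the match returns every whole match, which is why the non-capturing (?:...) form matters here.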

Rakesh KR
raina77ow
  • Thanks for the explanation, it is working now! I was not aware of ?: in Perl regex before this. Can anyone explain what it is or direct me to a source? Thank you! – user2763829 Mar 31 '14 at 08:36
  • It's used to mark a so-called `non-capturing group`: when you just need to group a series of expressions in a pattern, but not to store the result of this grouping. I'd recommend checking [this thread](http://stackoverflow.com/questions/3512471/non-capturing-group) for a detailed explanation. – raina77ow Mar 31 '14 at 08:43
  • Thanks for the information. I have a much better idea of how the non-capturing group works now. I found that the above does not work if there is `"string1".$string2."string3"`: it will return `"string1" string2 string3`. Is it possible to skip the previously matched character in Perl regex? – user2763829 Mar 31 '14 at 08:58
  • In `"string1".$string2."string3"` both `"string1"` and `"string3"` are matched, if you use the pattern correctly. [Proof](http://regex101.com/r/oB0aD7). – raina77ow Mar 31 '14 at 10:47
  • Thanks raina77ow!! A typing error in the regex causes lots of unreasonable results – user2763829 Mar 31 '14 at 11:50
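
For anyone following the comments, the two points discussed here (what the non-capturing group changes, and the `"string1".$string2."string3"` case) can be checked with a short sketch. The variable names are made up for illustration. The capturing variant is not from the thread; it is shown only for contrast.

```perl
use strict;
use warnings;

# Single quotes, so $string2 is literal text and not interpolated.
my $code = '"string1".$string2."string3"';

# Non-capturing group: in list context, //g returns each whole match.
my @whole = $code =~ /"(?:\\.|[^"])*"/g;

# Capturing group: //g returns the captured text instead, which here is
# only the last character consumed by the repeated group in each match.
my @captured = $code =~ /"(\\.|[^"])*"/g;

print "whole:    @whole\n";
print "captured: @captured\n";
```

With the pattern typed exactly as in the answer, only `"string1"` and `"string3"` come back, and `$string2` is skipped because it is not inside quotes.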