5

I was trying to find the fancy quote “ in a string using the following Perl regular expression, but the match returns false.

$text = "NBN “a joint venture with Telstra”";

if ($text =~ m/“/)
{
  print "found";
}

I also tried using the "\x93" ASCII code, but it still does not work. I am stuck here.

Any help is appreciated.

Regards, Allen

Allen Qin
  • I tested your regexp at http://www.regextester.com/ and it worked, but it only found the first quote. Regarding your question, I have not written anything in Perl, but judging from other Perl regexps I have seen, e.g. `$vmsn =~ /(.+\.vmsn)/xm;`, your regexp would look like `/“/m`. – MPękalski Apr 04 '11 at 11:40
  • The `“` (U+201C) is not in the US-ASCII character set. – Gumbo Apr 04 '11 at 11:46
  • @MPękalski, you are right. I tested the regex using an evaluation tool and it worked. But it just didn't work within the Perl script. – Allen Qin Apr 04 '11 at 12:09

3 Answers

4

Depending on the encoding of the string you are trying to match, you might need to do different things. See The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

If the input string is encoded in UTF-8, then you need to specify that encoding in your Perl script. One way to do that is with `use encoding 'UTF-8'`.

You can also specify `use utf8` if you want the encoding of the script itself to be UTF-8. You are probably better off, though, knowing the code point of the character you are checking for and specifying it directly:

use utf8;
use encoding 'UTF-8';

$text = "NBN “a joint venture with Telstra”"; # Make sure to quote this string properly

if ($text =~ m/\N{U+201C}/) # “ is the same as U+201C LEFT DOUBLE QUOTATION MARK
{
  print "found";
}
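
As the comments on this answer point out, `use encoding` can cause problems of its own, and the pragma was deprecated in later Perl releases. A minimal sketch of the alternative, assuming the input arrives as raw UTF-8 bytes, is to decode it explicitly with the core `Encode` module and match on the code point. The helper name `has_left_double_quote` is hypothetical and only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the original answer): takes raw UTF-8
# bytes and returns 1 if the decoded text contains U+201C
# LEFT DOUBLE QUOTATION MARK, 0 otherwise.
sub has_left_double_quote {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);    # bytes -> Perl characters
    return $text =~ /\x{201C}/ ? 1 : 0;
}

# "\xE2\x80\x9C" and "\xE2\x80\x9D" are the UTF-8 byte sequences for
# U+201C and U+201D; writing them as escapes keeps this file pure ASCII.
my $bytes = "NBN \xE2\x80\x9Ca joint venture with Telstra\xE2\x80\x9D";
print "found\n" if has_left_double_quote($bytes);
```

Because the quotes are written as escapes and the input is decoded by hand, this sketch needs neither `use utf8` nor `use encoding`.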
Avi
  • @Avi: close the curly bracket `/\N{U+201C}/` – Toto Apr 04 '11 at 11:48
  • Thank you Avi. It does work! It took me nearly an hour and I still could not figure out what's wrong. Your help is greatly appreciated. I will certainly check the article you recommended - no excuses! – Allen Qin Apr 04 '11 at 12:07
  • If you "use utf8" make sure your source code is actually utf8 ;) – Øyvind Skaar Apr 04 '11 at 12:31
  • You don't need `\N{U+201C}`: `if ($text =~ m/“/)` works if you have `use utf8`. –  Apr 05 '11 at 01:13
  • I will take this opportunity to plug perl5i which turns on utf8 and avoids this sort of user head scratching. :) – Schwern Apr 05 '11 at 05:18
  • @Master Of Disaster: that is true. I was recommending the `\N{U+201C}` syntax rather than `use utf8`, since it allows you to be much more precise and doesn't require the source file to be in a non-ASCII encoding, which can be problematic on some systems. – Avi Apr 05 '11 at 08:18
  • @daxim: I'm not sure why you edited my answer without leaving a comment. It is true that there may be better ways to do this than `use encoding` (namely, specifying the encoding of the file on `open`), but you didn't add them to the answer. When all input and output should be in the same encoding, and to keep the example simple, I would prefer using the `encoding` pragma. – Avi Apr 05 '11 at 08:21
  • http://stackoverflow.com/questions/492838/why-do-my-perl-tests-fail-with-use-encoding-utf8 – daxim Apr 05 '11 at 13:52
  • @daxim: I saw that question, and I understand the problems that might arise from `use encoding`. It still does provide a quick fix here that is not provided by your edit to my answer. Please refrain from editing my answer with fixes to the code that don't work. You are welcome to add an answer of your own, or add a comment explaining all the problems with mine. – Avi Apr 05 '11 at 16:17
1

See the "Demoroniser" and, for your specific problem, the discussion of just the "smart" quotes part of it on PerlMonks: Re^3: Reg Ex to strip MS smart quotes.

This advice assumes, perhaps incorrectly, that your database's "fancy quotes" have come from some piece of Microsoft software producing Windows-1252 encoded text. If you've got UTF-8 instead, Avi has already pointed you in the right direction.
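
For the Windows-1252 case, a minimal sketch (not the Demoroniser itself, and not the PerlMonks code) could decode the bytes with the core `Encode` module and then replace the curly quotes. In Windows-1252 the byte `"\x93"` tried in the question is the left double quote (U+201C) and `"\x94"` is the right one (U+201D). The helper name `straighten_cp1252_quotes` is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode Windows-1252 bytes into Perl characters,
# then replace curly quotes with their plain ASCII counterparts.
sub straighten_cp1252_quotes {
    my ($bytes) = @_;
    my $text = decode('cp1252', $bytes);    # 0x93 -> U+201C, 0x94 -> U+201D
    $text =~ s/[\x{201C}\x{201D}]/"/g;      # double curly quotes -> "
    $text =~ s/[\x{2018}\x{2019}]/'/g;      # single curly quotes -> '
    return $text;
}

# Assumed input: the question's string as Windows-1252 bytes.
my $bytes = "NBN \x93a joint venture with Telstra\x94";
print straighten_cp1252_quotes($bytes), "\n";
```

Decoding first means the regexes work on characters rather than on raw bytes, so the same substitutions apply regardless of which byte encoding the text came from.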

daxim
bigiain
0

I recently came across some smart quotes which I couldn't eliminate using only the regexes mentioned in the posts above. I had to use a trick which I found entirely by trial and error:

  • First, convert to ISO-8859-1 using `Encode::encode`.
  • Next, convert the fancy quotes (using the 4 regular expressions mentioned above).
  • Next, convert the string to UTF-8 using `Encode::encode` (I needed this since I was using the string in an iOS app and reading it from a SQLite database using `NSString stringWithUTF8String:`; this may not be relevant to you).
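
A related approach, sketched here under stated assumptions rather than as a reproduction of the exact steps above: when it is unclear whether the bytes are UTF-8 or Windows-1252, try a strict UTF-8 decode first, fall back to Windows-1252, replace the quotes, and encode the result as UTF-8. The helper name `quotes_to_ascii_utf8` and the choice of fallback encoding are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: returns UTF-8 bytes with curly quotes replaced
# by plain ASCII quotes. Assumes the input is either valid UTF-8 or
# Windows-1252.
sub quotes_to_ascii_utf8 {
    my ($bytes) = @_;

    # With a CHECK argument, decode() may modify its source, so work on a copy.
    my $copy = $bytes;
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };

    # Not valid UTF-8: fall back to Windows-1252 (an assumption).
    $text = decode('cp1252', $bytes) unless defined $text;

    $text =~ s/[\x{201C}\x{201D}]/"/g;    # double curly quotes -> "
    $text =~ s/[\x{2018}\x{2019}]/'/g;    # single curly quotes -> '

    return encode('UTF-8', $text);        # characters -> UTF-8 bytes
}

print quotes_to_ascii_utf8("NBN \x93a joint venture with Telstra\x94"), "\n";
```

The returned value is a byte string, which is the form that an API expecting UTF-8 input would need.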

Hope this helps someone.

Samik R