
I retrieve data from the net containing real geodesic expressions, that is, degrees, minutes and seconds marked with the Unicode symbols U+00B0, U+2032 and U+2033 (Degree Sign, Prime and Double Prime). Example:

my $Lat = "48° 25′ 43″ N";

My objective is to convert such an expression first to decimal degrees and then to radians, for use in a Perl module I am writing that implements the Vincenty inverse formula to calculate ellipsoidal great-circle distances. All my code objectives have been met with pseudo geodesics, such as "48:25:43 N", but that is hand-entered test data, not real-world data. I am struggling to craft a regular expression that splits the real data the way I now split the pseudo data, as in:

my ($deg, $min, $sec, $dir) = split(/[\s:]+/, $_[0], 4); # this works

I have tried many regular expressions including

/[°′″\s]+/ and
/[\x{0B00}\x{2032}\x{2033}\s]/+

all with dismal results, such as $deg = "48?", $min = "?", $sec = "25′43″ N" and $dir = undef. I've encapsulated the code inside braces {} and included within that scope use utf8; and use feature 'unicode_strings'; all with nada results.

input data example:

my $Lat = "48° 25′ 43″ N"; 

Expected output:

$deg = 48, $min = 25, $sec = 43 and $dir = "N"
Mustofa Rizwan
perlboy
  • Too broad! You should focus on the expected input and output. Please provide some sample input and the expected output. – Mustofa Rizwan Jan 31 '18 at 05:37
  • I did supply an input data example: my $Lat = "48° 25′ 43″ N"; What I want is $deg = 48, $min = 25, $sec = 43 and $dir = "N". The problem is those Unicode symbols Degree, Prime and Double Prime, and including them in a regular expression used in Perl split. I can't see how I can make my question any clearer. – perlboy Jan 31 '18 at 05:48
  • Could there be fractional coordinates as in `$Lat = "48° 25′ 43.5″ N"`? – Tim Pietzcker Jan 31 '18 at 06:41
  • Yes there can and indeed there are fractional seconds, and Rizwan's solution works correctly with either and with or without embedded whitespace, since that is another variation in the data. Vincenty inverse is super accurate and can utilize fractional seconds if present in the data. And if I need to convert back to DMS from radians I can use non Unicode symbols. Again, thank you both for solution/insights. – perlboy Jan 31 '18 at 07:11

2 Answers


You may try this regex to split the string:

[^\dNSEW.]+

Sample source:

my $str = '48° 25′ 43″ N';
my $regex = qr/[^\dNSEW.]+/p;
my ($deg, $min, $sec, $dir) = split $regex, $str;
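
Since the stated goal is decimal degrees and then radians, the split above can be carried one step further. The following is only a minimal sketch: the helper names `dms_to_degrees` and `degrees_to_radians` are hypothetical, the South/West-negative sign convention is an assumption, and the file is assumed to be saved as UTF-8 so that `use utf8;` applies to the literal symbols.

```perl
use strict;
use warnings;
use utf8;    # the source file contains literal °, ′ and ″ characters

# Hypothetical helper: split a DMS string and return signed decimal degrees.
sub dms_to_degrees {
    my ($dms) = @_;
    # Split on runs of anything that is not a digit, a hemisphere letter or a dot.
    my ($deg, $min, $sec, $dir) = split /[^\dNSEW.]+/, $dms;
    my $degrees = $deg + $min / 60 + $sec / 3600;
    # Assumed convention: South and West are negative.
    my $sign = (defined $dir && $dir =~ /^[SW]$/) ? -1 : 1;
    return $sign * $degrees;
}

# Hypothetical helper: decimal degrees to radians.
sub degrees_to_radians {
    my ($degrees) = @_;
    my $pi = 4 * atan2(1, 1);
    return $degrees * $pi / 180;
}

my $lat_deg = dms_to_degrees('48° 25′ 43″ N');
my $lat_rad = degrees_to_radians($lat_deg);
printf "%.6f degrees, %.6f radians\n", $lat_deg, $lat_rad;
```

Because the character class keeps the dot, fractional seconds such as `43.5″` pass through the same code unchanged.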
Mustofa Rizwan
  • Thank you. That works perfectly. I was just beginning to experiment with [\D\s]+ and was getting everything except the $dir token. Thanks again. – perlboy Jan 31 '18 at 06:46
  • One question, if I may. What is the purpose of that "." after the "W" in the character class, since the expression works correctly with the dot or without? – perlboy Jan 31 '18 at 06:58
  • That is actually an afterthought. If the degrees or minutes can be fractional, it might come in handy. I don't think I have seen data like that yet, but I saw the comment and thought about putting it in. You may remove the . if you don't need it. – Mustofa Rizwan Jan 31 '18 at 06:59
  • No, I have and do need it. You hit it out of the park because fractional seconds are becoming increasingly common. – perlboy Jan 31 '18 at 07:16

My bad! Pilot error!

The original regex I posted, and was struggling with was:

/[\x{0B00}\x{2032}\x{2033}\s]/+

The error(s) are where I placed the '+' character and the hex value of the degree character. That regex should have been written:

/[\x{B0}\x{2032}\x{2033}\s]+/
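
The degree-sign mistake is easy to check: `\x{0B00}` names code point U+0B00, not U+00B0. A small sketch, assuming the file is saved as UTF-8 (the helper name `split_dms` is hypothetical), prints the code point Perl actually sees and confirms that the corrected class splits the sample:

```perl
use strict;
use warnings;
use utf8;    # literal ° below is read as one character, not two bytes

# Hypothetical helper wrapping the corrected regex.
sub split_dms {
    my ($dms) = @_;
    return split /[\x{B0}\x{2032}\x{2033}\s]+/, $dms;
}

# Code point of the degree sign as Perl sees it.
printf "U+%04X\n", ord '°';                      # prints U+00B0

# The '+' sits inside the delimiters, so "° " counts as one separator.
print join('|', split_dms('48° 25′ 43″ N')), "\n";   # prints 48|25|43|N
```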

The answer from @Rizwan was illuminating but I was determined to make regular expressions in Perl work with Unicode, so I persevered, and now this is my solution:

use utf8;
no warnings;

my $dms = "48° 25′ 43.314560″ N";
my $regex = qr/[\x{B0}\x{2032}\x{2033}:\s]+/p; # some geodesics do use ':'
my ($deg, $min, $sec, $dir) = split $regex, $dms;
printf("\$deg: %s, \$min: %s, \$sec: %s, \$dir: %s\n",
       $deg, $min, $sec, $dir);

Like it or not, Unicode is the future.

perlboy
  • Why `no warnings`? And why is there a `/p` flag on your regex? – melpomene Feb 02 '18 at 01:56
  • Here is what perlre says about /p: "Preserve the string matched such that ${^PREMATCH} , ${^MATCH} , and ${^POSTMATCH} are available for use after matching." I think that will eventually be deprecated but don't know when. I'm using v5.18.2 on macOS 10.13.3. I often use those variables when I execute with the debugger. Also, my standard first line shebang is: #!/usr/bin/perl -w. Always! That -w causes perl to announce whenever I attempt to print a double-wide character, which 'use utf8;' will enable, so 'no warnings;' suppresses them. The regex will work just fine without the /p. – perlboy Feb 02 '18 at 20:59
  • I know what `/p` does, but it makes no sense in `split` (IMHO it makes no sense on `qr//` either). Don't use `-w`; it's been obsolete since 2000 with the introduction of `use warnings;`. Also, don't disable all warnings; just disable a specific category. Also, don't just suppress warnings; fix your broken code. – melpomene Feb 08 '18 at 22:12
  • If your program "works perfectly" but produces "Wide character" warnings, it works by accident, sort of. (It would probably fail if you run it on an EBCDIC platform or if Perl decides to change the internal representation of Unicode strings in the future.) The proper solution is to tell Perl what encoding to use on your filehandles (e.g. `binmode STDOUT, ':encoding(UTF-8)';`). – melpomene Feb 10 '18 at 11:39
  • With split you can use `(` `)` to get captured strings, but that has nothing to do with `/p`. `perlre` doesn't explain why you put something that has no effect in your code (especially sample code on a SO answer, which tends to be copied and used by other people). The "fix your broken code" comment was referring to the "Wide character" warning, which does indicate real encoding issues. In general, blindly silencing all warnings is a bad idea. Also, had you posted *less* code (no `/p` and `no warnings;`), I wouldn't've said anything. – melpomene Feb 10 '18 at 21:01
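
The advice in the comments above can be sketched as follows. This is one possible arrangement, not the answerer's code: it drops `/p` and `no warnings;`, uses `use warnings;` in place of `-w`, and declares the output encoding so that printing the DMS symbols does not raise "Wide character". The helper name `split_dms` is hypothetical, and the file is assumed to be saved as UTF-8.

```perl
use strict;
use warnings;    # lexical warnings in place of the -w switch
use utf8;        # the source file contains literal °, ′ and ″ characters

# Tell Perl how to encode characters written to STDOUT.
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: split on the DMS symbols, ':' and whitespace.
sub split_dms {
    my ($dms) = @_;
    return split /[\x{B0}\x{2032}\x{2033}:\s]+/, $dms;
}

my $dms = '48° 25′ 43.314560″ N';
my ($deg, $min, $sec, $dir) = split_dms($dms);

# $dms contains wide characters; the encoding layer handles them.
print "$dms => deg=$deg min=$min sec=$sec dir=$dir\n";
```

With the encoding layer in place there is no warning left to silence, so warnings can stay enabled for the whole script.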