11

I'm working on regular expressions homework where one question is:

Using language reference manuals online determine the regular expressions for integer numeric constants and identifiers for Java, Python, Perl, and C.

I don't need help on the regular expression, I just have no idea what identifiers look like in Perl. I found pages describing valid identifiers for C, Python and Java, but I can't find anything about Perl.

EDIT: To clarify, finding the documentation was meant to be easy (like doing a Google search for python identifiers). I'm not taking a class in "doing Google searches".

tchrist
Brendan Long
  • In regard to your clarification: Did you search for "perl identifiers?" Perhaps try it quoted? The answer results weren’t at the top but they weren’t far down or hard to recognize. The answer from tchrist certainly made it worth asking though. :) – Ashley Jan 26 '11 at 02:01
  • `man perlvar` should suffice then, eh? – tchrist Jan 26 '11 at 02:02
  • @Ashley: http://www.google.com/search?q=%22perl+identifiers%22 I don't see anything that looks useful in that. Lots of examples of Perl, and an incomplete description of variable names (basically they start with `$`). I didn't actually expect it to be this difficult to answer. Hopefully it'll help people in the future though, since SO tends to show up near the top in Google searches. – Brendan Long Jan 26 '11 at 02:10
  • 1
    This is helpful: http://perldoc.perl.org/perldata.html#Variable-names – Brendan Long Jan 26 '11 at 02:56
  • 1
    There’s a difference between a variable and an identifier. Variables in Perl start with one of three sigils — `$`, `@`, or `%` — with the first two optionally taking subscripts. Identifiers may also start with `*` or `&`, but those are not variables. Also, things like subroutines, formats, and file and directory handles are identifiers but not variables. When you say `print STDERR "oops!\n"` in dative form or its equivalent `STDERR‑>print("oops")`, then both `print` and `STDERR` are idents but have no sigil. (Sigils are one of `[$@%&*]`.) – tchrist Jan 26 '11 at 02:56
  • Yeah I know it's not exhaustive, I just thought it might be helpful for anyone else looking at this in the future. – Brendan Long Jan 26 '11 at 03:31
  • @Brendan Long - as I said, several results on the first page answer the question (though some links point to stolen content, so I declined to link them, even though I own the books). I think this thread has actually pushed down some of the previous answers already, but again, there is some great info here now, so it's great to have this in the pile too. – Ashley Jan 26 '11 at 03:44

4 Answers

33

Perl Integer Constants

Integer constants in Perl can be

  • in base 16 if they start with ^0x
  • in base 2 if they start with ^0b
  • in base 8 if they start with 0
  • otherwise they are in base 10.

Following that leader is any number of valid digits in that base and also optional underscores.

Note that digit does not mean \p{POSIX_Digit}; it means \p{Decimal_Number}, which is really quite different, you know.
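For anyone who wants to see the four leader rules above as a single pattern, here is a minimal Python sketch. It assumes ASCII digits only (so it does not model the \p{Decimal_Number} point) and lowercase `0x`/`0b` leaders; the names `PERL_INT` and `is_perl_int` are made up for illustration and are not from any Perl documentation.

```python
import re

# Sketch of the leader rules, restricted to ASCII digits (an assumption).
PERL_INT = re.compile(
    r"""
      0x[0-9A-Fa-f_]+    # base 16: leader 0x, then hex digits and underscores
    | 0b[01_]+           # base 2: leader 0b, then binary digits and underscores
    | 0[0-7_]*           # base 8: leading 0 (a lone 0 also lands here)
    | [1-9][0-9_]*       # base 10: everything else
    """,
    re.VERBOSE,
)


def is_perl_int(text):
    """Return True if the whole string fits the sketched pattern."""
    return PERL_INT.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["0x1F", "0b1010", "0755", "1_000_000", "08", "-3"]:
        print(repr(sample), is_perl_int(sample))
```

The sketch rejects `-3` on purpose, because the sign is treated as an operator rather than as part of the constant.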

Please note that any leading minus sign is not part of the integer constant, which is easily proven by:

$ perl -MO=Concise,-exec -le '$x = -3**$y'
1  <0> enter 
2  <;> nextstate(main 1 -e:1) v:{
3  <$> const(IV 3) s
4  <$> gvsv(*y) s
5  <2> pow[t1] sK/2
6  <1> negate[t2] sK/1
7  <$> gvsv(*x) s
8  <2> sassign vKS/2
9  <@> leave[1 ref] vKP/REFC
-e syntax OK

See the const 3, and the negate op-code much later on? The constant is 3, not -3, and the negation is applied only after the pow, which also tells you that ** binds more tightly than unary minus.
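Since the homework also covers Python, the same kind of check can be run there with the standard `ast` module. This is only a sketch for comparison, assuming Python 3.8 or later (where literals parse as `ast.Constant`); the helper name `describe` is hypothetical.

```python
import ast


def describe(source):
    """Parse one expression and report the outer node, the inner node,
    and the value of the constant on the left of the inner node."""
    expr = ast.parse(source, mode="eval").body
    inner = expr.operand          # what the unary minus is applied to
    return (type(expr).__name__, type(inner).__name__, inner.left.value)


if __name__ == "__main__":
    # The minus shows up as a separate unary operator around the power,
    # and the literal itself is the positive 3.
    print(describe("-3**y"))
```

As in the Perl op tree, the literal is 3 and the negation wraps the whole power expression.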

Perl Identifiers

Identifiers specified via symbolic dereferencing have absolutely no restriction whatsoever on their names.

  • For example, 100->(200) calls the function named 100 with the arguments (100, 200).
  • For another, ${"What’s up, doc?"} refers to the scalar package variable by that name in the current package.
  • On the other hand, ${"What's up, doc?"} refers to the scalar package variable whose name is ${"s up, doc?"} and which is not in the current package, but rather in the What package. Well, unless the current package is the What package, of course. Similarly, $Who's is the $s variable in the Who package.

One can also have identifiers of the form ${^identifier}; these are not considered symbolic dereferences into the symbol table.

A single-character identifier can be a punctuation character, as in $$ or %!.

Identifiers can also be of the form $^C, which is either a control character or a circumflex followed by a non-control character.

If none of those things is true, a (non–fully qualified) identifier follows the Unicode rules related to characters with the properties ID_Start followed by those with the property ID_Continue. However, it overrules this in allowing all-digit identifiers and identifiers that start with (and perhaps have nothing else beyond) an underscore. You can generally pretend (but it’s really only pretending) that that is like saying \w+, where \w is as described in Annex C of UTS#18. That is, anything that has any of these:

  • the Alphabetic property — which includes far more than just Letters; it also contains various combining characters and the Letter_Number code points, plus the circled letters
  • the Decimal_Number property, which is rather more than merely [0-9]
  • Any and all characters with the Mark property, not just those marks that are deemed Other_Alphabetic
  • Any characters with the Connector_Punctuation property, of which underscore is just one such.

So either ^\d+$ or else

^[\p{Alphabetic}\p{Decimal_Number}\p{Mark}\p{Connector_Punctuation}]+$

ought to do it for the really simple ones if you don’t care to explore the intricacies of the Unicode ID_Start and ID_Continue properties. That’s how it’s really done, but I bet your instructor doesn’t know that. Perhaps one shan’t tell him, eh?
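Python's `re` module has no \p{...} properties, so a rough rendering of the character class above has to go through `unicodedata` general categories instead. This is only an approximation under stated assumptions: the Alphabetic property is stood in for by the letter categories plus Nl, so code points that are Alphabetic only via Other_Alphabetic (such as the circled letters) are missed. The names `WORDISH` and `is_simple_perl_ident` are invented for this sketch.

```python
import unicodedata

# General categories standing in for the four properties listed above.
# Assumption: Alphabetic is approximated by L* plus Nl.
WORDISH = {
    "Lu", "Ll", "Lt", "Lm", "Lo", "Nl",   # letters and letter numbers
    "Mn", "Mc", "Me",                      # Mark
    "Nd",                                  # Decimal_Number
    "Pc",                                  # Connector_Punctuation
}


def is_simple_perl_ident(name):
    """True if name is non-empty and every character is word-like."""
    return bool(name) and all(
        unicodedata.category(ch) in WORDISH for ch in name
    )


if __name__ == "__main__":
    for sample in ["foo_bar", "\u00e9", "123", "_", "What's", "a-b"]:
        print(repr(sample), is_simple_perl_ident(sample))
```

It covers only the really simple identifiers; punctuation variables, caret forms, and package qualifiers are left out.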

But you should also cover the non-simple ones I described earlier.

And we haven’t talked about packages yet.

Perl Packages in Identifiers

Beyond those simple rules, you must also consider that identifiers may be qualified with a package name, and package names themselves follow the rules of identifiers.

The package separator is either :: or ' at your whim.

You may omit the package name when it would be the first component of a fully qualified identifier, in which case the package is main. That means things like $::foo and $'foo are equivalent to $main::foo, and isn't_it() is equivalent to isn::t_it().

Finally, as a special case, a trailing double-colon (but not a single-quote) at the end of a hash name is permitted, and this then refers to the symbol table of that name.

Thus %main:: is the main symbol table, and because you can omit main, so too is %::.

Meanwhile %foo:: is the foo symbol table, as is %main::foo:: and also %::foo:: just for perversity’s sake.
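The separator and default-package rules above can be sketched in Python as a small splitter. It takes a name without its sigil, treats ' as a synonym for ::, and fills in main when the leading package is omitted. It does no validation of the components, and the function name `split_qualified` is hypothetical.

```python
def split_qualified(name):
    """Split a sigil-less Perl name into (package, identifier).

    An unqualified name yields an empty package string; a name that
    starts with the separator yields the package "main".
    """
    normalized = name.replace("'", "::")       # ' and :: are interchangeable
    package, separator, ident = normalized.rpartition("::")
    if not separator:
        return ("", ident)                      # no package qualifier at all
    return (package or "main", ident)           # omitted package means main


if __name__ == "__main__":
    for sample in ["main::foo", "::foo", "'foo", "isn't_it", "foo"]:
        print(repr(sample), split_qualified(sample))
```

With this, `::foo`, `'foo`, and `main::foo` all come out as the same pair, matching the equivalences described above.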

Summary

It’s nice to see instructors giving people non-trivial assignments. The question is whether the instructor realized it was non-trivial. Probably not.

And it’s hardly just Perl, either. Regarding the Java identifiers, did you figure out yet that the textbooks lie? Here’s the demo:

$ perl -le 'print qq(public class escape { public static void main(String argv[]) { String var_\033 = "i am escape: ^\033"; System.out.println(var_\033); }})' > escape.java
$ javac escape.java
$ java escape | cat -v
i am escape: ^[

Yes, it’s true. It is also true for many other code points, especially if you use -encoding UTF-8 on the compile line. Your job is to find the pattern that describes these startlingly unforbidden Java identifiers. Hint: make sure to include code point U+0000.

There, aren’t you glad you asked? Hope this helps. Or something. ☺

Community
tchrist
  • 5
    Note to Brendan - if you actually want to present this War and Peace sized answer to your teacher as part of the documentation showing why your Perl identifier regex takes 2 full pages, and he starts questioning the sanity of the person being cited, just tell him that the person who provided the answer could easily write a book on Perl programming. – DVK Jan 26 '11 at 01:59
  • Wow this is going to be hard. Thanks a lot though. The Java thing I read said that they can be made up of letters, numbers or basically any unicode letter. For that one I just assumed that "javaletters" was defined for me. But why is U+0000 a valid character? Just to make it hard to write a Java compiler in C? :D – Brendan Long Jan 26 '11 at 02:05
  • 5
    @Brendan: I’m sure that your instructor didn’t think it would be this hard, either. I fear that here as in so many academic pursuits, the best way to get a good mark is to give the answer that’s expected of you, not the answer that attempts to accurately model a reality that’s far more complicated than the person giving the assignment ever imagined. – tchrist Jan 26 '11 at 02:16
  • 1
    @Brendan: Regarding Java, I really have no idea. It’s something I discovered by accident. Lots of ugly non-`\w` control characters; it’s shameful. I wrote a Perl program that exhaustively tries compiling all possible code points in Java identifiers, and I’ll be damned if I can see the pattern. It is quite nonsensical, but there is just enough of a pattern (for example, code points of property `\p{Sc}`, i.e. Currency_Symbol) mixed in there to make you wonder whether someone wasn’t trying to do something deliberate. But they seem to have screwed it up royally. That’s all I know now. – tchrist Jan 26 '11 at 03:05
  • 1
    @tchrist - might be worth asking as an SO question. Someone just might know what the idea was – DVK Jan 26 '11 at 04:21
  • @DVK: the idea seems to be to only reject problematic chars (e.g: whitespace characters), but since my mind-reading abilities are not 100% accurate, I'll refrain from answering that to Tom's question at http://stackoverflow.com/questions/4838507/why-does-java-allow-control-characters-in-its-identifiers/4838947#4838947 – ninjalj Jan 29 '11 at 20:42
  • WARNING: if your identifier begins with a special character, like "#bar", `${'#bar'}` or `'#bar'->$*` isn't enough. That references "main::#bar". You have to supply a package name, like `'test::#bar'->$*` or `(__PACKAGE__ . '::#bar')->$*`. – alexchandel May 24 '19 at 21:18
5

The homework requests that you use the reference manuals, so I'll answer in those terms.

The Perl documentation is available at http://perldoc.perl.org/. The section that deals with variables is perldata. That will easily give you a usable answer.

In reality, I doubt that the complete answer is available in the documentation. There are special variables (see perlvar), and "use utf8;" can greatly affect the definition of "letter" and "number".

$ perl -E'use utf8; $é=123; say $é'
123

[ I only covered the identifier part. I just noticed the question is larger than that ]

ikegami
  • 1
    The point of the assignment is to write the regular expression. Teachers at this school probably just assume we all know Perl, even though it's not taught in any classes here (and why learn Perl if I know Python?). – Brendan Long Jan 26 '11 at 00:19
  • 7
    It's always good to learn new languages. It gives you fresh perspectives on how to approach many problems. – ikegami Jan 26 '11 at 00:23
  • 5
    @Brendan: The point is so you can get real work done in real time. – tchrist Jan 26 '11 at 00:46
  • @Brendan - Perl and Python may be somewhat interchangeable as far as usability goes, but not always so (language comparisons aside, the library sets are different). I'd strongly recommend learning both (and learning Perl if you already grok Python shouldn't be overly complicated). – DVK Jan 26 '11 at 01:32
  • @Brendan: You would learn Perl because it is much better at regular expressions than Python is. It also has a more flexible OO system. – tchrist Jan 26 '11 at 02:14
  • What's wrong with `re`? It seems like anything I could do with regular expressions that I can't do with `re` is complicated enough that I shouldn't use regular expressions for it ;) – Brendan Long Jan 26 '11 at 02:26
  • 7
    @Brendan: It’s not so much what’s “wrong” with Python regexes as what they lack compared to Perl’s: real and complete Unicode support including none of this wide build crud, [UTS#18 standards compliance](http://www.unicode.org/reports/tr18/), [structured a.k.a. grammatical regexes](http://stackoverflow.com/questions/764247/why-are-regular-expressions-so-controversial/4053506#4053506) for separating declaration from execution, recursion for stuff like `s/\((?:[^()]*+|(?0))*\)//g` to strip nested parens, tailoring of properties and casing, backtracking control, debugging&instrumentation, &c&c&c! – tchrist Jan 26 '11 at 02:44
5

The perlvar page of the Perl documentation has a section at the end roughly outlining the allowable syntax. In summary:

  1. Any combination of letters, digits, underscores, and the special sequence :: (or '), provided it starts with a letter or underscore.
  2. A sequence of digits.
  3. A single punctuation character.
  4. A single control character, which can also be written as caret-{letter}, e.g. ^W.
  5. An alphanumeric string starting with a control character.

Note that most of the identifiers other than the ones in set 1 are either given a special meaning by Perl, or are reserved and may gain a special meaning in later versions. But if you're just trying to work out what is a valid identifier, then that doesn't really matter in your case.
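As a rough illustration, the five sets listed above can be written as one Python alternation. This is an ASCII-only sketch under several assumptions: "letter" and "digit" mean `[A-Za-z]` and `[0-9]`, "punctuation" means Python's `string.punctuation`, control characters are limited to \x01 through \x1a, and the caret form uses an uppercase letter. All names here (`WORD`, `PERLVAR_IDENT`, `matches_perlvar_summary`) are made up for the example.

```python
import re
import string

WORD = r"[A-Za-z_](?:[A-Za-z0-9_]|::|')*"             # set 1
DIGITS = r"[0-9]+"                                     # set 2
PUNCT = "[" + re.escape(string.punctuation) + "]"      # set 3
CONTROL = r"(?:[\x01-\x1a]|\^[A-Z])"                   # set 4, literal or ^X
CONTROL_WORD = CONTROL + r"[A-Za-z0-9_]+"              # set 5, e.g. ^TAINT

PERLVAR_IDENT = re.compile(
    "|".join([WORD, DIGITS, PUNCT, CONTROL_WORD, CONTROL])
)


def matches_perlvar_summary(name):
    """True if the sigil-less name fits one of the five sketched sets."""
    return PERLVAR_IDENT.fullmatch(name) is not None


if __name__ == "__main__":
    for sample in ["foo::bar", "isn't_it", "123", "!", "^W", "^TAINT", "1abc"]:
        print(repr(sample), matches_perlvar_summary(sample))
```

It follows the summary in the documentation rather than what the interpreter actually accepts, so it says nothing about names reached through symbolic dereferencing.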

Anon.
  • 1
    @mscha: The homework is to create a regular expression. Finding the docs themselves is, at best, a distraction. – Anon. Jan 26 '11 at 00:30
  • 7
    I'm afraid that represents a **VERY** simplified version of reality. You're going to have to examine the lexer’s `scan_ident` function, plus the `UTF8_IS_START`, `isALNUM_utf8`, and `UTF8_IS_CONTINUED` macros. To a first approximation an identifier is something with only Alphabetic, Mark, Decimal_Number, or Connector_Punctuation type characters in it. You’ve also forgotten the MJD-style variables like `${^TAINT}` and `${^UNICODE}`. But that doesn’t mean you can’t have `${ "!##%^&--!!" }` type variables; those are perfectly valid. They just can’t be lexicals. **HTH&&HAND!** – tchrist Jan 26 '11 at 00:42
  • @tchrist: `^TAINT` (and similarly `^UNICODE`) are examples of set 5 - an alphanumeric string starting with a control character. Additionally, it appears that the asker's assignment is to produce a regex according to the language reference, rather than according to reality (which is a much more challenging task). – Anon. Jan 26 '11 at 00:55
  • 1
    @mscha - in the case of Perl, an inexperienced person would not be able to come up with a useful, correct, complete definition from the documentation. blah blah blah only perl can parse Perl blah blah blah. Bear in mind that this comment is coming from someone who readily yells at people asking HW questions on SO, down-votes such, and forgoes easy upvotes by refusing to post HW answers that'd take 1 sec to compose. – DVK Jan 26 '11 at 01:27
  • 1
    @DVK: My HW answer took me longer than 1 sec to compose. :) – tchrist Jan 26 '11 at 01:47
  • @tchrist - I had seen some Qs that would literally take 1 sec to type the answer to. – DVK Jan 26 '11 at 01:53
  • @DVK: Fast, Easy, or Correct: choose two. – tchrist Jan 26 '11 at 02:04
  • @Anon - The question also explicitly states that no help in regexes is required, only guidance on what the Perl language is. – OrangeDog Jan 26 '11 at 11:07
1

Because Perl has no official specification (Perl is whatever the perl interpreter can parse), these can be a little tricky to discern.

This page has examples of all the integer constant formats. The format of identifiers will need to be inferred from various pages in perldoc.

OrangeDog
  • Note also that anything that pretends there’s a such thing as a negative constant doesn’t understand the grammar; just run `perl -MO=Concise,-exec -le '$x = -3**$y'` to find out the whys and the wherefores. – tchrist Jan 26 '11 at 00:46
  • Perl6 is actually to spec, but yes... Perl5 and earlier do use the "The language is what the compiler says it is" method. – Jeff Ferland Jan 26 '11 at 00:58
  • 1
    @Autocracy: That overstates matters. – tchrist Jan 26 '11 at 02:05