Typeglobs and Function Prototypes
By: Jeff "japhy" Pinyan
Email comments to
japhy@pobox.com
If you want to make your program's functions look more like built-in functions,
without needing to use parentheses everywhere, or explicitly using references
with your complex functions, you should learn how to use function prototypes,
and how to work with typeglobs to write nice-looking, efficient code:
@pair = pop2 @array; # how'd he do that?
# this is spiffy too!
while (($a,$b) = one_each(@foo,@bar)) {
    print "$a is from \@foo, $b is from \@bar.\n";
}
How Typeglobs Work
With the exception of variables declared via my(), all variables in
Perl are held in, and accessible through, the symbol table. The symbol
table can be looked at as a hash of hashes, where the names of the hashes are
package names, like main:: or LWP::UserAgent::, and the keys
of the hashes these represent are symbol names. Take the following simple
code:
package Foobar;
$name = "Simon";
@friends = ("Joe", "Marcy", "Jack");
%friends = (
    $name => [ $friends[0], $friends[1] ],
    "Marcy" => [ $name, $friends[0] ],
    "Joe" => [ $name, $friends[2] ],
); # what complex relationships!
for my $person (@friends) {
    print "The friends of $person are: ";
    print join(", ", @{ $friends{$person} }), "\n";
}
The resulting symbol table of %Foobar:: could be displayed treating
it like a regular hash:
while (my ($key,$val) = each %Foobar::) {
    print "$key => $val\n";
}
name => *Foobar::name
friends => *Foobar::friends
What's in *FOO?
Well, first, remember that the variables $person, $key, and
$val were declared with my(), so Perl does not place them
in the symbol table. Now, we're left with an interesting display; Perl stores
the name of the variable in the symbol table as the key, with a rather
funny-looking expression as the value. That value is a typeglob, a
variable type in Perl that represents a symbol's entry in the symbol table.
The symbol table holds the information about scalars, arrays, hashes, functions,
and IO handles (filehandles, dirhandles, sockets). It is currently not possible
to extract a format from a typeglob.
*FOO =====> $FOO (scalar)
            @FOO (array)
            %FOO (hash)
            &FOO (subroutine)
            *FOO (typeglob)
             FOO (filehandle)
             FOO (dirhandle)
             FOO (socket)
             FOO (format name)
To access a typeglob, you use an asterisk (*) as the preceding
symbol. When a typeglob is printed, its fully qualified name is given; that
includes whatever package (namespace) it belongs to:
package CGI::Fast;
print *foo;
would display *CGI::Fast::foo. To access a specific piece of the
typeglob, use the *foo{THING} syntax, where THING is one of
the following strings: SCALAR ARRAY HASH CODE GLOB IO. Using
*foo{ARRAY} returns a reference to the array @foo, not the
array itself. You can get at
the actual array by doing @{*foo} or @{*foo{ARRAY}}. Because
*foo holds the symbol table entry for @foo, you can say
*foo->[$idx] and get $foo[$idx]; this works with
*foo->{key} and *foo->(@arguments), too. (But who would
want to?)
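To make the *foo{THING} syntax concrete, here is a minimal sketch. The package variables @foo and %foo and the subroutine foo() are made up for the illustration; nothing here goes beyond the syntax described above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical package variables, used only for this illustration.
our @foo = (10, 20, 30);
our %foo = (key => 'value');
sub foo { return "called with @_" }

my $aref = *foo{ARRAY};     # a reference to @foo
my $href = *foo{HASH};      # a reference to %foo
my $cref = *foo{CODE};      # a reference to &foo

print ref($aref), "\n";                     # the kind of reference we got
print "same array\n" if $aref == \@foo;     # both point at @foo
print join(" ", @{*foo}), "\n";             # the array, through the glob
print join(" ", @{*foo{ARRAY}}), "\n";      # the array, through its slot
print $href->{key}, "\n";
print $cref->(1, 2), "\n";
```

Each slot comes back as an ordinary reference, so everything you already know about dereferencing applies.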
Symbol Aliasing
Finally, since @{\@foo} and @{*foo} mean the same thing, you
can do magical things like aliasing, and selective aliasing:
# make $jer, @jer, %jer, ...
# aliases for $Gerald, @Gerald, %Gerald, ...
*jer = *Gerald;
*jer = \*Gerald; # same thing
*jer = *Gerald{GLOB}; # same thing
# only make &short an alias for &really_long_name
*short = *really_long_name{CODE};
*short = \&really_long_name; # same thing
# this leaves $short and $really_long_name
# two distinct variables
Perhaps it seems confusing that *a = *b and *a = \*b do the
same thing. When the RHS (right-hand side) of a glob assignment is a
reference, only the slot for the referenced value is aliased (as in the second
example, which aliases only the subroutine). In *a = \*b, though, the RHS is
a reference to a glob, so "only" the glob is aliased, which means the entire
glob is aliased.
Symbol aliasing allows you to say "all foo variables are really the
same as bar variables" or, if you want to be selective and picky,
"only the $foo variable is really the $bar variable, and all
the other foo variables are separate from bar variables."
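Here is a small sketch of the difference between a full alias and a selective one. The variable names follow the examples above, and the values are invented for the demonstration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical data for the illustration.
our $Gerald = 'scalar';
our @Gerald = (1, 2, 3);
our ($jer, @jer);

*jer = *Gerald;             # full alias: every slot is shared
push @jer, 4;               # really pushes onto @Gerald
$jer = 'changed';           # really assigns to $Gerald
print "@Gerald\n";
print "$Gerald\n";

sub really_long_name { return 'long' }
our $really_long_name = 'original';
our $short            = 'separate';

*short = \&really_long_name;    # selective alias: the code slot only
my $result = short();
print "$result\n";
print "$short / $really_long_name\n";   # the scalars stay distinct
```

After the full alias, changes made through the jer names show up in the Gerald variables. After the selective alias, only the subroutine is shared.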
Namespaces
One thing that *foo doesn't hold is the namespace foo::
which can be accessed via main::foo::. All namespaces (including
main:: itself) are held in main::; main:: can be
abbreviated as ::. Because namespaces act like hashes, you can use
any of the following to get at the symbol name in the package
pkg:
*main::pkg::name
*::pkg::name
$main::pkg::{'name'}
$main::{'pkg::'}{'name'}
${'main::'}{'pkg::'}{'name'}
$::pkg::{'name'}
$::{'pkg::'}{'name'}
${'::'}{'pkg::'}{'name'}
And, of course, you can leave the reference to main:: or ::
out of any of those. You'll notice that when you use the symbol table like a
hash, you use subscripts, and the leading symbol is a $ -- this
returns the typeglob, which is the value of the key you give the hash. The
hash interface exists so that you can access symbols dynamically. Be very
careful, though, if you use the direct syntax, to be sure to use the *,
and not a $.
*foo::bar; # the *bar typeglob, in the foo package
$foo::bar; # the $bar variable, in the foo package
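As a quick check that these spellings really do reach the same glob, here is a sketch using a made-up package pkg with a made-up variable $name. The forms that start with a quoted string, such as ${'main::'}, are symbolic references and are left out because this sketch runs under strict.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package pkg;
our $name = 'value';    # hypothetical symbol to look up

package main;

my @globs = (
    *main::pkg::name,           # direct syntax
    *::pkg::name,
    *pkg::name,
    $main::pkg::{'name'},       # hash syntax
    $main::{'pkg::'}{'name'},
    $::pkg::{'name'},
    $::{'pkg::'}{'name'},
    $pkg::{'name'},
);

# Every entry stringifies to the same fully qualified glob name.
print "$_\n" for @globs;
```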
You can also use a string on the right side of a typeglob assignment, and the
string will be converted by Perl into a typeglob itself:
*foo = "bar"; # *foo = *main::bar
*foo = "this::that"; # *foo = *this::that
This is good, because it works under strict, so if you have a variable
holding a symbol that you want to alias, you don't need to do something like
*foo = *{$symbol}, because Perl will automatically put in the
*{ ... } for you. Aliasing entire namespaces is not possible yet in
Perl.
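A minimal sketch of the string form, assuming a package variable $bar that exists only for this demonstration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

our $bar = 'I am $bar';     # hypothetical variable to alias
our $foo;

my $symbol = 'bar';
*foo = $symbol;             # Perl resolves the string to *main::bar

print "$foo\n";             # reads $bar through the alias
$foo = 'changed through the alias';
print "$bar\n";             # the change shows up in $bar
```

The string is looked up in the current package unless it is already qualified.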
Be sure, though, if you have a variable holding a package name, the string
either ends in ::, or you separate the variable from the trailing
:: it needs:
$pkg = "this::that::";
*foo = ${$pkg}{foo};
$pkg = "this::that";
*foo = ${$pkg . "::"}{foo};
*foo = ${"${pkg}::"}{foo};
*foo = ${"$pkg\:\:"}{foo};
Doing ${"$pkg::"}{foo} would have Perl read $pkg:: as a single name
involving a namespace pkg, instead of the variable $pkg followed by ::,
so the lookup would be botched. Perl would then try to assign an undefined
value to the glob, which it refuses to do. If you're using -w, Perl will
whine about the attempt:
#!/usr/bin/perl -w
$foo = 10;
*foo = undef;
print $foo;
Undefined value assigned to typeglob at - line 4.
10
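One way to avoid the trailing :: trap is to wrap the lookup in a small helper. This is only a sketch: glob_for() is a name invented here, and it has to switch off strict refs, because the package name is used as a symbolic reference.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package this::that;
our $foo = 'found it';      # hypothetical symbol to look up

package main;
our $alias;

# Return the glob for $symbol in $pkg (glob_for is a made-up helper).
sub glob_for {
    my ($pkg, $symbol) = @_;
    $pkg .= '::' unless $pkg =~ /::\z/;     # the stash name must end in ::
    no strict 'refs';                       # %{"this::that::"} is symbolic
    return ${$pkg}{$symbol};
}

*alias = glob_for('this::that', 'foo');
print "$alias\n";
```

Either glob_for('this::that', 'foo') or glob_for('this::that::', 'foo') ends up at the same symbol table entry.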
Passing Globs
Let's say you wanted a function to receive a filehandle as its argument. So,
seeing how other Perl functions work, you decide to make a logit()
function that will print a message (defaulting to the value of $_) to a
given filehandle, along with some other information. You predeclare it so that
you don't need to use parentheses.
#!/usr/bin/perl -w
use strict;
sub logit; # predeclared
my @messages = ("yeah", "whatever");
open LOG, ">>logfile" or die "can't append to logfile: $!";
logit LOG, "Starting record";
for (@messages) {
    logit LOG; # defaulting to $_
}
close LOG;
"Uh oh," you think to yourself. "How do I assign that filehandle to a
variable?" You try using a scalar:
sub logit {
    my ($fh,$msg) = @_;
    my $now = localtime;
    $msg = $_ unless defined $msg; # for defaulting
    print $fh "[$now] $msg\n";
}
Sadly, your program doesn't work like you hoped. Turning on -w and
use strict can only help, so you do. Perl responds with the following:
Useless use of a constant in void context at line 8
Can't locate object method "logit" via package "IO::Handle" at line 8
Oh my. Line 8 is where we first called the logit() function. Since Perl
knows LOG is a filehandle, it's assuming logit LOG is calling
a method logit() on LOG. And these types of calls
are transparently IO::Handle methods:
print FH @args;
# same as
require IO::Handle;
FH->print(@args);
So what can you do to make this work how you wanted? Well, you can put in
some parentheses, but then strict whines that LOG is a
bareword. So you'd need to send it as *LOG or \*LOG (a
glob, or a reference to a glob).
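Put together, the parenthesized version might look like the sketch below. To keep it self-contained, an in-memory filehandle stands in for the logfile; that substitution is mine, not part of the original program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub logit {
    my ($fh, $msg) = @_;
    my $now = localtime;
    $msg = $_ unless defined $msg;      # default to $_
    print $fh "[$now] $msg\n";
}

# Assumption: an in-memory filehandle replaces ">>logfile" for this sketch.
my $buffer = '';
open(LOG, '>', \$buffer) or die "can't open in-memory log: $!";

logit(\*LOG, "Starting record");        # parentheses plus a glob reference
for ("yeah", "whatever") {
    logit(\*LOG);                       # defaulting to $_
}
close LOG;

print $buffer;
```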
Now your program works just nicely. Luckily, the print() function
works the same way when it gets a bareword like LOG, a glob like
*LOG, or a reference to a typeglob like \*LOG, or when it
gets a variable containing any of these. Yes, we could use 'LOG' in
our function call, but that's styleless.
So what does $fh look like if the function is sent a string, a glob,
or a reference to a glob? Let's do a little test:
myfunc('LOG');
myfunc(*LOG);
myfunc(\*LOG);
sub myfunc {
    my $fh = $_[0];
    print "$fh\n";
}
will give us these results:
LOG
*main::LOG
GLOB(0xc6268)
So we see that there is a difference, depending on how we send our
filehandle to the function -- in the first case, we don't send the filehandle
at all, just a string. In the second case, we have the name of the glob that
the filehandle is located in. In the third case, we have a reference to that
glob. print() does The Right Thing.
If you want to pass a large array to a function, but don't want to copy the
data, you can pass it by reference:
@array = (...);
func(\@array);
# or
func(*array);
sub func {
    my $aref = $_[0];
    # or
    local *alias = $_[0];
}
It is important to note that, since my() variables don't exist in the
symbol table, my(*glob) is not allowed, as it would be a meaningless
glob. So, we use local(), because it temporarily changes the symbol
table to give a variable (or a glob) a new value. If we use *alias,
we can access the array via $alias[$idx]; if we use $aref,
we have to dereference like so: $aref->[$idx]. Personal benchmarks
show that it is probably best to use $aref. Sending it as a glob or
a reference didn't appear to have any major effect.
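Here is a sketch of both styles side by side. The function names and data are invented for the example, and @alias is declared with our only so that strict accepts it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

our @alias;     # declared so that strict accepts @alias below

# Double every element in place, reaching the caller's array by reference.
sub double_by_ref {
    my $aref = $_[0];
    $_ *= 2 for @$aref;
}

# The same job through a temporarily aliased glob.
sub double_by_glob {
    local *alias = $_[0];       # @alias is now the caller's array
    $_ *= 2 for @alias;
}

my @numbers = (1, 2, 3);        # hypothetical data
double_by_ref(\@numbers);
double_by_glob(\@numbers);
print "@numbers\n";             # each element was doubled twice
```

Neither function copies the array. Once double_by_glob() returns, local() restores @alias to whatever it held before.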
If you're puzzled how func() can accept a reference to an array or a
glob, and assign it to a scalar or typeglob, and still work regardless of the
combination, let me show you how it works out:
     func(...)    assignment    becomes
1.   \@array      $aref         $aref  = \@array
2.   \@array      *alias        *alias = \@array
3.   *array       $aref         $aref  = *array
4.   *array       *alias        *alias = *array
In case 1, $aref is just a reference to @array. In case 2,
we have selective aliasing. In case 3, we're putting a glob in a scalar, as
we did when we sent myfunc() the glob *LOG, so the scalar
can be used wherever the glob would be used; $aref->[$idx] is just
like *array->[$idx], which is $array[$idx]. And in case 4,
we do a full symbol aliasing.
Function Prototypes
"So how can I code with style? I want to make functions that look like the
built-in ones. How could I write a function like each(), except it
would take two arrays, and iterate over them at the same time, returning the
first element of each array, then the next, etc.?"
The column is "Coding with Style", you know.
The answer is to use function prototypes. They are a brief note to Perl to
tell it how to treat the arguments it gets. You can also use a forward
declaration of your function, so that you don't need parentheses when using
strict; or, simply define your function before it gets used. Note to
C/C++ programmers: prototypes do NOT name the variables to be used in the
function (you must assign those normally, getting them from @_), and
a function can only have one prototype.
For our first prototype, let's make a simple function that emulates Perl's
scalar() function:
sub my_scalar ($); # forward declaration <-- semicolon!
$a = my_scalar localtime;
sub my_scalar ($) { # prototypes MUST match
    return $_[0];
}
Notice the semicolon after the function's forward declaration! Anyway, what
the ($) prototype means is that the function only gets one argument,
and this argument will be put in scalar context. This is really all the
scalar() function does. Sending the function a list of values will
result in Perl whining at you that the function has gotten too many arguments,
but sending it an array will return the number of elements in the array.
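A short sketch of that behavior, with an invented @list for the array case:

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub my_scalar ($);      # forward declaration

my @list = ('a', 'b', 'c');             # hypothetical data
my $count = my_scalar @list;            # the array lands in scalar context
my $time  = my_scalar localtime;        # localtime returns its string form

print "$count\n";
print "$time\n";

sub my_scalar ($) {
    return $_[0];
}
```

The array hands over its element count, just as it would with scalar(@list).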
Prototypes can be made up of the following symbols:
- $ -- this argument is in scalar context
- \$ -- this argument must be a simple scalar (not just a scalar value)
- @ -- slurp the rest of the arguments as a list
- \@ -- this argument must be a simple array
- % -- slurp the rest of the arguments as a list (does not check for an even
  number of elements!)
- \% -- this argument must be a real hash
- * -- this argument is interpreted as a reference to a glob
- \* -- this argument must be a real typeglob
- & -- argument must be an anonymous subroutine (prefixed with sub,
  unless it is the first argument)
- ; -- the rest of the arguments after the semicolon are optional
It is important to realize that arguments matching a backslashed prototype
such as \@ or \% are stored in @_ as references.
By "simple scalar", I mean that you can pass $this, but not
$that[0]. You can "get around" that by doing the ever nasty
${ \( $that[0] ) } but there's no reason to -- don't use prototypes
in that case, or use a temporary variable.
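To see what actually lands in @_, here is a sketch with a made-up function, describe(), that reports on each slot of its prototype:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical function: report what each prototyped slot receives.
sub describe (\@\%$) {
    my ($aref, $href, $scalar) = @_;
    return join ", ", ref($aref), ref($href), $scalar;
}

my @items = (1, 2, 3);      # invented data
my %map   = (a => 1);

# No backslashes at the call site; the prototype supplies them.
my $summary = describe @items, %map, @items;
print "$summary\n";         # ARRAY, HASH, 3
```

The first two arguments arrive as references, while the third @items sits in a $ slot and so arrives as its element count.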
So if you wanted to make your filehandle function look more like a regular
Perl function, you could do:
sub logit (*;$); # predeclare
logit LOG;
logit LOG, "some message";
sub logit (*;$) {
    my ($fh,$msg) = @_;
    my $now = localtime;
    $msg = $_ unless defined $msg; # for defaulting
    print $fh "[$now] $msg\n";
}
That would tell Perl, even under strict, that logit LOG will
work like we'd like it to. If we had printed the value of $fh, we'd
see *main::LOG. We could rewrite the function to use local():
sub logit (*;$) {
    local *FH = shift;
    my $msg = @_ ? shift : $_; # another way...
    my $now = localtime;
    $msg = $_ unless defined $msg; # for defaulting
    print FH "[$now] $msg\n";
}
This works as expected, which is A Good Thing.
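A third variation, offered as a sketch rather than the one true way: perlsub notes that an argument in a * slot can reach the function either as a simple scalar or as a reference to a typeglob, and suggests Symbol::qualify_to_ref() to turn it into a glob reference every time. The in-memory filehandle is my stand-in for a real logfile.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Symbol qw(qualify_to_ref);

sub logit (*;$) {
    my $fh  = qualify_to_ref(shift, caller);    # always a glob reference
    my $msg = @_ ? shift : $_;
    $msg = $_ unless defined $msg;              # for defaulting
    my $now = localtime;
    print $fh "[$now] $msg\n";
}

# Assumption: an in-memory filehandle replaces the logfile for this sketch.
my $buffer = '';
open(LOG, '>', \$buffer) or die "can't open in-memory log: $!";

logit LOG, "some message";
for ("yeah") {
    logit LOG;                                  # defaulting to $_
}
close LOG;

print $buffer;
```

Because the name is qualified against the caller's package, this keeps working when logit() is called from a package other than main.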
Examples
Let's make the functions shown at the beginning of the article.
# pop2 ARRAY - remove (and return) the
# last two elements of an array
# ex: ($a,$b) = pop2 @array;
sub pop2 (\@) { # proto. demands an array
    my $aref = $_[0];
    return splice(@$aref, -2);
    # or return reverse(pop @$aref, pop @$aref);
}
This function looks like a regular function, and the magic of references goes
on behind the scenes so you needn't send it \@array or *array
like you would if you didn't use prototypes.
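A quick usage sketch, with invented data, showing that the caller's array really is modified:

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub pop2 (\@) {
    my $aref = $_[0];
    return splice(@$aref, -2);
}

my @array = (1, 2, 3, 4, 5);    # hypothetical data
my @pair  = pop2 @array;        # no backslash, no parentheses

print "pair: @pair\n";          # pair: 4 5
print "left: @array\n";         # left: 1 2 3
```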
As for the one_each() function, you need to use a little trick. You
store a variable that only the function sees, by putting the function in its
own block, and using a my() variable that can't be seen outside of
the block. This variable is used to hold the current index to be used in the
array lookups.
# one_each ARRAY, ARRAY [, LIMIT] - iterate over two
# arrays at once, with an optional limit of iterations
# returns an element of each array on each call
# ex: while (($a,$b) = one_each(@A,@B,5)) { ... }
{
    my ($pos,$lim,$as,$bs) = (0,0,0,0);
    sub one_each (\@\@;$) { # optional third arg
        my ($aref,$bref,$a,$b);
        ($aref,$bref) = @_;
        if ($pos == 0) {
            $lim = $_[2];
            ($as,$bs) = (scalar @$aref, scalar @$bref);
            my ($min,$max) = ($as,$bs)[$as > $bs, $as < $bs];
            $lim = $min if not defined $lim;
            $lim = $max if $lim > $max;
            $lim = int abs $lim;
        }
        # end iterations
        if ($pos == $lim) {
            ($pos,$lim,$as,$bs) = (0,0,0,0); # reset
            return;
        }
        ($a,$b) = ($aref->[$pos], $bref->[$pos]);
        $pos++;
        return ($a,$b);
    }
}
This function is very intelligent -- it uses four private variables: the
index to retrieve ($pos), the maximum number of iterations
($lim), the size of the first array ($as), and the size of
the second array ($bs). It only determines $as and
$bs once, the first time. As a default, this function will only go
through the arrays as many times as the shorter array has elements. Until it
has gone through the prescribed (or default) number of times, it returns the
next element of the two arrays it gets, and increments the position counter.
When it is to stop, it resets all its private variables, and returns nothing,
which evaluates as false in scalar context, and an empty list in list context.
This function is a good introduction to closures, to be explained in a
future article. Closures are a way of using private data that remains in
existence after it appears to go out of scope.
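The private-variable trick is easier to see in a stripped-down form. This sketch has nothing to do with arrays. It just keeps a counter that only two invented functions, next_id() and reset_ids(), can reach.

```perl
#!/usr/bin/perl
use strict;
use warnings;

{
    my $count = 0;      # visible only inside this block

    # Both functions share the same private $count.
    sub next_id   { return ++$count }
    sub reset_ids { $count = 0; return }
}

my $first  = next_id();
my $second = next_id();
reset_ids();
my $again  = next_id();

print "$first $second $again\n";    # 1 2 1
```

Outside the block there is no way to name $count, yet it lives on between calls, which is exactly what one_each() relies on for $pos and $lim.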
Traversing the Symbol Table
The following code will go through the main:: symbol table, and tell
you all the scalars, arrays, hashes, and subroutines you have defined. It is
paraphrased and borrowed from "Programming Perl" and the Perl documentation.
for (sort keys %main::) {
    # do typeglob assignment, and treat main::
    # like a hash, to do a dynamic symbol lookup
    local *sym = $main::{$_};
    print "\$$_ defined\n" if defined $sym;
    print "\@$_ defined\n" if defined @sym;
    print "\%$_ defined\n" if defined %sym;
    print "\&$_ defined\n" if defined &sym;
    print "FH $_ defined\n" if defined *sym{IO};
}
We test defined *sym{IO} for a special reason, and it is a special use
of the *sym{THING} syntax. Remember that it returns a
reference, not the actual data. Thus, defined *sym{SCALAR} would
always be true, because \$sym is always defined, because it is a
reference. However, because filehandles are not top-level data types in Perl,
they can be tested for definedness through the reference returned by the
*sym{IO} syntax. This is because *sym{IO}
returns a reference to an IO::Handle object, which is how Perl stores
filehandles. Dirhandles and formats are still very difficult to get at from
the symbol table. Filehandles and dirhandles can be used via their glob or
reference-to-glob form. Formats are a totally different story, but there
does not appear to be a need to ever GET the format itself from the glob, and
the glob form should work fine.
The code can be modified to start in main::, and work your way through
any namespace and print all the defined variables. Quite an interesting thing
to try on your own. Hint: keep track of what namespaces you've seen, so you
don't keep going through main::.
Also, as the defined() documents say, defined(@array) will
not return what you might expect it to. Testing for scalar @array
might be more of what you want. In this case, though, defined(@array)
is what I want to use.
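One caveat for newer perls: as far as I know, defined(@array) and defined(%hash) were deprecated for a long time and have been a compile-time error since Perl 5.22. The sketch below is one possible substitute, not the approach from "Programming Perl". It asks the glob whether its ARRAY and HASH slots have been used at all. The helper slots_for() and the sample variables are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical package variables for the traversal to find.
our $name    = 'Simon';
our @friends = ('Joe', 'Marcy');
our %ages    = (Joe => 30);
sub greet { return 'hello' }

# List the sigil-prefixed things defined under one symbol name.
sub slots_for {
    my ($pkg, $symbol) = @_;
    no strict 'refs';                       # symbolic glob lookup by name
    my $glob = \*{"${pkg}::$symbol"};
    my @found;
    push @found, "\$$symbol"  if defined ${ *{$glob}{SCALAR} };
    push @found, "\@$symbol"  if defined *{$glob}{ARRAY};
    push @found, "%$symbol"   if defined *{$glob}{HASH};
    push @found, "&$symbol"   if defined *{$glob}{CODE};
    push @found, "FH $symbol" if defined *{$glob}{IO};
    return @found;
}

for my $symbol (sort keys %main::) {
    print "$_ defined\n" for slots_for('main', $symbol);
}
```

The scalar slot is still tested by value, because *sym{SCALAR} always hands back a reference.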
Resources
To read about typeglobs, read perldata and perlref. Also,
perlmod discusses typeglobs and symbol tables. For more on function
prototypes, read the perlsub documentation.
This is all available on your computer (if you have perl) by running
perldoc section, or going to
http://language.perl.com/.