Typeglobs and Function Prototypes

By: Jeff "japhy" Pinyan

Email comments to japhy@pobox.com

If you want to make your program's functions look more like built-in functions, without needing to use parentheses everywhere, or explicitly using references with your complex functions, you should learn how to use function prototypes, and how to work with typeglobs to write nice-looking, efficient code:

    @pair = pop2 @array;  # how'd he do that?

    # this is spiffy too!
    while (($a,$b) = one_each(@foo,@bar)) {
      print "$a is from \@foo, $b is from \@bar.\n";
    }

How Typeglobs Work

With the exception of variables declared via my(), all variables in Perl are held in, and accessible through, the symbol table. The symbol table can be looked at as a hash of hashes, where the names of the hashes are package names, like main:: or LWP::UserAgent::, and the keys of the hashes these represent are symbol names. Take the following simple code:

    package Foobar;

    $name = "Simon";
    @friends = ("Joe", "Marcy", "Jack");
    %friends = (
      $name   => [ $friends[0], $friends[1] ],
      "Marcy" => [ $name, $friends[0] ],
      "Joe"   => [ $name, $friends[2] ],
    );  # what complex relationships!

    for my $person (@friends) {
      print "The friends of $person are: ";
      print join(", ", @{ $friends{$person} }), "\n";
    }

The resulting symbol table of %Foobar:: could be displayed treating it like a regular hash:

    while (my ($key,$val) = each %Foobar::) {
      print "$key => $val\n";
    }

    name => *Foobar::name
    friends => *Foobar::friends

What's in *FOO?

Well, first, remember that the variables $person, $key, and $val were declared with my(), so Perl does not place them in the symbol table. Now, we're left with an interesting display; Perl stores the name of the variable in the symbol table as the key, with a rather funny-looking expression as the value. That value is a typeglob, a variable type in Perl that represents a symbol's entry in the symbol table. The symbol table holds the information about scalars, arrays, hashes, functions, and IO handles (filehandles, dirhandles, sockets). At the time of writing, it is not possible to extract a format from a typeglob (later perls add *foo{FORMAT}).

    *FOO =====> $FOO  (scalar)
                @FOO  (array)
                %FOO  (hash)
                &FOO  (subroutine)
                *FOO  (typeglob)
                FOO   (filehandle)
                FOO   (dirhandle)
                FOO   (socket)
                FOO   (format name)

To access a typeglob, you use an asterisk (*) as the preceding symbol. When a typeglob is printed, its fully qualified name is given; that includes whatever package (namespace) it belongs to:

    package CGI::Fast;
    print *foo;

would display *CGI::Fast::foo. To access a specific piece of the typeglob, use the *foo{THING} syntax, where THING is one of the following strings: SCALAR ARRAY HASH CODE GLOB IO. Using *foo{ARRAY} returns a reference to the array @foo, not the array itself. You can get at the actual array by doing @{*foo} or @{*foo{ARRAY}}. Because *foo holds the symbol table entry for @foo, you can say *foo->[$idx] and get $foo[$idx]; this works with *foo->{key} and *foo->(@arguments), too. (But who would want to?)
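To make the *foo{THING} syntax concrete, here is a minimal sketch. The names @colors, %ages, and greet() are made up for the illustration, and our() is used only so that the sketch runs under strict.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical package variables, declared with our() so strict is happy.
our @colors = ("red", "green", "blue");
our %ages   = (Simon => 30);
sub greet { return "hello, $_[0]" }

# Each *name{THING} lookup hands back a reference to that slot of the glob.
my $aref = *colors{ARRAY};
my $href = *ages{HASH};
my $cref = *greet{CODE};

# The reference points at the very same array as \@colors.
my $same_array = ($aref == \@colors) ? "yes" : "no";
my $second     = $aref->[1];
my $age        = $href->{Simon};
my $greeting   = $cref->("japhy");

print "same array? $same_array\n";
print "second color: $second\n";
print "Simon's age: $age\n";
print "$greeting\n";
```

Comparing $aref with \@colors shows that no copy is made: both references lead to the one array held in the glob.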

Symbol Aliasing

Finally, since @{\@foo} and @{*foo} mean the same thing, you can do magical things like aliasing, and selective aliasing:

    # make $jer, @jer, %jer, ...
    # aliases for $Gerald, @Gerald, %Gerald, ...
    *jer = *Gerald;
    *jer = \*Gerald;       # same thing
    *jer = *Gerald{GLOB};  # same thing

    # only make &short an alias for &really_long_name
    *short = *really_long_name{CODE};
    *short = \&really_long_name;  # same thing
    # this leaves $short and $really_long_name
    # two distinct variables

Perhaps it seems confusing that *a = *b and *a = \*b do the same thing; this is because if the RHS (right-hand side) of a glob assignment is a reference, only the referenced value is aliased (as shown in the second example, which only aliases the subroutine name). When the RHS is a glob reference, however, "only" the glob is aliased (which means the entire glob is aliased).

Symbol aliasing allows you to say "all foo variables are really the same as bar variables" or, if you want to be selective and picky, "only the $foo variable is really the $bar variable, and all the other foo variables are separate from bar variables."
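A small sketch of the selective case may help. It assumes package variables named foo and bar (declared with our() so it runs under strict) and aliases only the scalar slot.

```perl
#!/usr/bin/perl
use strict;
use warnings;

our ($foo, $bar, @foo, @bar);

$bar = "scalar from bar";
@bar = ("array", "from", "bar");
@foo = ("array", "from", "foo");

# Selective aliasing: only the scalar slot of *foo now points at $bar.
*foo = \$bar;

$foo = "changed through foo";    # writes to the same scalar as $bar
my $scalar_result = $bar;
my $array_result  = "@foo";      # @foo was never aliased

print "\$bar is now: $scalar_result\n";
print "\@foo is still: $array_result\n";
```

Assigning *foo = *bar instead would alias every slot, so @foo and @bar would become the same array as well.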

Namespaces

One thing that *foo doesn't hold is the namespace foo:: which can be accessed via main::foo::. All namespaces (including main:: itself) are held in main::; main:: can be abbreviated as ::. Because namespaces act like hashes, you can use any of the following to get at the symbol name in the package pkg:

    *main::pkg::name
    *::pkg::name
    $main::pkg::{'name'}
    $main::{'pkg::'}{'name'}
    ${'main::'}{'pkg::'}{'name'}
    $::pkg::{'name'}
    $::{'pkg::'}{'name'}
    ${'::'}{'pkg::'}{'name'}

And, of course, you can leave the reference to main:: or :: out of any of those. You'll notice that when you use the symbol table like a hash, you use subscripts, and the leading symbol is a $ -- this returns the typeglob, which is the value of the key you give the hash. The hash interface exists so that you can access symbols dynamically. Be very careful, though, if you use the direct syntax, to be sure to use the *, and not a $.

    *foo::bar;  # the *bar typeglob, in the foo package
    $foo::bar;  # the $bar variable, in the foo package

You can also use a string on the right side of a typeglob assignment, and the string will be converted by Perl into a typeglob itself:

    *foo = "bar";         # *foo = *main::bar
    *foo = "this::that";  # *foo = *this::that

This is good, because it works under strict, so if you have a variable holding a symbol that you want to alias, you don't need to do something like *foo = *{$symbol}, because Perl will automatically put in the *{ ... } for you. Aliasing entire namespaces is not possible yet in Perl.
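The two dynamic techniques above, indexing a symbol table like a hash and assigning a string to a glob, can be sketched together. The package Pkg and the variables $symbol, $target, and $alias are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package Pkg;
our $name = "Simon";

package main;
our $alias;                            # declared so strict accepts $alias

# Dynamic lookup: index the package's symbol table like a hash.
my $symbol = "name";                   # hypothetical: could come from config
my $glob   = $Pkg::{$symbol};          # the typeglob stored under that key
my $value  = ${ *{$glob}{SCALAR} };    # scalar slot of that glob

# String on the right-hand side: Perl looks up the glob named by the string.
my $target = "Pkg::name";
*alias = $target;

print "glob:  $glob\n";
print "value: $value\n";
print "alias: $alias\n";
```

Note that the hash form uses a leading $ and yields the glob, while the string form never needs a symbolic dereference, which is why it passes under strict.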

Be sure, though, that if you have a variable holding a package name, the string either ends in ::, or you separate the variable from the trailing :: it needs:

    $pkg = "this::that::";
    *foo = ${$pkg}{foo};

    $pkg = "this::that";
    *foo = ${$pkg . "::"}{foo};
    *foo = ${"${pkg}::"}{foo};
    *foo = ${"$pkg\:\:"}{foo};

Doing ${"$pkg::"}{foo} would have Perl interpret $pkg:: as a namespace pkg, so the lookup would be botched and Perl would try to assign an undefined value to the glob, which it doesn't like doing. If you're using -w, Perl will whine about your trying to assign the glob an undefined value (which it will not do):

    #!/usr/bin/perl -w

    $foo = 10;
    *foo = undef;
    print $foo;

    Undefined value assigned to typeglob at - line 4.
    10

Passing Globs

Let's say you wanted a function to receive a filehandle as its argument. So, seeing how other Perl functions work, you decide to make a logit() function that will print a message (defaulting to the value of $_) to a given filehandle, along with some other information. You predeclare it so that you don't need to use parentheses.

    #!/usr/bin/perl -w

    use strict;
    sub logit;  # predeclared

    my @messages = ("yeah", "whatever");
    open LOG, ">>logfile" or die "can't append to logfile: $!";
    logit LOG, "Starting record";
    for (@messages) {
      logit LOG;  # defaulting to $_
    }
    close LOG;

"Uh oh," you think to yourself. "How do I assign that filehandle to a variable?" You try using a scalar:

    sub logit {
      my ($fh,$msg) = @_;
      my $now = localtime;
      $msg = $_ unless defined $msg;  # for defaulting
      print $fh "[$now] $msg\n";
    }

Sadly, your program doesn't work like you hoped. And turning on -w and use strict can only help, so you do. Perl reports the following:

    Useless use of a constant in void context at line 8
    Can't locate object method "logit" via package "IO::Handle" at line 8

Oh my. Line 8 is where you first called the logit() function. Since Perl knows LOG is a filehandle, it assumes logit LOG is a method call, with logit() being invoked on LOG. Calls of this shape on built-in functions are transparently IO::Handle methods:

    print FH @args;
    # same as
    require IO::Handle;
    FH->print(@args);

So what can you do to make this work how you wanted? Well, you can put in some parentheses, but then strict whines that LOG is a bareword. So you'd need to send it as *LOG or \*LOG (a glob, or a reference to a glob).

Now your program works just nicely. Luckily, the print() function works the same way when it gets a bareword like LOG, a glob like *LOG, or a reference to a typeglob like \*LOG, or when it gets a variable containing any of these. Yes, we could use 'LOG' in our function call, but that's styleless.
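Here is a sketch of the repaired program with the filehandle sent as a glob and as a reference to a glob. One assumption to note: to keep the example self-contained it logs to an in-memory buffer ($buffer) rather than to the file "logfile".

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub logit {
    my ($fh, $msg) = @_;
    my $now = localtime;
    $msg = $_ unless defined $msg;    # default to $_
    print $fh "[$now] $msg\n";
}

# Assumption for the sketch: an in-memory log instead of ">>logfile".
my $buffer = '';
open LOG, '>', \$buffer or die "can't open in-memory log: $!";

logit(\*LOG, "Starting record");      # a reference to the glob
for ("yeah", "whatever") {
    logit(*LOG);                      # the glob itself; message defaults to $_
}
close LOG;

my @lines = split /\n/, $buffer;
print "$_\n" for @lines;
```

Both call styles end up in the same print $fh statement, which accepts either form.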

So what does $fh look like if the function is sent a string, a glob, or a reference to a glob? Let's do a little test:

    myfunc('LOG');
    myfunc(*LOG);
    myfunc(\*LOG);

    sub myfunc {
      my $fh = $_[0];
      print "$fh\n";
    }

will give us these results:

    LOG
    *main::LOG
    GLOB(0xc6268)

So we see that there is a difference, depending on how we send our filehandle to the function -- in the first case, we don't send the filehandle at all, just a string. In the second case, we have the name of the glob that the filehandle is located in. In the third case, we have a reference to that glob. print() does The Right Thing.

If you want to pass a large array to a function, but don't want to copy the data, you can pass it by reference:

    @array = (...);
    func(\@array);  # or func(*array);

    sub func {
      my $aref = $_[0];  # or local *alias = $_[0];
    }

It is important to note that, since my() variables don't exist in the symbol table, my(*glob) is not allowed, as it would be a meaningless glob. So, we use local(), because it temporarily changes the symbol table to give a variable (or a glob) a new value. If we use *alias, we can access the array via $alias[$idx]; if we use $aref, we have to dereference like so: $aref->[$idx]. Personal benchmarks show that it is probably best to use $aref. Sending it as a glob or a reference didn't appear to have any major effect.

If you're puzzled how func() can accept a reference to an array or a glob, and assign it to a scalar or typeglob, and still work regardless of the combination, let me show you how it works out:

       func(...)   assignment   becomes
    1. \@array     $aref        $aref  = \@array
    2. \@array     *alias       *alias = \@array
    3. *array      $aref        $aref  = *array
    4. *array      *alias       *alias = *array

In case 1, $aref is just a reference to @array. In case 2, we have selective aliasing. In case 3, we're putting a glob in a scalar, as we did when we sent myfunc() the glob *LOG, so the scalar can be used wherever the glob would be used; $aref->[$idx] is just like *array->[$idx], which is $array[$idx]. And in case 4, we do a full symbol aliasing.
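The four combinations can be checked with a short sketch. The helper names via_scalar() and via_glob() are invented here, and @alias is declared with our() only so that $alias[1] compiles under strict.

```perl
#!/usr/bin/perl
use strict;
use warnings;

our @array = (10, 20, 30);
our @alias;                       # declared so strict accepts $alias[...]

# Receive the argument in a scalar, then dereference it.
sub via_scalar { my $aref = $_[0]; return $aref->[1]; }

# Receive the argument by (temporarily) aliasing a glob to it.
sub via_glob   { local *alias = $_[0]; return $alias[1]; }

my @results = (
    via_scalar(\@array),          # case 1: $aref  = \@array
    via_glob(\@array),            # case 2: *alias = \@array
    via_scalar(*array),           # case 3: $aref  = *array
    via_glob(*array),             # case 4: *alias = *array
);

print "case $_: $results[$_ - 1]\n" for 1 .. 4;
```

Every case reads the same element of @array, and because via_glob() uses local(), *alias is restored once the function returns.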

Function Prototypes

"So how can I code with style? I want to make functions that look like the built-in ones. How could I write a function like each(), except it would take two arrays, and iterate over them at the same time, returning the first element of each array, then the next, etc.?"

The column is "Coding with Style", you know.

The answer is to use function prototypes. They are a brief note to Perl to tell it how to treat the arguments it gets. You can also use a forward declaration of your function, so that you don't need parentheses when using strict; or, simply define your function before it gets used. Note to C/C++ programmers: prototypes do NOT name the variables to be used in the function (you must assign those normally, getting them from @_), and a function can only have one prototype.

For our first prototype, let's make a simple function that emulates Perl's scalar() function:

    sub my_scalar ($);  # forward declaration <-- semicolon!
    $a = my_scalar localtime;

    sub my_scalar ($) {  # prototypes MUST match
      return $_[0];
    }

Notice the semicolon after the function's forward declaration! Anyway, what the ($) prototype means is that the function only gets one argument, and this argument will be put in scalar context. This is really all the scalar() function does. Sending the function a list of values will result in Perl whining at you that the function has gotten too many arguments, but sending it an array will return the number of elements in the array.

Prototypes can be made up of the following symbols:
$
this argument is in scalar context
\$
this argument must be a simple scalar (not just a scalar value)
@
slurp the rest of the arguments as a list
\@
this argument must be a simple array
%
slurp the rest of the arguments as a list (does not check for an even number of elements!)
\%
this argument must be a real hash
*
this argument is interpreted as a reference to a glob
\*
this argument must be a real typeglob
&
argument must be an anonymous subroutine (prefixed with sub, unless it is the first argument)
;
the rest of the arguments after the semicolon are optional
It is important to realize that arguments matched by a prototype such as \@ or \% arrive in @_ as references. A "simple scalar" means you can pass $this, but not $that[0]. You can "get around" that by doing the ever nasty ${ \( $that[0] ) } but there's no reason to -- don't use prototypes in that case, or use a temporary variable.
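To see what actually arrives in @_, here is a minimal sketch with a made-up function, describe(), that uses a (\@$) prototype.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: report what arrives in @_ for a (\@$) prototype.
sub describe (\@$) {
    my ($aref, $count) = @_;
    return ref($aref) . " of " . scalar(@$aref)
         . " elements, scalar argument $count";
}

my @letters = ("a", "b", "c");
my @numbers = (1, 2);

# No backslash and no parentheses: the prototype supplies the reference
# for @letters and puts @numbers in scalar context.
my $report = describe @letters, @numbers;

print "$report\n";
```

The first argument shows up as an ARRAY reference, while the second array is reduced to its element count by the $ in the prototype.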

So if you wanted to make your filehandle function look more like a regular Perl function, you could do:

    sub logit (*;$);  # predeclare

    logit LOG;
    logit LOG, "some message";

    sub logit (*;$) {
      my ($fh,$msg) = @_;
      my $now = localtime;
      $msg = $_ unless defined $msg;  # for defaulting
      print $fh "[$now] $msg\n";
    }

That would tell Perl, even under strict, that logit LOG will work like we'd like it to. If we had printed the value of $fh, we'd see *main::LOG (newer perls hand the function the plain string LOG instead; perlsub describes Symbol::qualify_to_ref for that case). We could rewrite the function to use local():

    sub logit (*;$) {
      local *FH = shift;
      my $msg = @_ ? shift : $_;  # another way...
      my $now = localtime;
      $msg = $_ unless defined $msg;  # for defaulting
      print FH "[$now] $msg\n";
    }

This works as expected, which is A Good Thing.
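For readers on a newer perl, where a bareword passed through the * prototype may arrive as a plain string, the following sketch follows the approach suggested in perlsub and converts the argument with Symbol::qualify_to_ref. The in-memory buffer and the shortened message format are assumptions made to keep the example self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Symbol qw(qualify_to_ref);

sub logit (*;$) {
    # Turn a bareword name, a glob, or a glob reference into a glob
    # reference, qualified in the caller's package.
    my $fh  = qualify_to_ref(shift, scalar caller);
    my $msg = @_ ? shift : $_;        # default to $_
    print $fh "$msg\n";
}

# Assumption for the sketch: log to an in-memory buffer.
my $buffer = '';
open LOG, '>', \$buffer or die "can't open in-memory log: $!";

logit LOG, "some message";
for ("from the default") {
    logit LOG;                        # message defaults to $_
}
close LOG;

print $buffer;
```

Because the function is defined before it is called, no separate forward declaration is needed here.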

Examples

Let's make the functions shown at the beginning of the article.

    # pop2 ARRAY - remove (and return) the
    #   last two elements of an array
    # ex: ($a,$b) = pop2 @array;

    sub pop2 (\@) {  # proto. demands an array
      my $aref = $_[0];
      return splice(@$aref, -2);
      # or return (pop @$aref, pop @$aref);
      # (which hands the pair back in reverse order)
    }

This function looks like a regular function, and the magic of references goes on behind the scenes so you needn't send it \@array or *array like you would if you didn't use prototypes.
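A quick usage sketch shows that pop2() really does modify the caller's array. The function body is the one given above; the sample data is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub pop2 (\@) {
    my $aref = $_[0];
    return splice(@$aref, -2);    # remove and return the last two elements
}

my @array = (1 .. 5);
my @pair  = pop2 @array;          # no backslash needed at the call site

print "pair: @pair\n";
print "left: @array\n";
```

The splice() form keeps the two elements in their original order.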

As for the one_each() function, you need to use a little trick. You store a variable that only the function sees, by putting the function in its own block, and using a my() variable that can't be seen outside of the block. This variable is used to hold the current index to be used in the array lookups.

    # one_each ARRAY, ARRAY [, LIMIT] - iterate over two
    #   arrays at once, with an optional limit of iterations
    #   returns an element of each array on each call
    # ex: while (($a,$b) = one_each(@A,@B,5)) { ... }

    {
      my ($pos,$lim,$as,$bs) = (0,0,0,0);

      sub one_each (\@\@;$) {  # optional third arg
        my ($aref,$bref,$a,$b);
        ($aref,$bref) = @_;

        if ($pos == 0) {
          $lim = $_[2];
          ($as,$bs) = (scalar @$aref, scalar @$bref);
          my ($min,$max) = ($as,$bs)[$as > $bs, $as < $bs];
          $lim = $min if not defined $lim;
          $lim = $max if $lim > $max;
          $lim = int abs $lim;
        }

        # end iterations
        if ($pos == $lim) {
          ($pos,$lim,$as,$bs) = (0,0,0,0);  # reset
          return;
        }

        ($a,$b) = ($aref->[$pos], $bref->[$pos]);
        $pos++;
        return ($a,$b);
      }
    }

This function is very intelligent -- it uses four private variables: the index to retrieve ($pos), the maximum number of iterations ($lim), the size of the first array ($as), and the size of the second array ($bs). It only determines $as and $bs once, the first time. As a default, this function will only go through the arrays as many times as the shorter array has elements. Until it has gone through the prescribed (or default) number of times, it returns the next element of the two arrays it gets, and increments the position counter. When it is to stop, it resets all its private variables, and returns nothing, which evaluates as false in scalar context, and an empty list in list context.

This function is a good introduction to closures, to be explained in a future article. Closures are a way of using private data that remains in existence after it appears to go out of scope.
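The block-plus-my() trick used by one_each() can be seen in isolation in this small sketch. The functions next_id() and reset_ids() are invented for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

{
    my $count = 0;                        # visible only inside this block

    sub next_id   { return ++$count; }    # both subs share the same $count
    sub reset_ids { $count = 0; return; }
}

my @ids = (next_id(), next_id(), next_id());
reset_ids();
my $after_reset = next_id();

print "ids: @ids\n";
print "after reset: $after_reset\n";
```

Nothing outside the block can touch $count, yet it keeps its value between calls, just as $pos does in one_each().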

Traversing the Symbol Table

The following code will go through the main:: symbol table, and tell you all the scalars, arrays, hashes, and subroutines you have defined. It is paraphrased and borrowed from "Programming Perl" and the Perl documentation.

    for (sort keys %main::) {
      # do typeglob assignment, and treat main::
      # like a hash, to do a dynamic symbol lookup
      local *sym = $main::{$_};
      print "\$$_ defined\n" if defined $sym;
      print "\@$_ defined\n" if defined @sym;
      print "\%$_ defined\n" if defined %sym;
      print "\&$_ defined\n" if defined &sym;
      print "FH $_ defined\n" if defined *sym{IO};
    }

We test defined *sym{IO} for a special reason: it is a special use of the *sym{THING} syntax. Remember that it returns a reference, not the actual data. Thus, defined *sym{SCALAR} would always be true, because \$sym is always defined, because it is a reference. However, because filehandles are not top-level data types in Perl, they can be tested for definedness through a reference to the filehandle, returned via the *sym{IO} syntax. This is because *sym{IO} returns a reference to an IO::Handle object, which is how Perl stores filehandles. Dirhandles and formats are still very difficult to get at from the symbol table. Filehandles and dirhandles can be used via their glob or reference-to-glob form. Formats are a totally different story, but there does not appear to be a need to ever GET the format itself from the glob, and the glob form should work fine.

The code can be modified to start in main::, and work your way through any namespace and print all the defined variables. Quite an interesting thing to try on your own. Hint: keep track of what namespaces you've seen, so you don't keep going through main::.
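As a starting point for that exercise, here is a sketch of the traversal only: it collects the names of nested namespaces and leaves the printing of variables to you. The package Demo::Inner and the names walk(), %seen, and @found are made up for the example, and the sketch tracks each symbol table by its reference rather than by name, which is one possible way to follow the hint.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package Demo::Inner;
our $depth = 2;                       # gives the walker something to find

package main;

my %seen;                             # symbol tables already visited
my @found;                            # namespace names, in visiting order

sub walk {
    my ($pkg) = @_;                   # e.g. "main::" or "Demo::"
    no strict 'refs';                 # we look up %{"Pkg::"} by name
    my $stash = \%{$pkg};
    return if $seen{$stash}++;        # main:: contains itself; stop cycles
    push @found, $pkg;
    for my $key (sort keys %$stash) {
        next unless $key =~ /::\z/;   # keys ending in :: are namespaces
        walk($pkg eq 'main::' ? $key : $pkg . $key);
    }
    return;
}

walk('main::');
print "$_\n" for @found;
```

Inside the loop you could apply the same per-symbol tests as in the listing above to each key that does not end in ::.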

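Also, as the defined() documentation says, defined(@array) will not return what you might expect it to. Testing scalar @array might be more of what you want. In this case, though, defined(@array) is what I want to use. (Perl 5.22 and later reject defined(@array) and defined(%hash) as fatal errors; on those versions, test defined *sym{ARRAY} and defined *sym{HASH} instead.)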
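For a perl that refuses defined(@array), the slot test can be sketched with the *sym{THING} syntax instead. This assumes, as perlref describes, that *foo{THING} yields undef for a slot that has not been used yet (scalars excepted); @filled and $lonely are made-up names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

our @filled = (1, 2, 3);              # uses the array slot of *filled
our $lonely = "scalar only";          # uses only the scalar slot of *lonely

# Ask the glob whether each array slot exists at all.
my $filled_has_array = defined(*filled{ARRAY}) ? 1 : 0;
my $lonely_has_array = defined(*lonely{ARRAY}) ? 1 : 0;

print "filled has an array slot: $filled_has_array\n";
print "lonely has an array slot: $lonely_has_array\n";
```

This answers "has an array of this name ever been used?", which is a different question from "does the array hold any elements?"; for the latter, scalar @array remains the right test.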

Resources

To read about typeglobs, read perldata and perlref. Also, perlmod discusses typeglobs and symbol tables. For more on function prototypes, read the perlsub documentation.

This is all available on your computer (if you have perl) by running perldoc followed by the section name (for example, perldoc perlsub), or by going to http://language.perl.com/.