mod_perl Coding Guidelines - Part II

By: Stas Bekman


my() Scoped Variable in Nested Subroutines

Before we proceed, let's make a healthy assumption that we want to develop our code under the strict pragma and avoid global variables, using my() scoped variables whenever possible.


The Poison

Let's look at this code:

  nested.pl
  -----------
  #!/usr/bin/perl
  
  use strict;
  
  sub print_power_of_2 {
    my $x = shift;
  
    sub power_of_2 {
      return $x ** 2; 
    }
  
    my $result = power_of_2();
    print "$x^2 = $result\n";
  }
  
  print_power_of_2(5);
  print_power_of_2(6);

Don't let the odd subroutine names fool you: the print_power_of_2() subroutine should print the square of the number passed to it. Let's call it with 5 and 6:

  print_power_of_2(5);
  print_power_of_2(6);

And run it:

  % ./nested.pl
  
  5^2 = 25
  6^2 = 25

Ouch, something is wrong. Maybe there is a bug in Perl and it doesn't work correctly with the number 6? Let's try again using 5 and 7:

  print_power_of_2(5);
  print_power_of_2(7);

And run it:

  % ./nested.pl
  
  5^2 = 25
  7^2 = 25

Wow, does it work only for 5? How about using 3 and 5:

  print_power_of_2(3);
  print_power_of_2(5);

and the result is:

  % ./nested.pl
  
  3^2 = 9
  5^2 = 9

Now we start to understand: only the first call to the print_power_of_2() function works correctly. It looks as if our code remembers the result of its first execution and ignores the arguments of all subsequent executions.


The Diagnosis

Let's follow the guidelines and use the -w flag. Now execute the code:

  % ./nested.pl
  
  Variable "$x" will not stay shared at ./nested.pl line 9.
  5^2 = 25
  6^2 = 25

We have never seen such a warning message before and we don't quite understand what it means. The diagnostics pragma will certainly help us. Let's add this pragma before the strict pragma in our code:

  #!/usr/bin/perl -w
  
  use diagnostics;
  use strict;

And execute it:

  % ./nested.pl
  
  Variable "$x" will not stay shared at ./nested.pl line 10 (#1)
    
    (W) An inner (nested) named subroutine is referencing a lexical
    variable defined in an outer subroutine.
    
    When the inner subroutine is called, it will probably see the value of
    the outer subroutine's variable as it was before and during the
    *first* call to the outer subroutine; in this case, after the first
    call to the outer subroutine is complete, the inner and outer
    subroutines will no longer share a common value for the variable.  In
    other words, the variable will no longer be shared.
    
    Furthermore, if the outer subroutine is anonymous and references a
    lexical variable outside itself, then the outer and inner subroutines
    will never share the given variable.
    
    This problem can usually be solved by making the inner subroutine
    anonymous, using the sub {} syntax.  When inner anonymous subs that
    reference variables in outer subroutines are called or referenced,
    they are automatically rebound to the current values of such
    variables.
    
  5^2 = 25
  6^2 = 25

Well, now everything is clear. We have the inner subroutine power_of_2() and the outer subroutine print_power_of_2() in our code.

When the inner power_of_2() subroutine is called for the first time, it sees the value of the outer print_power_of_2() subroutine's $x variable. On subsequent calls the inner subroutine's $x is not updated, no matter what value $x has in the outer subroutine. That's why the warning says that the $x variable will no longer be shared.


The Remedy

The diagnostics pragma suggests using an anonymous subroutine (also known as a closure). Let's rewrite the code to use this technique instead:

  anonymous.pl
  --------------
  #!/usr/bin/perl
  
  use strict;
  
  sub print_power_of_2 {
    my $x = shift;
  
    my $func_ref = sub {
      return $x ** 2;
    };
  
    my $result = $func_ref->();
    print "$x^2 = $result\n";
  }
  
  print_power_of_2(5);
  print_power_of_2(6);

Now $func_ref contains a reference to an anonymous function, which we later call when we need to get the power of two. Since the anonymous function is generated afresh every time print_power_of_2() is called, the correct answer will be given. Let's verify:

  % ./anonymous.pl
  
  5^2 = 25
  6^2 = 36

Indeed, it worked correctly as advertised.
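
To see that every anonymous subroutine is bound to its own $x, here is a minimal sketch that returns the closure instead of calling it right away. The make_power_of_2() helper is a hypothetical name introduced only for this illustration; it is not part of the scripts above.

```perl
#!/usr/bin/perl -w
use strict;

# make_power_of_2() is a hypothetical helper used only for illustration.
# Each call creates a new $x and a new anonymous sub bound to that $x.
sub make_power_of_2 {
    my $x = shift;
    return sub { return $x ** 2 };
}

my $square_of_5 = make_power_of_2(5);
my $square_of_6 = make_power_of_2(6);

# Each closure still remembers the $x it was created with.
print "5^2 = ", $square_of_5->(), "\n";
print "6^2 = ", $square_of_6->(), "\n";
```

Both closures stay alive at the same time, yet each one reports the square of its own argument, which is exactly the behavior the named inner subroutine failed to deliver.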


When You Cannot Get Rid of the Inner Subroutine

First you might wonder why in the world someone would need to define an inner subroutine. Suppose, for example, that to reduce the startup overhead of Perl scripts you decide to write a daemon that compiles the scripts and modules only once and keeps the pre-compiled code cached in memory. When a script has to be executed, you just tell the daemon the name of the script to run and it does the rest.

Seems like an easy task, and it is. The only problem is: once the script is compiled, how do you execute it? Or, to put it another way: after it has been executed for the first time and stays compiled in the daemon's memory, how do you call it again? If you could require developers to code each script so that it has a subroutine called run() that actually executes the code in the script, you would have half of the problem solved.

But how does the daemon know how to refer to a specific script if they all run in the main:: name space? An obvious solution is to ask the developers to declare a package in each and every script, with the package name derived from the script name. Moreover, since there is a chance that more than one script will have the same name but reside in different directories, the directory has to be a part of the package name in order to prevent name-space collisions. And don't forget that a script can be moved from directory to directory, so you will have to make sure that the package name is corrected every time the script gets moved.

But why enforce these strange rules on developers when we can arrange for our daemon to do this work? Every script that the daemon is about to execute for the first time should be wrapped inside a package whose name is constructed from the mangled path to the script, and inside a subroutine called run(). For example, if the daemon is about to execute the script /tmp/hello.pl:

  hello.pl
  --------
  #!/usr/bin/perl
  print "Hello\n";

Prior to running it, the daemon will change the code to be:

  wrapped_hello.pl
  ----------------
  package cache::tmp::hello_2epl;
  
  sub run{
    #!/usr/bin/perl 
    print "Hello\n";
  }

Here the package name is constructed from the prefix cache::, each directory separator slash is replaced with ::, and characters that are not allowed in a package name are encoded as an underscore followed by their hexadecimal ASCII code, so the . becomes _2e.
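
The name mangling can be sketched in a few lines. This is only an illustration under stated assumptions: script_to_package() is a hypothetical helper, and the choice of which characters to leave unencoded (letters, digits and the underscore) is an assumption that happens to reproduce the package name shown above.

```perl
#!/usr/bin/perl -w
use strict;

# script_to_package() is a hypothetical helper; the cache:: prefix and the
# _XX hex encoding follow the scheme described in the text.
sub script_to_package {
    my $path = shift;

    # Encode every character that is not a letter, digit, underscore or
    # slash as an underscore followed by its two-digit hex code.
    $path =~ s/([^A-Za-z0-9_\/])/sprintf("_%02x", ord($1))/eg;

    # Drop the leading slash and turn every remaining directory
    # separator into a package separator.
    $path =~ s{^/+}{};
    $path =~ s{/+}{::}g;

    return "cache::$path";
}

print script_to_package("/tmp/hello.pl"), "\n";   # cache::tmp::hello_2epl
```

Because the directory is part of the name, two scripts called hello.pl in different directories end up in different packages.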

Now when the daemon is requested to execute the script /tmp/hello.pl, all it has to do is build the package name as before, based on the location of the script, and call its run() subroutine. The package is already compiled in the daemon's memory, so there is no file to use():

  cache::tmp::hello_2epl::run();

We have just written a partial prototype of the daemon we desired. The only thing left undefined is how to pass the path of the script to the daemon. This detail is left to the reader as an exercise.
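
The compile-once, run-many idea might be sketched as follows. This is not the code of any real daemon: run_cached() and %compiled are hypothetical names, and the script source is passed in as a string so that the sketch needs no files on disk.

```perl
#!/usr/bin/perl -w
use strict;

my %compiled;   # package names whose run() has already been compiled

# run_cached() is a hypothetical helper, not part of any real daemon.
sub run_cached {
    my ($package, $source) = @_;

    unless ($compiled{$package}) {
        # Wrap the script in its own package and a run() subroutine,
        # then compile the result once with a string eval.
        my $wrapped = "package $package;\nsub run {\n$source\n}\n1;\n";
        eval $wrapped or die "Failed to compile $package: $@";
        $compiled{$package} = 1;
    }

    # Look up the already compiled run() subroutine and call it.
    return $package->can('run')->();
}

my $source = 'print "Hello\n";';

# The script is compiled on the first request only; the second request
# reuses the cached run() subroutine.
run_cached("cache::tmp::hello_2epl", $source) for 1 .. 2;
```

Note that any named subroutine inside $source ends up nested inside run(), which is how ordinary subroutines can turn into inner ones.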

If you are familiar with the Apache::Registry module, you know that it works in almost the same way. It uses a different package prefix, and the generic function is called handler() and not run(). The scripts to run are specified through the HTTP request.

Now you understand that there are cases where your normal subroutines can become inner ones. If your script was as simple as:

  simple.pl
  ---------
  #!/usr/bin/perl 
  sub hello { print "Hello" }
  hello();

Wrapped into a run() subroutine it becomes:

  simple.pl
  ---------
  package cache::simple_2epl;
  
  sub run{
    #!/usr/bin/perl 
    sub hello { print "Hello" }
    hello();
  }

Therefore hello() is an inner subroutine, and if it uses my() scoped variables that are defined and altered outside of it, it won't work correctly from the second call onwards, as was explained in the previous section.


Remedies for Inner Subroutines

First of all, there is nothing to worry about: if you do happen to have the ``my() scoped variable in an inner subroutine'' problem, Perl will alert you, as long as you don't forget to turn warnings on.

Suppose you have a script that has this problem. What are the ways to solve it? There are many of them, and we will discuss some of them here.

We will use the following code to show the different solutions.

  multirun.pl
  -----------
  #!/usr/bin/perl -w
  
  use strict;
  
  for (1..3){
    print "run: [time $_]\n";
    run();
  }
  
  sub run {
  
    my $counter = 0;
  
    increment_counter();
    increment_counter();
  
    sub increment_counter{
      $counter++;
      print "Counter is equal to $counter !\n";
    }
  
  } # end of sub run

This code executes the run() subroutine three times. Each time, run() initializes the $counter variable to 0 and then calls the inner increment_counter() subroutine twice; increment_counter() prints $counter's value after incrementing it. One might expect to see the following output:

  run: [time 1]
  Counter is equal to 1 !
  Counter is equal to 2 !
  run: [time 2]
  Counter is equal to 1 !
  Counter is equal to 2 !
  run: [time 3]
  Counter is equal to 1 !
  Counter is equal to 2 !

But as we have already learned from the previous sections, this is not what we are going to see. Indeed, when we run the script we see:

  % ./multirun.pl

  Variable "$counter" will not stay shared at ./multirun.pl line 18.
  run: [time 1]
  Counter is equal to 1 !
  Counter is equal to 2 !
  run: [time 2]
  Counter is equal to 3 !
  Counter is equal to 4 !
  run: [time 3]
  Counter is equal to 5 !
  Counter is equal to 6 !

Obviously, the $counter variable that increment_counter() sees is not reinitialized on each run() execution. The inner subroutine stays bound to the $counter created by the first call to run(), so it preserves the value from the previous execution and increments it to the next value.

One of the workarounds is to use globally declared variables, with the vars pragma.

  multirun1.pl
  -----------
  #!/usr/bin/perl -w
  
  use strict;
  use vars qw($counter);
  
  for (1..3){
    print "run: [time $_]\n";
    run();
  }
  
  sub run {
  
    $counter = 0;
  
    increment_counter();
    increment_counter();
  
    sub increment_counter{
      $counter++;
      print "Counter is equal to $counter !\n";
    }
  
  } # end of sub run

If you run this and the other solutions offered below, the expected output will be generated:

  % ./multirun1.pl
  
  run: [time 1]
  Counter is equal to 1 !
  Counter is equal to 2 !
  run: [time 2]
  Counter is equal to 1 !
  Counter is equal to 2 !
  run: [time 3]
  Counter is equal to 1 !
  Counter is equal to 2 !

By the way, the warning we saw before has gone, and so has the problem, since there is no my() (lexically defined) variable used in the nested subroutine.

Another approach is to use fully qualified variables. This is better, since slightly less memory will be used (the vars module doesn't need to be loaded), but it adds a typing overhead:

  multirun2.pl
  -----------
  #!/usr/bin/perl -w
  
  use strict;
  
  for (1..3){
    print "run: [time $_]\n";
    run();
  }
  
  sub run {
  
    $main::counter = 0;
  
    increment_counter();
    increment_counter();
  
    sub increment_counter{
      $main::counter++;
      print "Counter is equal to $main::counter !\n";
    }
  
  } # end of sub run

You can also pass the variable to the subroutine by value and make the subroutine return it after it has been updated. This adds time and memory overhead, so it's not a good idea if the variable can be very large.

Don't rely on the fact that the variable is small during the development of the application; it can grow quite big in situations you didn't expect. For example, a very simple HTML form text entry field can return a few megabytes of data if one of your users is bored and wants to test how good your code is. It's not uncommon to see users copy and paste core dump files of 10MB in size into a form's text field and submit it for your script to process.

  multirun3.pl
  -----------
  #!/usr/bin/perl -w
  
  use strict;
  
  for (1..3){
    print "run: [time $_]\n";
    run();
  }
  
  sub run {
  
    my $counter = 0;
  
    $counter = increment_counter($counter);
    $counter = increment_counter($counter);
  
    sub increment_counter{
      my $counter = shift || 0 ;
  
      $counter++;
      print "Counter is equal to $counter !\n";
  
      return $counter;
    }
  
  } # end of sub run

Finally, you can use references to do the job. increment_counter() accepts a reference to the $counter variable and increments its value after first dereferencing it. The $counter variable outside is affected by this change as well.

  multirun4.pl
  -----------
  #!/usr/bin/perl -w
  
  use strict;
  
  for (1..3){
    print "run: [time $_]\n";
    run();
  }
  
  sub run {
  
    my $counter = 0;
  
    increment_counter(\$counter);
    increment_counter(\$counter);
  
    sub increment_counter{
      my $r_counter = shift;  # a reference to the caller's $counter
  
      $$r_counter++;
      print "Counter is equal to $$r_counter !\n";
    }
  
  } # end of sub run

Here is yet another, even more obscure, way to use references. We modify the value of $counter inside the subroutine by using the fact that the elements of @_ are actually aliases for the variables passed in, so if you directly modify one of the elements of the array, the actual value of the passed variable gets changed.

  multirun5.pl
  -----------
  #!/usr/bin/perl -w
  
  use strict;
  
  for (1..3){
    print "run: [time $_]\n";
    run();
  }
  
  sub run {
  
    my $counter = 0;
  
    increment_counter($counter);
    increment_counter($counter);
  
    sub increment_counter{
      $_[0]++;
      print "Counter is equal to $_[0] !\n";
    }
  
  } # end of sub run

Now you have at least five workarounds to choose from.
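
The closure remedy from the earlier section can also be applied to this script. The following is only a sketch: the $increment_counter variable is a name chosen for this illustration, and run() returns the final counter value purely to make the result easy to check.

```perl
#!/usr/bin/perl -w
use strict;

for (1..3){
  print "run: [time $_]\n";
  run();
}

sub run {

  my $counter = 0;

  # The anonymous sub is created anew on every call to run(), so it is
  # always bound to the current $counter.
  my $increment_counter = sub {
    $counter++;
    print "Counter is equal to $counter !\n";
    return $counter;
  };

  $increment_counter->();
  return $increment_counter->();

} # end of sub run
```

This version keeps $counter lexical, but it only works if you are free to change every call of the inner subroutine, which is not always the case for existing scripts.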

For more information please refer to the perlref and perlsub manpages.


use(), require(), do(), %INC and @INC Explained


The @INC array

@INC is a special Perl variable which is the equivalent of the shell's PATH variable. While PATH contains a list of directories in which executables are looked up, @INC contains a list of directories from which Perl modules and libraries can be loaded.

When you use(), require() or do() a filename or a module, Perl gets the list of directories from the @INC variable and searches them for the file it was requested to load. If the file that you want to load is not located in one of the listed directories, you have to tell Perl where to find the file, by providing either a path relative to one of the directories in @INC or a full path to the file.


The %INC hash

%INC is another special Perl variable, used to cache the names of the files and the modules that were successfully loaded and compiled by the use(), require() or do() functions. Before attempting to load a file or a module with use() or require(), Perl checks whether it's already in the %INC hash. If it's there, the loading, and therefore the compilation of the loaded code, is not performed at all. Otherwise the file is loaded into memory and an attempt is made to compile it.

If the file is successfully loaded and compiled, a new key-value pair is added to %INC. The key is the name of the file or module as it was passed to one of the three functions we have just mentioned. The value is the full path to the file in the file system if it was found in any of the @INC directories except ".", and the path relative to the current directory if it was found in ".".

The following examples will make this logic easier to understand.

First, let's see the contents of @INC on my system:

  % perl -e 'print join "\n", @INC'
  /usr/lib/perl5/5.00503/i386-linux
  /usr/lib/perl5/5.00503
  /usr/lib/perl5/site_perl/5.005/i386-linux
  /usr/lib/perl5/site_perl/5.005
  .

Notice the . (current directory) as the last directory in the list.

Now let's load the module strict.pm and see the contents of %INC:

  % perl -e 'use strict; print map {"$_ => $INC{$_}\n"} keys %INC'
  
  strict.pm => /usr/lib/perl5/5.00503/strict.pm

Since strict.pm was found in the /usr/lib/perl5/5.00503/ directory and /usr/lib/perl5/5.00503/ is a part of @INC, %INC includes the full path as the value for the key strict.pm.

Now let's create the simplest module in /tmp/test.pm:

  test.pm
  -------
  1;

It does nothing, but returns a true value when loaded. Now let's load it in different ways:

  % cd /tmp
  % perl -e 'use test; print map {"$_ => $INC{$_}\n"} keys %INC'
  
  test.pm => test.pm

Since the file was found relative to . (the current directory), the relative path is inserted as the value. But if we alter @INC by adding /tmp to the end:

  % cd /tmp
  % perl -e 'BEGIN{push @INC, "/tmp"} use test; \
  print map {"$_ => $INC{$_}\n"} keys %INC'
  
  test.pm => test.pm

we still get the relative path, since the module was found first relative to ".", because /tmp comes after . in the list. But if we execute the same code from a different directory, so that the "." directory doesn't match:

  % cd /
  % perl -e 'BEGIN{push @INC, "/tmp"} use test; \
  print map {"$_ => $INC{$_}\n"} keys %INC'
  
  test.pm => /tmp/test.pm

we get the full path. We can also prepend the path with unshift(), so it will be used for matching before "." and therefore we get the full path as well:

  % cd /tmp
  % perl -e 'BEGIN{unshift @INC, "/tmp"} use test; \
  print map {"$_ => $INC{$_}\n"} keys %INC'
  
  test.pm => /tmp/test.pm

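The BEGIN block from the last example:

  BEGIN{unshift @INC, "/tmp"}

can be replaced with the more elegant:

  use lib "/tmp";

which is almost equivalent to the BEGIN block above. In addition, use lib adds the matching architecture-specific subdirectory to @INC if one exists.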

These approaches to modifying @INC can be labor intensive, since if you want to move the script around in the file-system you have to modify the path. This can be painful, for example, when you move your scripts from development to a production server.

There is a FindBin module which solves this problem in the plain Perl world, but unfortunately it doesn't work correctly under mod_perl.

If you use this module, you don't need to write a hard coded path. The following snippet does all the work for you (the file is /tmp/load.pl):

  load.pl
  -------
  #!/usr/bin/perl
  
  use FindBin ();
  use lib "$FindBin::Bin";
  use test;
  print "test.pm => $INC{'test.pm'}\n";

In the above example $FindBin::Bin equals /tmp. If we move the script somewhere else, e.g. to /tmp/x, $FindBin::Bin will equal /tmp/x instead.

  % /tmp/load.pl
  
  test.pm => /tmp/test.pm

Just like with use lib, but no hard coded path is required.

As I've mentioned earlier, FindBin will not work in the mod_perl environment, since it's a module, and like any module it's loaded only once. So the first script using it will have all the settings correct, but the rest of the scripts will not if they are located in a different directory from the first one.


Modules, Libraries and Files

Before we proceed, let's define what we mean by module, library and file.


require()

require() reads a file containing Perl code and compiles it. Before attempting to load the file, it looks up its argument in %INC to see whether it has already been loaded. If it has, require() just returns without doing a thing. Otherwise an attempt is made to load and compile the file.

require() has to find the file it has to load. If the argument is a full path to the file, it just tries to read it. For example:

  require "/home/httpd/perl/mylibs.pl";

If the path is relative, require() will search for the file in all the directories listed in @INC. For example:

  require "mylibs.pl";

If there is more than one occurrence of a file with the same name in the directories listed in @INC, the first occurrence will be used.

The file must return TRUE as the last statement to indicate successful execution of any initialization code. Since you never know what changes the file will go through in the future, you cannot be sure that the last statement will always return TRUE. That's why the suggestion is to put ``1;'' at the end of the file.

You should use the real filename for most files, but if the file is a module, you may use the following convention instead:

  require My::Module;

This is equivalent to:

  require "My/Module.pm";

If require() fails to load the file, either because it couldn't find the file in question, or because the code failed to compile or didn't return TRUE at the end, the program will die(). To prevent this, the require() statement can be enclosed in an eval() block, as in this example:

  require.pl
  ----------
  #!/usr/bin/perl -w
  
  eval { require "/file/that/does/not/exists"};
  if ($@) {
    print "Failed to load, because : $@"
  }
  print "\nHello\n";

When we execute the program:

  % ./require.pl
  
  Failed to load, because : Can't locate /file/that/does/not/exists in
  @INC (@INC contains: /usr/lib/perl5/5.00503/i386-linux
  /usr/lib/perl5/5.00503 /usr/lib/perl5/site_perl/5.005/i386-linux
  /usr/lib/perl5/site_perl/5.005 .) at require.pl line 3.
  
  Hello

We see that the program didn't die(), because Hello was printed. This trick is useful when you want to check whether a user has some module installed. If she hasn't, it's not critical: perhaps the program can run without this module, with reduced functionality.
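
The check for an optional module can be wrapped in a small helper. This is a sketch under stated assumptions: have_module() is a hypothetical helper, File::Spec stands in for a module that is installed, and No::Such::Module::Installed is a made-up name that is assumed not to exist on your system.

```perl
#!/usr/bin/perl -w
use strict;

# have_module() is a hypothetical helper: it returns 1 if the named
# module can be loaded and 0 otherwise, without die()ing.
sub have_module {
    my $module = shift;
    (my $file = "$module.pm") =~ s{::}{/}g;   # My::Module -> My/Module.pm
    return eval { require $file; 1 } ? 1 : 0;
}

# No::Such::Module::Installed is a made-up name, assumed not to exist.
for my $module (qw(File::Spec No::Such::Module::Installed)) {
    printf "%-28s %s\n", $module,
        have_module($module) ? "available" : "missing, feature disabled";
}
```

The program can then enable or disable the related functionality depending on the value returned, instead of die()ing at startup.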

If we remove the eval() part and try again:

  require1.pl
  -----------
  #!/usr/bin/perl -w
  
  require "/file/that/does/not/exists";
  print "\nHello\n";

  % ./require1.pl
  
  Can't locate /file/that/does/not/exists in @INC (@INC contains:
  /usr/lib/perl5/5.00503/i386-linux /usr/lib/perl5/5.00503
  /usr/lib/perl5/site_perl/5.005/i386-linux
  /usr/lib/perl5/site_perl/5.005 .) at require1.pl line 3.

The program just die()s in the last example, which is what you want in most of the cases.

For more information refer to the perlfunc manpage.


use()

use(), just like require(), loads and compiles files containing Perl code, but it works with modules only. The only way to pass a module to load is by its name, not by its filename. If the module is located in MyCode.pm, the correct way to use() it is:

  use MyCode;

and not:

  use "MyCode.pm";

use() translates the passed argument into a file name by replacing :: with / and appending .pm at the end. So My::Module becomes My/Module.pm.

use() is exactly equivalent to:

  BEGIN { require Module; import Module LIST; }

Internally it calls require() to do the loading and compilation chores. When require() finishes its job, import() is called, unless () is the second argument. The following pairs are equivalent:

  use MyModule;
  BEGIN {require MyModule; import MyModule; }
  
  use MyModule qw(foo bar);
  BEGIN {require MyModule; import MyModule ("foo","bar"); }
  
  use MyModule ();
  BEGIN {require MyModule; }

When no parameters are passed to import(), it imports the default symbols, if any were defined inside the module. import() is not a builtin function: it's just an ordinary static method call into the ``MyModule'' package to tell the module to import the list of features back into the current package. See the Exporter manpage for more information.

There's a corresponding ``no'' command that un-imports symbols imported by use(), i.e., it calls unimport Module LIST instead of import().
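
To watch use() call import(), here is a small sketch with a module defined inline. My::Demo and @import_calls are hypothetical names made up for this illustration, and setting %INC by hand is only a trick to make require() believe that the module has already been loaded from a file. The sketch assumes a Perl recent enough to support our.

```perl
#!/usr/bin/perl -w
use strict;

our @import_calls;   # one entry per import() call

BEGIN {
    # My::Demo is a hypothetical module defined inline for illustration.
    package My::Demo;

    # Record the arguments of every import() call (minus the class name).
    sub import { my $class = shift; push @main::import_calls, [@_]; }

    # Mark the module as loaded so that require() doesn't look for a file.
    $INC{'My/Demo.pm'} = __FILE__;
}

use My::Demo;               # calls My::Demo->import()
use My::Demo qw(foo bar);   # calls My::Demo->import("foo", "bar")
use My::Demo ();            # loads the module, never calls import()

print "import() was called ", scalar(@import_calls), " times\n";
print "arguments: (@$_)\n" for @import_calls;
```

Three use() statements result in only two import() calls, because the empty list in the last one suppresses the call.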


do()

While do() behaves almost identically to require(), it reloads the file unconditionally. It doesn't check %INC to see whether the file was already loaded.

If do() cannot read the file, it returns undef and sets $! to report the error. If do() can read the file but cannot compile it, it returns undef and sets an error message in $@. If the file is successfully compiled, do() returns the value of the last expression evaluated.
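
The difference between require() and do() can be demonstrated with a throwaway library file. This is a sketch under stated assumptions: counter_lib.pl and the $loads counter are hypothetical names, and File::Temp is assumed to be available (it ships with Perl versions newer than the one used in this article).

```perl
#!/usr/bin/perl -w
use strict;
use File::Temp qw(tempdir);
use File::Spec;

our $loads = 0;   # incremented by the library each time it is executed

# Create a tiny throwaway library (hypothetical name counter_lib.pl).
my $dir  = tempdir(CLEANUP => 1);
my $file = File::Spec->catfile($dir, "counter_lib.pl");
open my $fh, ">", $file or die "Cannot write $file: $!";
print $fh '$main::loads++; 1;', "\n";
close $fh or die "Cannot close $file: $!";

# require() loads the file once, then finds it in %INC and skips it.
for (1 .. 3) { require $file }
print "after require x3: loads = $loads\n";

# do() reloads the file unconditionally every time.
for (1 .. 3) { do $file }
print "after do x3:      loads = $loads\n";
```

The counter stands at 1 after the three require() calls and at 4 after the three do() calls.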


Next month

Next month we will get to the real stuff and start looking at how one should code in mod_perl: which techniques should be deployed and which should be avoided.