In the beginning there were two servers: a plain apache server, which was very light and configured to serve static objects, and a mod_perl enabled server, which was very heavy and aimed to serve mod_perl scripts. We named them httpd_docs and httpd_perl respectively. The two servers coexisted at the same IP (DNS) by listening on different ports: 80 for httpd_docs (e.g. http://www.nowhere.com/images/test.gif) and 8080 for httpd_perl (e.g. http://www.nowhere.com:8080/perl/test.pl). Note that I did not write http://www.nowhere.com:80 for the first example, since port 80 is the default http port. (Later on, I will be moving the httpd_docs server to port 81.)
Now I am going to convince you that you want to use a proxy server (in the http accelerator mode). The advantages are:
It allows static objects to be served from the proxy's cache (objects that previously were served entirely by the httpd_docs server).
You get less I/O activity reading static objects from the disk (the proxy serves the most ``popular'' objects from RAM, and of course you benefit more if you allow the proxy server to consume more RAM). Since you do not wait for the I/O to complete, you are able to serve the static objects much faster.
The proxy server acts as a sort of output buffer for the dynamic content. The mod_perl server sends the entire response to the proxy and is then free to deal with other requests. The proxy server is responsible for sending the response to the browser. So if the transfer is over a slow link, the mod_perl server is not waiting around for the data to move.
Using numbers is always more convincing :) Let's take a user connected to your site with a 28.8 kbps (bps == bits/sec) modem. It means that the speed of the user's link is 28.8/8 = 3.6 kbytes/sec. I assume an average generated HTML page to be 20kb (kb == kilobytes) and an average script to generate this output in 0.5 secs. How long will the server wait before the user gets the whole response? A simple calculation reveals pretty scary numbers: it will have to wait for about another 6 secs (20kb/3.6kb), when it could serve another 12 (6/0.5) dynamic requests in this time. This very simple example shows us that with a buffering proxy we need only one twelfth the number of children running, which means you will need only one twelfth of the memory (which is not quite true because some parts of the code are shared). But you know that nowadays scripts return pages that are sometimes blown up with javascript code and the like, which makes them 100kb in size and the download time... (This calculation is left to you as an exercise :)
To make your estimate of the download time even worse, let me remind you that many users like to open many browser windows and do many things at once (download files and browse heavy sites). So the effective speed may often be 5-10 times slower than the 3.6kb/sec we assumed before.
Also we are going to hide the details of the server's implementation. Users will never see ports in the URLs. And you can have a few boxes serving the requests, with only one serving as a front end, which spreads the jobs between the servers in the way you configured it to. So you can actually take one server down for an upgrade, and the end user will never notice, because the front end server will dispatch the jobs to the other servers. (Of course this is a pretty big issue, and it will not be discussed in the scope of this document.)
For security reasons, using any httpd accelerator (or a proxy in httpd accelerator mode) is essential, because you do not let your internal server get directly attacked by arbitrary packets from whomever. The httpd accelerator and the internal server communicate only through expected HTTP requests. This means that only your public ``bastion'' accelerating www server gets hosed in a successful attack, while your internal data stays safe.
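To put numbers on the buffering argument above, here is a minimal Perl sketch of the modem arithmetic. It uses the 28.8 kbps link, the 0.5 secs generation time and the 20kb page from the division above; the function names are made up for this illustration, and you can plug in your own page sizes.

```perl
use strict;
use warnings;

# Seconds needed to push $size_kb kilobytes over a link of $link_kbps kilobits/sec.
sub transfer_seconds {
    my ($size_kb, $link_kbps) = @_;
    my $kbytes_per_sec = $link_kbps / 8;    # 28.8 kbps -> 3.6 kbytes/sec
    return $size_kb / $kbytes_per_sec;
}

# How many requests the server could have generated while it waited
# for the slow client, if each request takes $gen_secs to generate.
sub requests_lost {
    my ($size_kb, $link_kbps, $gen_secs) = @_;
    return transfer_seconds($size_kb, $link_kbps) / $gen_secs;
}

printf "%.1f secs on the wire, %.1f requests' worth of waiting\n",
    transfer_seconds(20, 28.8), requests_lost(20, 28.8, 0.5);
```

The exact figures come out a little under the rounded 6 secs and 12 requests used in the text, which does not change the conclusion.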
The disadvantages are:
Of course there are drawbacks. Luckily, these are not functionality drawbacks, but more of an administration hassle. You add another daemon to worry about, and while proxies are generally stable, you have to make sure to prepare proper startup and shutdown scripts that run at boot and reboot. You may also want a watchdog script run from crontab.
Proxy servers can be configured to be light or heavy; the admin must decide what gives the highest performance for the application. A proxy server like squid is light in the sense of having only one process serving all requests. But it can appear pretty heavy when it loads objects into memory for faster service.
Have I succeeded in convincing you that you want the proxy server?
If you are on a local area network (LAN), then the big benefit of the proxy buffering the output and feeding a slow client is gone. You are probably better off sticking with a straight mod_perl server in this case.
As of this writing, two proxy implementations are known to be used together with mod_perl: the squid proxy server and mod_proxy, which is a part of the apache server. This month we will talk about apache's mod_proxy.
I do not think the difference in speed between apache's mod_proxy and squid is relevant for most sites, since the real value of what they do is buffering for slow client connections. However, squid runs as a single process and probably consumes fewer system resources. The trade-off is that mod_rewrite is easy to use if you want to spread parts of the site across different back end servers, and mod_proxy knows how to fix up redirects containing the back-end server's idea of the location. With squid you can run a redirector process to proxy to more than one back end, but there is a problem in fixing redirects in a way that keeps the client's view of both server names and port numbers in all cases. The difficult case is where you have several DNS aliases that map to the same IP address and you want the redirect to use port 80 (when the server is really on a different port) but to keep the specific name the browser sent, so that it does not change in the client's Location window.
The Advantages:
No additional server is needed. We keep the one plain plus one mod_perl enabled apache servers. All you need is to enable mod_proxy in the httpd_docs server and add a few lines to the httpd.conf file.
ProxyPass and ProxyPassReverse directives allow you to hide the internal redirects, so if http://nowhere.com/modperl/ is actually http://localhost:81/modperl/, it will be absolutely transparent to the user. ProxyPass passes the request on to the mod_perl server, and when it gets the response, ProxyPassReverse rewrites the URL in any redirect back to the original one, e.g.:

    ProxyPass        /modperl/ http://localhost:81/modperl/
    ProxyPassReverse /modperl/ http://localhost:81/modperl/

It does mod_perl output buffering.
It even does caching. You have to produce correct Content-Length, Last-Modified and Expires http headers for it to work. If some dynamic content does not change constantly, you can dramatically increase performance by caching it in the proxy.
ProxyPass happens before the authentication phase, so you do not have to worry about
authenticating twice.
Apache is able to accelerate https (secure) requests completely, while also doing http acceleration (with squid you have to use an external redirection program for that).
The latest (from apache 1.3.6) Apache proxy accelerator mode is reported to be very stable.
The Disadvantages:
Users reported that it might be a bit slow, but the latest version is fast enough. (How fast is enough? :)
To build it into apache just add --enable-module=proxy during the apache configure stage.
Now we will talk about apache's mod_proxy and understand how it works.
The server on port 80 answers http requests directly and proxies the mod_perl enabled server in the following way:

    ProxyPass        /modperl/ http://localhost:81/modperl/
    ProxyPassReverse /modperl/ http://localhost:81/modperl/

ProxyPassReverse is the saving grace here, and it is what makes apache a win over squid: it rewrites the redirect on its way back to the original URI.
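To make the rewriting concrete, here is a small Perl sketch of the idea: a redirect issued by the back-end carries the back-end's own address, and the front end swaps that prefix for the public one. This is only an illustration of the concept, not mod_proxy's actual code; the function name and the login.pl URL are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: map a back-end Location header back to the public URL.
sub reverse_location {
    my ($location, $backend_prefix, $public_prefix) = @_;

    # Only touch URLs that really start with the back-end prefix.
    if (index($location, $backend_prefix) == 0) {
        return $public_prefix . substr($location, length $backend_prefix);
    }
    return $location;    # anything else passes through unchanged
}

print reverse_location(
    'http://localhost:81/modperl/login.pl',
    'http://localhost:81/modperl/',
    'http://www.nowhere.com/modperl/',
), "\n";
# prints http://www.nowhere.com/modperl/login.pl
```

Without this step the browser would be redirected to localhost:81, which means nothing outside the server box.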
You can control the buffering feature with the ProxyReceiveBufferSize directive:

    ProxyReceiveBufferSize 1048576

The above setting sets the buffer size to 1MB. If it is not set explicitly, the default buffer size is used, which depends on the OS; for Linux I suspect it is somewhere below 32k. So, to release the mod_perl server immediately instead of leaving it waiting for the client, ProxyReceiveBufferSize should be set to a value greater than the biggest response generated by any mod_perl script.
The ProxyReceiveBufferSize directive specifies an explicit buffer size for outgoing HTTP and FTP connections. It has to be greater than 512 or set to 0 to
indicate that the system's default buffer size should be used.
As the name states, its buffering feature applies only to downstream data (coming from the origin server to the proxy) and not to upstream data (i.e. it does not buffer the data being uploaded from the client browser to the proxy, which would free the httpd_perl origin server from being tied up during a large POST such as a file upload).
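The two rules above (the value must be 0 or greater than 512, and it should exceed your biggest response) are easy to check with a few lines of Perl. This is a rough sketch only: the function names are made up, the response sizes are invented sample numbers, and rounding up to a multiple of 1024 is just a convenience, not something apache requires.

```perl
use strict;
use warnings;

# True if apache would accept this value for ProxyReceiveBufferSize:
# 0 (use the system default) or anything greater than 512.
sub is_valid_buffer_size {
    my ($bytes) = @_;
    return $bytes == 0 || $bytes > 512;
}

# Smallest multiple of 1024 that is strictly greater than the
# biggest response size (in bytes) passed in.
sub suggest_buffer_size {
    my @response_sizes = @_;
    my $max = 0;
    for my $size (@response_sizes) {
        $max = $size if $size > $max;
    }
    return (int($max / 1024) + 1) * 1024;
}

# Hypothetical response sizes measured from your own scripts.
my $size = suggest_buffer_size(20_000, 100_000, 900_000);
print "ProxyReceiveBufferSize $size\n" if is_valid_buffer_size($size);
```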
Apache does caching as well. It's relevant to mod_perl only if you produce proper headers, so your scripts' output can be cached. See apache documentation for more details on configuration of this capability.
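As an illustration of what ``proper headers'' means, here is a minimal Perl sketch that builds the Content-Length, Last-Modified and Expires headers for a generated page. The helper names, the one hour age and the ten minute lifetime are assumptions made for the example; how you actually send the headers depends on your handler or script setup.

```perl
use strict;
use warnings;

my @DAYS   = qw(Sun Mon Tue Wed Thu Fri Sat);
my @MONTHS = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

# Format an epoch time as an HTTP date, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
sub http_date {
    my ($epoch) = @_;
    my ($sec, $min, $hour, $mday, $mon, $year, $wday) = gmtime($epoch);
    return sprintf '%s, %02d %s %04d %02d:%02d:%02d GMT',
        $DAYS[$wday], $mday, $MONTHS[$mon], $year + 1900, $hour, $min, $sec;
}

# Hypothetical helper: headers for a body that last changed at $mtime
# and may be cached for $ttl seconds counted from $now.
# Assumes $body holds bytes, so length() gives the size on the wire.
sub cache_headers {
    my ($body, $mtime, $now, $ttl) = @_;
    return (
        'Content-Length' => length($body),
        'Last-Modified'  => http_date($mtime),
        'Expires'        => http_date($now + $ttl),
    );
}

my $body    = "<html><body>Hello</body></html>\n";
my %headers = cache_headers($body, time - 3600, time, 600);
print "$_: $headers{$_}\n" for sort keys %headers;
print "\n", $body;
```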
Ask Bjoern Hansen has written a mod_proxy_add_forward module for apache that sets the X-Forwarded-For field when doing a ProxyPass, similar to what squid can do. (Its location is specified in the help section.) Basically, that module adds an extra HTTP header to proxied requests. You can access that header in the mod_perl-enabled server, and use it to set the IP address of the remote client. You won't need to compile anything into the back-end server; if you are using Apache::{Registry,PerlRun}, just put something like the following into start-up.pl:

    use Apache::Constants 'OK';

    sub My::ProxyRemoteAddr ($) {
        my $r = shift;

        # we'll only look at the X-Forwarded-For header if the request
        # comes from our proxy at localhost
        return OK unless ($r->connection->remote_ip eq "127.0.0.1");

        # take the last address in the header, if there is one
        if (my ($ip) = ($r->header_in('X-Forwarded-For') || '') =~ /([^,\s]+)$/) {
            $r->connection->remote_ip($ip);
        }

        return OK;
    }

And in httpd.conf:

    PerlPostReadRequestHandler My::ProxyRemoteAddr

Different sites have different needs. If you're using the header to set the IP address that apache believes it is dealing with (in the logging and so on), you really don't want anyone but your own system to set the header. That's why the above ``recommended code'' checks where the request is really coming from before changing the remote_ip.
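The regular expression in the handler is worth a closer look. An X-Forwarded-For header can carry several comma separated addresses, since each proxy along the way appends the address it received the request from, so the last entry is the one added by your own front end. The standalone sketch below applies the same pattern to a sample header; the function name and the addresses are invented for the example.

```perl
use strict;
use warnings;

# Same pattern as in the handler: capture the last token that
# contains no commas and no whitespace.
sub last_forwarded_ip {
    my ($header) = @_;
    return unless defined $header;
    my ($ip) = $header =~ /([^,\s]+)$/;
    return $ip;    # undef if the header was empty
}

print last_forwarded_ip('10.0.0.7, 192.168.1.20'), "\n";
# prints 192.168.1.20
```

Earlier entries in the list may have been supplied by the client itself, which is one more reason to trust only what your own proxy appended.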
From that point on, the remote IP address is correct. You should be able to
access REMOTE_ADDR as usual.
You could do the same thing with other environment variables (though I think several of them are preserved; you will want to run some tests to see which ones).
Assuming that you have a setup of one ``front-end'' server which proxies the ``back-end'' (mod_perl) server, if you need to perform authentication in the ``back-end'' server, it should handle all authentication itself. If apache proxies correctly, it should pass through all the authentication information, which makes the ``front-end'' apache somewhat ``dumb'': it does nothing but pass the information through.
The only possible caveat in the config file is that your Auth directives need to be in <Directory ...> ... </Directory> sections, because if you use <Location /...> ... </Location> the proxying server takes the auth info for its own authentication and does not pass it on.
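A minimal sketch of the Directory-based setup described above, assuming Basic authentication against a password file. The directory path, the realm name and the password file location are hypothetical; substitute your own.

```apache
# Auth directives kept inside a <Directory> section,
# not inside a <Location> section.
<Directory /usr/local/apache/perl/private>
    AuthType Basic
    AuthName "Private Area"
    AuthUserFile /usr/local/apache/conf/htpasswd
    require valid-user
</Directory>
```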
Next month I'll continue talking about proxy servers and will present the squid proxy server: its drawbacks and benefits, configuration details.