Mod Perl Icon Mod Perl Icon Performance Tuning


Table of Contents:

[ TOC ]
The Writing Apache Modules with Perl and C book can be purchased online from O'Reilly and Amazon.com.
Your corrections of the technical and grammatical errors are very welcome. You are encouraged to help me improve this guide. If you have something to contribute please send it directly to me.
[ TOC ]

The Big Picture

To make the user's Web browsing experience as painless as possible, every effort must be made to wring the last drop of performance from the server. There are many factors which affect Web site usability, but speed is one of the most important. This applies to any webserver, not just Apache, so it is very important that you understand it.

How do we measure the speed of a server? Since the user (and not the computer) is the one that interacts with the Web site, one good speed measurement is the time elapsed between the moment when she clicks on a link or presses a Submit button to the moment when the resulting page is fully rendered.

The requests and replies are broken into packets. A request may be made up of several packets, a reply may be many thousands. Each packet has to make its own way from one machine to another, perhaps passing through many interconnection nodes. We must measure the time starting from when the first packet of the request leaves our user's machine to when the last packet of the reply arrives back there.

A webserver is only one of the entities the packets see along their way. If we follow them from browser to server and back again, they may travel by different routes through many different entities. Before they are processed by your server the packets might have to go through proxy (accelerator) servers and if the request contains more than one packet, packets might arrive to the server by different routes with different arrival times, therefore it's possible that some packets that arrive earlier will have to wait for other packets before they could be reassembled into a chunk of the request message that will be then read by the server. Then the whole process is repeated in reverse.

You could work hard to fine tune your webserver's performance, but a slow Network Interface Card (NIC) or a slow network connection from your server might defeat it all. That's why it's important to think about the Big Picture and to be aware of possible bottlenecks between the server and the Web.

Of course there is little that you can do if the user has a slow connection. You might tune your scripts and webserver to process incoming requests ultra quickly, so you will need only a small number of working servers, but you might find that the server processes are all busy waiting for slow clients to accept their responses.

But there are techniques to cope with this. For example you can deliver the respond after it was compressed. If you are delivering a pure text respond--gzip compression will reduce the size of the message by 10 times.

You should analyze all the involved components when you try to create the best service for your users, and not the web server or the code that the web server executes. A Web service is like a car, if one of the parts or mechanisms is broken the car may not go smoothly and it can even stop dead if pushed too far without first fixing it.

And let me stress it again--If you want to have a success in the web service business you should start worrying about the client's browsing experience and not only how good your code benchmarks are.

[ TOC ]


System Analysis

Before we try to solve a problem we need to indentify it. In our case we want to get the best performance we can with as little monetary and time investment as possible.

[ TOC ]


Software Requirements

Covered in the section ``Choosing an Operating System''.

[ TOC ]


Hardware Requirements

(META: Only partial analysis. Please submit more points. Many points are scattered around the document and should be gathered here, to represent the whole picture. It also should be merged with the above item!)

You need to analyze all of the problem's dimensions. There are several things that need to be considered:

The first one is probably the easiest to optimize. Following the performance optimization tips in this and other documents allows a perl (mod_perl) programmer to exercise their code and improve it.

The second one is a function of RAM. How much RAM is in each box, how many boxes do you have, and how much RAM does each mod_perl process use? Multiply the first two and divide by the third. Ask yourself whether it is better to switch to another, possibly just as inefficient language or whether that will actually cost more than throwing another powerful machine into the rack.

Also ask yourself whether switching to another language will even help. In some applications, for example to link Oracle runtime libraries, a huge chunk of memory is needed so you would save nothing even if you switched from Perl to C.

The last two are important. You need a realistic estimate. Are you really expecting 8 million hits per day? What is the expected peak load, and what kind of response time do you need to guarantee? Remember that these numbers might change drastically when you apply code changes and your site becomes popular. Remember that when you get a very high hit rate, the resource requirements don't grow linearly but exponentially!

More coverage is provided in the section ``Choosing Hardware''.

[ TOC ]


Essential Tools

In order to improve performance we need measurement tools. The main tool categories are benchmarking and code profiling.

[ TOC ]


Benchmarking Applications

How much faster is mod_perl than mod_cgi (aka plain perl/CGI)? There are many ways to benchmark the two. I'll present a few examples and numbers below. Check out the benchmark directory of the mod_perl distribution for more examples.

If you are going to write your own benchmarking utility, use the Benchmark module for heavy scripts and the Time::HiRes module for very fast scripts (faster than 1 sec) where you will need better time precision.

There is no need to write a special benchmark though. If you want to impress your boss or colleagues, just take some heavy CGI script you have (e.g. a script that crunches some data and prints the results to STDOUT), open 2 xterms and call the same script in mod_perl mode in one xterm and in mod_cgi mode in the other. You can use lwp-get from the LWP package to emulate the browser. The benchmark directory of the mod_perl distribution includes such an example.

See also two tools for benchmarking: ApacheBench and crashme test

[ TOC ]


Developers Talk

Perrin Harkins writes on benchmarks or comparisons, official or unofficial:

I have used some of the platforms you mentioned and researched others. What I can tell you for sure, is that no commercially available system offers the depth, power, and ease of use that mod_perl has. Either they don't let you access the web server internals, or they make you use less productive languages than Perl, sometimes forcing you into restrictive and confusing APIs and/or GUI development environments. None of them offers the level of support available from simply posting a message to [the mod-perl] list, at any price.

As for performance, beyond doing several important things (code-caching, pre-forking/threading, and persistent database connections) there isn't much these tools can do, and it's mostly in your hands as the developer to see that the things which really take the time (like database queries) are optimized.

The downside of all this is that most manager types seem to be unable to believe that web development software available for free could be better than the stuff that cost $25,000 per CPU. This appears to be the major reason most of the web tools companies are still in business. They send a bunch of suits to give PowerPoint presentations and hand out glossy literature to your boss, and you end up with an expensive disaster and an approaching deadline.

But I'm not bitter or anything...

Jonathan Peterson adds:

Most of the major solutions have something that they do better than the others, and each of them has faults. Microsoft's ASP has a very nice objects model, and has IMO the best data access object (better than DBI to use - but less portable). It has the worst scripting language. PHP has many of the advantages of Perl-based solutions, and is less complicated for developers. Netscape's Livewire has a good object model too, and provides good server-side Java integration - if you want to leverage Java skills, it's good. Also, it has a compiled scripting language - which is great if you aren't selling your clients the source code (and a pain otherwise).

mod_perl's advantage is that it is the most powerful. It offers the greatest degree of control with one of the more powerful languages. It also offers the greatest granularity. You can use an embedding module (eg eperl) from one place, a session module (Session) from another, and your data access module from yet another.

I think the Apache::ASP module looks very promising. It has very easy to use and adequately powerful state maintenance, a good embedding system, and a sensible object model (that emulates the Microsoft ASP one). It doesn't replicate MS's ADO for data access, but DBI is fine for that.

I have always found that the developers available make the greatest impact on the decision. If you have a team with no Perl experience, and a small or medium task, using something like PHP, or Microsoft ASP makes more sense than driving your staff into the vertical learning curve they'll need to use mod_perl.

For very large jobs, it may be worth finding the best technical solution, and then recruiting the team with the necessary skills.

[ TOC ]


Benchmarking a Graphic Hits Counter with Persistent DB Connections

Here are the numbers from Michael Parker's mod_perl presentation at the Perl Conference (Aug, 98). (Sorry, there used to be links here to the source, but they went dead one day, so I removed them). The script is a standard hits counter, but it logs the counts into a mysql relational DataBase:

 
    Benchmark: timing 100 iterations of cgi, perl...  [rate 1:28]
    
    cgi: 56 secs ( 0.33 usr 0.28 sys = 0.61 cpu) 
    perl: 2 secs ( 0.31 usr 0.27 sys = 0.58 cpu) 
    
    Benchmark: timing 1000 iterations of cgi,perl...  [rate 1:21]
     
    cgi: 567 secs ( 3.27 usr 2.83 sys = 6.10 cpu) 
    perl: 26 secs ( 3.11 usr 2.53 sys = 5.64 cpu)      
    
    Benchmark: timing 10000 iterations of cgi, perl   [rate 1:21]
     
    cgi: 6494 secs (34.87 usr 26.68 sys = 61.55 cpu) 
    perl: 299 secs (32.51 usr 23.98 sys = 56.49 cpu) 

We don't know what server configurations were used for these tests, but I guess the numbers speak for themselves.

The source code of the script was available at http://www.realtime.net/~parkerm/perl/conf98/sld006.htm. It's now a dead link. If you know its new location, please let me know.

[ TOC ]


Benchmarking Scripts with Execution Times Below 1 Second

If you want to get the benchmark results in micro-seconds you will have to use the Time::HiRes module, its usage is similar to Benchmark's.

 
  use Time::HiRes qw(gettimeofday tv_interval);
  my $start_time = [ gettimeofday ];
  sub_that_takes_a_teeny_bit_of_time();
  my $end_time = [ gettimeofday ];
  my $elapsed = tv_interval($start_time,$end_time);
  print "The sub took $elapsed seconds."

See also the crashme test.

[ TOC ]


Benchmarking PerlHandlers

The Apache::Timeit module does PerlHandler Benchmarking. With the help of this module you can log the time taken to process the request, just like you'd use the Benchmark module to benchmark a regular Perl script. Of course you can extend this module to perform more advanced processing like putting the results into a database for a later processing. But all it takes is adding this configuration directive inside httpd.conf:

 
  PerlFixupHandler Apache::Timeit

Since scripts running under Apache::Registry are running inside the PerlHandler these are benchmarked as well.

An example of the lines which show up in the error_log file:

 
  timing request for /perl/setupenvoff.pl:
    0 wallclock secs ( 0.04 usr +  0.01 sys =  0.05 CPU)
  timing request for /perl/setupenvoff.pl:
    0 wallclock secs ( 0.03 usr +  0.00 sys =  0.03 CPU)

The Apache::Timeit package is a part of the Apache-Perl-contrib files collection available from CPAN.

[ TOC ]


Code Profiling Techniques

The profiling process helps you to determine which subroutines or just snippets of code take the longest time to execute and which subroutines are called most often. Probably you will want to optimize those.

When do you need to profile your code? You do that when you suspect that some part of your code is called very often and may be there is a need to optimize it to significantly improve the overall performance.

For example if you have ever used the diagnostics pragma, which extends the terse diagnostics normally emitted by both the Perl compiler and the Perl interpreter, augmenting them with the more verbose and endearing descriptions found in the perldiag manpage. You know that it might tremendously slow you code down, so let's first prove that it is correct.

We will run a benchmark, once with diagnostics enabled and once disabled, on a subroutine called test_code.

The code inside the subroutine does an arithmetic and a numeric comparison of two strings. It assigns one string to another if the condition tests true but the condition always tests false. To demonstrate the diagnostics overhead the comparison operator is intentionally wrong. It should be a string comparison, not a numeric one.

 
  use Benchmark;
  use diagnostics;
  use strict;
  
  my $count = 50000;
  
  disable diagnostics;
  my $t1 = timeit($count,\&test_code);
  
  enable  diagnostics;
  my $t2 = timeit($count,\&test_code);
  
  print "Off: ",timestr($t1),"\n";
  print "On : ",timestr($t2),"\n";
  
  sub test_code{
    my ($a,$b) = qw(foo bar);
    my $c;
    if ($a == $b) {
      $c = $a;
    }
  }

For only a few lines of code we get:

 
  Off:  1 wallclock secs ( 0.81 usr +  0.00 sys =  0.81 CPU)
  On : 13 wallclock secs (12.54 usr +  0.01 sys = 12.55 CPU)

With diagnostics enabled, the subroutine test_code() is 16 times slower, than with diagnostics disabled!

Now let's fix the comparison the way it should be, by replacing == with eq, so we get:

 
    my ($a,$b) = qw(foo bar);
    my $c;
    if ($a eq $b) {
      $c = $a;
    }

and run the same benchmark again:

 
  Off:  1 wallclock secs ( 0.57 usr +  0.00 sys =  0.57 CPU)
  On :  1 wallclock secs ( 0.56 usr +  0.00 sys =  0.56 CPU)

Now there is no overhead at all. The diagnostics pragma slows things down only when warnings are generated.

After we have verified that using the diagnostics pragma might adds a big overhead to execution runtime, let's use the code profiling to understand why this happens. We are going to use Devel::DProf to profile the code. Let's use this code:

 
  diagnostics.pl
  --------------
  use diagnostics;
  print "Content-type:text/html\n\n";
  test_code();
  sub test_code{
    my ($a,$b) = qw(foo bar);
    my $c;
    if ($a == $b) {
      $c = $a;
    }
  }

Run it with the profiler enabled, and then create the profiling stastics with the help of dprofpp:

 
  % perl -d:DProf diagnostics.pl
  % dprofpp
  
  Total Elapsed Time = 0.342236 Seconds
    User+System Time = 0.335420 Seconds
  Exclusive Times
  %Time ExclSec CumulS #Calls sec/call Csec/c  Name
   92.1   0.309  0.358      1   0.3089 0.3578  main::BEGIN
   14.9   0.050  0.039   3161   0.0000 0.0000  diagnostics::unescape
   2.98   0.010  0.010      2   0.0050 0.0050  diagnostics::BEGIN
   0.00   0.000 -0.000      2   0.0000      -  Exporter::import
   0.00   0.000 -0.000      2   0.0000      -  Exporter::export
   0.00   0.000 -0.000      1   0.0000      -  Config::BEGIN
   0.00   0.000 -0.000      1   0.0000      -  Config::TIEHASH
   0.00   0.000 -0.000      2   0.0000      -  Config::FETCH
   0.00   0.000 -0.000      1   0.0000      -  diagnostics::import
   0.00   0.000 -0.000      1   0.0000      -  main::test_code
   0.00   0.000 -0.000      2   0.0000      -  diagnostics::warn_trap
   0.00   0.000 -0.000      2   0.0000      -  diagnostics::splainthis
   0.00   0.000 -0.000      2   0.0000      -  diagnostics::transmo
   0.00   0.000 -0.000      2   0.0000      -  diagnostics::shorten
   0.00   0.000 -0.000      2   0.0000      -  diagnostics::autodescribe

It's not easy to see what is responsible for this enormous overhead, even if main::BEGIN seems to be running most of the time. To get the full picture we must see the OPs tree, which shows us who calls whom, so we run:

 
  % dprofpp -T

and the output is:

 
 main::BEGIN
   diagnostics::BEGIN
      Exporter::import
         Exporter::export
   diagnostics::BEGIN
      Config::BEGIN
      Config::TIEHASH
      Exporter::import
         Exporter::export
   Config::FETCH
   Config::FETCH
   diagnostics::unescape
   .....................
   3159 times [diagnostics::unescape] snipped
   .....................
   diagnostics::unescape
   diagnostics::import
 diagnostics::warn_trap
   diagnostics::splainthis
      diagnostics::transmo
      diagnostics::shorten
      diagnostics::autodescribe
 main::test_code
   diagnostics::warn_trap
      diagnostics::splainthis
         diagnostics::transmo
         diagnostics::shorten
         diagnostics::autodescribe
   diagnostics::warn_trap
      diagnostics::splainthis
         diagnostics::transmo
         diagnostics::shorten
        diagnostics::autodescribe

So we see that two executions of diagnostics::BEGIN and 3161 of diagnostics::unescape are responsible for most of the running overhead.

META: but we see that it might be run only once in mod_perl, so the numbers are better. Am I right? check it!

If we comment out the diagnostics module, we get:

 
  Total Elapsed Time = 0.079974 Seconds
    User+System Time = 0.059974 Seconds
  Exclusive Times
  %Time ExclSec CumulS #Calls sec/call Csec/c  Name
   0.00   0.000 -0.000      1   0.0000      -  main::test_code

It is possible to profile code running under mod_perl with the Devel::DProf module, available on CPAN. However, you must have apache version 1.3b3 or higher and the PerlChildExitHandler enabled during the httpd build process. When the server is started, Devel::DProf installs an END block to write the tmon.out file. This block will be called at server shutdown. Here is how to start and stop a server with the profiler enabled:

 
  % setenv PERL5OPT -d:DProf
  % httpd -X -d `pwd` &
  ... make some requests to the server here ...
  % kill `cat logs/httpd.pid`
  % unsetenv PERL5OPT
  % dprofpp

The Devel::DProf package is a Perl code profiler. It will collect information on the execution time of a Perl script and of the subs in that script (remember that print() and map() are just like any other subroutines you write, but they come bundled with Perl!)

Another approach is to use Apache::DProf, which hooks Devel::DProf into mod_perl. The Apache::DProf module will run a Devel::DProf profiler inside each child server and write the tmon.out file in the directory $ServerRoot/logs/dprof/$$ when the child is shutdown (where $$ is the number of the child process). All it takes is to add to httpd.conf:

 
  PerlModule Apache::DProf

Remember that any PerlHandler that was pulled in before Apache::DProf in the httpd.conf or startup.pl, will not have its code debugging information inserted. To run dprofpp, chdir to $ServerRoot/logs/dprof/$$ and run:

 
  % dprofpp

[ TOC ]


Measuring the Memory Usage of Subroutines

With help of Apache::Status you can find out the size of each and every subroutine.

Now you can start to optimize your code. Or test which of the several implementations is of the least size.

For example let's compare CGI.pm's OO vs procedural interfaces:

As you will see below the first OO script uses about 2k bytes while the second script (procedural interface) uses about 5k.

Here are the code examples and the numbers:

  1.  
      cgi_oo.pl
      ---------
      use CGI ();
      my $q = CGI->new;
      print $q->header;
      print $q->b("Hello");

  2.  
      cgi_mtd.pl
      ---------
      use CGI qw(header b);
      print header();
      print b("Hello");

After executing each script in single server mode (-X) the results are:

  1.  
      Totals: 1966 bytes | 27 OPs

     
      handler 1514 bytes | 27 OPs
      exit     116 bytes |  0 OPs

  2.  
      Totals: 4710 bytes | 19 OPs
      
      handler  1117 bytes | 19 OPs
      basefont  120 bytes |  0 OPs
      frameset  120 bytes |  0 OPs
      caption   119 bytes |  0 OPs
      applet    118 bytes |  0 OPs
      script    118 bytes |  0 OPs
      ilayer    118 bytes |  0 OPs
      header    118 bytes |  0 OPs
      strike    118 bytes |  0 OPs
      layer     117 bytes |  0 OPs
      table     117 bytes |  0 OPs
      frame     117 bytes |  0 OPs
      style     117 bytes |  0 OPs
      Param     117 bytes |  0 OPs
      small     117 bytes |  0 OPs
      embed     117 bytes |  0 OPs
      font      116 bytes |  0 OPs
      span      116 bytes |  0 OPs
      exit      116 bytes |  0 OPs
      big       115 bytes |  0 OPs
      div       115 bytes |  0 OPs
      sup       115 bytes |  0 OPs
      Sub       115 bytes |  0 OPs
      TR        114 bytes |  0 OPs
      td        114 bytes |  0 OPs
      Tr        114 bytes |  0 OPs
      th        114 bytes |  0 OPs
      b         113 bytes |  0 OPs

Note, that the above is correct if you didn't precompile all CGI.pm's methods at server startup. Since if you did, the procedural interface in the second test will take up to 18k and not 5k as we saw. That's because the whole of CGI.pm's namespace is inherited and it already has all its methods compiled, so it doesn't really matter whether you attempt to import only the symbols that you need. So if you have:

 
  use CGI  qw(-compile :all);

in the server startup script. Having:

 
  use CGI qw(header);

or

 
  use CGI qw(:all);

is essentially the same. You will have all the symbols precompiled at startup imported even if you ask for only one symbol. It seems to me like a bug, but probably that's how CGI.pm works.

BTW, you can check the number of opcodes in the code by a simple command line run. For example comparing 'my %hash' vs 'my %hash = ()'.

 
  % perl -MO=Terse -e 'my %hash' | wc -l
  -e syntax OK
      4

 
  % perl -MO=Terse -e 'my %hash = ()' | wc -l
  -e syntax OK
     10

The first one has less opcodes.

[ TOC ]


Know Your Operating System

In order to get the best performance it helps to get intimately familiar with the Operating System (OS) the web server is running on. There are many OS specific things that you may be able to optimise which will improve your web server's speed, reliability and security.

The following sections will unveal some of the most important details you should know about your OS.

[ TOC ]


Sharing Memory

The sharing of memory is one very important factor. If your OS supports it (and most sane systems do), you might save memory by sharing it between child processes. This is only possible when you preload code at server startup. However, during a child process' life its memory pages tend to become unshared.

There is no way we can make Perl allocate memory so that (dynamic) variables land on different memory pages from constants, so the copy-on-write effect (we will explain this in a moment) will hit you almost at random.

If you are pre-loading many modules you might be able to trade off the memory that stays shared against the time for an occasional fork by tuning MaxRequestsPerChild. Each time a child reaches this upper limit and dies it should release its unshared pages. The new child which replaces it will share its fresh pages until it scribbles on them.

The ideal is a point where your processes usually restart before too much memory becomes unshared. You should take some measurements to see if it makes a real difference, and to find the range of reasonable values. If you have success with this tuning the value of MaxRequestsPerChild will probably be peculiar to your situation and may change with changing circumstances.

It is very important to understand that your goal is not to have MaxRequestsPerChild to be 10000. Having a child serving 300 requests on precompiled code is already a huge overall speedup, so if it is 100 or 10000 it probably does not really matter if you can save RAM by using a lower value.

Do not forget that if you preload most of your code at server startup, the newly forked child gets ready very very fast, because it inherits most of the preloaded code and the perl interpreter from the parent process.

During the life of the child its memory pages (which aren't really its own to start with, it uses the parent's pages) gradually get `dirty' - variables which were originally inherited and shared are updated or modified -- and the copy-on-write happens. This reduces the number of shared memory pages, thus increasing the memory requirement. Killing the child and spawning a new one allows the new child to get back to the pristine shared memory of the parent process.

The recommendation is that MaxRequestsPerChild should not be too large, otherwise you lose some of the benefit of sharing memory.

See Choosing MaxRequestsPerChild for more about tuning the MaxRequestsPerChild parameter.

[ TOC ]


How Shared Is My Memory?

You've probably noticed that the word shared is repeated many times in relation to mod_perl. Indeed, shared memory might save you a lot of money, since with sharing in place you can run many more servers than without it. See the Formula and the numbers.

How much shared memory do you have? You can see it by either using the memory utility that comes with your system or you can deploy the GTop module:

 
  use GTop ();
  print "Shared memory of the current process: ",
    GTop->new->proc_mem($$)->share,"\n";
  
  print "Total shared memory: ",
    GTop->new->mem->share,"\n";

When you watch the output of the top utility, don't confuse the RES (or RSS) columns with the SHARE column. RES is RESident memory, which is the size of pages currently swapped in.

[ TOC ]


Calculating Real Memory Usage

I have shown how to measure the size of the process' shared memory, but we still want to know what the real memory usage is. Obviously this cannot be calculated simply by adding up the memory size of each process because that wouldn't account for the shared memory.

On the other hand we cannot just subtract the shared memory size from the total size to get the real memory usage numbers, because in reality each process has a different history of processed requests, therefore the shared memory is not the same for all processes.

So how do we measure the real memory size used by the server we run? It's probably too difficult to give the exact number, but I've found a way to get a fair approximation which was verified in the following way. I have calculated the real memory used, by the technique you will see in the moment, and then have stopped the Apache server and saw that the memory usage report indicated that the total used memory went down by almost the same number I've calculated. Note that some OSs do smart memory pages caching so you may not see the memory usage decrease as soon as it actually happens when you quit the application.

This is a technique I've used:

  1. For each process sum up the difference between shared and system memory. To calculate a difference for a single process use:

     
      use GTop;
      my $proc_mem = GTop->new->proc_mem($$);
      my $diff     = $proc_mem->size - $proc_mem->share;
      print "Difference is $diff bytes\n";

  2. Now if we add the shared memory size of the process with maximum shared memory, we will get all the memory that actually is being used by all httpd processes, except for the parent process.

  3. Finally, add the size of the parent process.

Please note that this might be incorrect for your system, so you use this number on your own risk.

I've used this technique to display real memory usage in the module Apache::VMonitor, so instead of trying to manually calculate this number you can use this module to do it automatically. In fact in the calculations used in this module there is no separation between the parent and child processes, they are all counted indifferently using the following code:

 
  use GTop ();
  my $gtop = GTop->new;
  my $total_real = 0;
  my $max_shared = 0;
  # @mod_perl_pids is initialized by Apache::Scoreboard, irrelevant here
  my @mod_perl_pids = some_code();
  for my $pid (@mod_perl_pids)
    my $proc_mem = $gtop->proc_mem($pid);
    my $size     = $proc_mem->size($pid);
    my $share    = $proc_mem->share($pid);
    $total_real += $size - $share;
    $max_shared  = $share if $max_shared < $share;
  }
  my $total_real += $max_shared;

So as you see we that we accumulate the difference between the shared and reported memory:

 
    $total_real  += $size-$share;

and at the end add the biggest shared process size:

 
  my $total_real += $max_shared;

So now $total_real contains approximately the really used memory.

[ TOC ]


Are My Variables Shared?

How do you find out if the code you write is shared between the processes or not? The code should be shared, except where it is on a memory page with variables that change. Some variables are read-only in usage and never change. For example, if you have some variables that use a lot of memory and you want them to be read-only. As you know the variable becomes unshared when the process modifies its value.

So imagine that you have this 10Mb in-memory database that resides in a single variable, you perform various operations on it and want to make sure that the variable is still shared. For example if you do some matching regular expression (regex) processing on this variable and want to use the pos() function, will it make the variable unshared or not?

The Apache::Peek module comes to rescue. Let's write a module called MyShared.pm which we preload at server startup, so all the variables of this module are initially shared by all children.

 
  MyShared.pm
  ---------
  package MyShared;
  use Apache::Peek;
  
  my $readonly = "Chris";
  
  sub match    { $readonly =~ /\w/g;               }
  sub print_pos{ print "pos: ",pos($readonly),"\n";}
  sub dump     { Dump($readonly);                  }
  1;

This module declares the package MyShared, loads the Apache::Peek module and defines the lexically scoped $readonly variable which is supposed to be a variable of large size (think about a huge hash data structure), but we will use a small one to simplify this example.

The module also defines three subroutines: match() that does a simple character matching, print_pos() that prints the current position of the matching engine inside the string that was last matched and finally the dump() subroutine that calls the Apache::Peek module's Dump() function to dump a raw Perl data-type of the $readonly variable.

Now we write the script that prints the process ID (PID) and calls all three functions. The goal is to check whether pos() makes the variable dirty and therefore unshared.

 
  share_test.pl
  -------------
  use MyShared;
  print "Content-type: text/plain\r\n\r\n";
  print "PID: $$\n";
  MyShared::match();
  MyShared::print_pos();
  MyShared::dump();

Before you restart the server, in httpd.conf set:

 
  MaxClients 2

for easier tracking. You need at least two servers to compare the print outs of the test program. Having more than two can make the comparison process harder.

Now open two browser windows and issue the request for this script several times in both windows, so you get different processes PIDs reported in the two windows and each process has processed a different number of requests to the share_test.pl script.

In the first window you will see something like that:

 
  PID: 27040
  pos: 1
  SV = PVMG(0x853db20) at 0x8250e8c
    REFCNT = 3
    FLAGS = (PADBUSY,PADMY,SMG,POK,pPOK)
    IV = 0
    NV = 0
    PV = 0x8271af0 "Chris"\0
    CUR = 5
    LEN = 6
    MAGIC = 0x853dd80
      MG_VIRTUAL = &vtbl_mglob
      MG_TYPE = 'g'
      MG_LEN = 1

And in the second window:

 
  PID: 27041
  pos: 2
  SV = PVMG(0x853db20) at 0x8250e8c
    REFCNT = 3
    FLAGS = (PADBUSY,PADMY,SMG,POK,pPOK)
    IV = 0
    NV = 0
    PV = 0x8271af0 "Chris"\0
    CUR = 5
    LEN = 6
    MAGIC = 0x853dd80
      MG_VIRTUAL = &vtbl_mglob
      MG_TYPE = 'g'
      MG_LEN = 2

We see that all the addresses of the supposedly big structure are the same (0x8250e8c and 0x8271af0), therefore the variable data structure is almost completely shared. The only difference is in SV.MAGIC.MG_LEN record, which is not shared.

So given that the $readonly variable is a big one, its value is still shared between the processes, while part of the variable data structure is non-shared. But it's almost insignificant because it takes a very little memory space.

Now if you need to compare more than variable, doing it by hand can be quite time consuming and error prune. Therefore it's better to correct the testing script to dump the Perl data-types into files (e.g /tmp/dump.$$, where $$ is the PID of the process) and then using diff(1) utility to see whether there is some difference.

So correcting the dump() function to write the info to the file will do the job. Notice that we use Devel::Peek and not Apache::Peek. The both are almost the same, but Apache::Peek prints it output directly to the opened socket so we cannot intercept and redirect the result to the file. Since Devel::Peek dumps results to the STDERR stream we can use the old trick of saving away the default STDERR handler, and open a new filehandler using the STDERR. In our example when Devel::Peek now prints to STDERR it actually prints to our file. When we are done, we make sure to restore the original STDERR filehandler.

So this is the resulting code:

 
  MyShared2.pm
  ---------
  package MyShared2;
  use Devel::Peek;
  
  my $readonly = "Chris";
  
  sub match    { $readonly =~ /\w/g;               }
  sub print_pos{ print "pos: ",pos($readonly),"\n";}
  sub dump{
    my $dump_file = "/tmp/dump.$$";
    print "Dumping the data into $dump_file\n";
    open OLDERR, ">&STDERR";
    open STDERR, ">".$dump_file or die "Can't open $dump_file: $!";
    Dump($readonly);
    close STDERR ;
    open STDERR, ">&OLDERR";
  }
  1;

When if we modify the code to use the modified module:

 
  share_test2.pl
  -------------
  use MyShared2;
  print "Content-type: text/plain\r\n\r\n";
  print "PID: $$\n";
  MyShared2::match();
  MyShared2::print_pos();
  MyShared2::dump();

And run it as before (with MaxClients 2), two dump files will be created in the directory /tmp. In our test these were created as /tmp/dump.1224 and /tmp/dump.1225. When we run diff(1):

 
  % diff /tmp/dump.1224 /tmp/dump.1225
  12c12
  <       MG_LEN = 1
  ---
  >       MG_LEN = 2

We see that the two padlists (of the variable readonly) are different, as we have observed before when we did a manual comparison.

In fact we if we think about these results again, we get to a conclusion that there is no need for two processes to find out whether the variable gets modified (and therefore unshared). It's enough to check the datastructure before the script was executed and after that. You can modify the MyShared2 module to dump the padlists into a different file after each invocation and than to run the diff(1) on the two files.

If you want to watch whether some lexically scoped (with my()) variables in your Apache::Registry script inside the same process get changed between invocations you can use the Apache::RegistryLexInfo module instead. Since it does exactly this: it makes a snapshot of the padlist before and after the code execution and shows the difference between the two. This specific module was written to work with Apache::Registry scripts so it won't work for loaded modules. Use the technique we have described above for any type of variables in modules and scripts.

Surely another way of ensuring that a scalar is readonly and therefore sharable is to either use the constant pragma or readonly pragma. But then you won't be able to make calls that alter the variable even a little, like in the example that we just showed, because it will be a true constant variable and you will get compile time error if you try this:

 
  MyConstant.pm
  -------------
  package MyConstant;
  use constant readonly => "Chris";
  
  sub match    { readonly =~ /\w/g;               }
  sub print_pos{ print "pos: ",pos(readonly),"\n";}
  1;

 
  % perl -c MyConstant.pm

 
  Can't modify constant item in match position at MyConstant.pm line
  5, near "readonly)"
  MyConstant.pm had compilation errors.

However this code is just right:

 
  MyConstant1.pm
  -------------
  package MyConstant1;
  use constant readonly => "Chris";
  
  sub match { readonly =~ /\w/g; }
  1;

[ TOC ]


Preloading Perl Modules at Server Startup

You can use the PerlRequire and PerlModule directives to load commonly used modules such as CGI.pm, DBI and etc., when the server is started. On most systems, server children will be able to share the code space used by these modules. Just add the following directives into httpd.conf:

 
  PerlModule CGI
  PerlModule DBI

But an even better approach is to create a separate startup file (where you code in plain perl) and put there things like:

 
  use DBI ();
  use Carp ();

Don't forget to prevent importing of the symbols exported by default by the module you are going to preload, by placing empty parentheses () after a module's name. Unless you need some of these in the startup file, which is unlikely. This will save you a few more memory bits.

Then you require() this startup file in httpd.conf with the PerlRequire directive, placing it before the rest of the mod_perl configuration directives:

 
  PerlRequire /path/to/start-up.pl

CGI.pm is a special case. Ordinarily CGI.pm autoloads most of its functions on an as-needed basis. This speeds up the loading time by deferring the compilation phase. When you use mod_perl, FastCGI or another system that uses a persistent Perl interpreter, you will want to precompile the functions at initialization time. To accomplish this, call the package function compile() like this:

 
  use CGI ();
  CGI->compile(':all');

The arguments to compile() are a list of method names or sets, and are identical to those accepted by the use() and import() operators. Note that in most cases you will want to replace ':all' with the tag names that you actually use in your code, since generally you only use a subset of them.

Let's conduct a memory usage test to prove that preloading, reduces memory requirements.

In order to have an easy measurement we will use only one child process, therefore we will use this setting:

 
  MinSpareServers 1
  MaxSpareServers 1
  StartServers 1
  MaxClients 1
  MaxRequestsPerChild 100

We are going to use the Apache::Registry script memuse.pl which consists of two parts: the first one preloads a bunch of modules (that most of them aren't going to be used), the second part reports the memory size and the shared memory size used by the single child process that we start. and of course it prints the difference between the two sizes.

 
  memuse.pl
  ---------
  use strict;
  use CGI ();
  use DB_File ();
  use LWP::UserAgent ();
  use Storable ();
  use DBI ();
  use GTop ();

 
  my $r = shift;
  $r->send_http_header('text/plain');
  my $proc_mem = GTop->new->proc_mem($$);
  my $size  = $proc_mem->size;
  my $share = $proc_mem->share;
  my $diff  = $size - $share;
  printf "%10s %10s %10s\n", qw(Size Shared Difference);
  printf "%10d %10d %10d (bytes)\n",$size,$share,$diff;

First we restart the server and execute this CGI script when none of the above modules preloaded. Here is the result:

 
     Size   Shared     Diff
  4706304  2134016  2572288 (bytes)

Now we take all the modules:

 
  use strict;
  use CGI ();
  use DB_File ();
  use LWP::UserAgent ();
  use Storable ();
  use DBI ();
  use GTop ();

and copy them into the startup script, so they will get preloaded. The script remains unchanged. We restart the server and execute it again. We get the following.

 
     Size   Shared    Diff
  4710400  3997696  712704 (bytes)

Let's put the two results into one table:

 
  Preloading    Size   Shared     Diff
     Yes     4710400  3997696   712704 (bytes)
      No     4706304  2134016  2572288 (bytes)
  --------------------------------------------
  Difference    4096  1863680 -1859584

You can clearly see that when the modules weren't preloaded the shared memory pages size, were about 1864Kb smaller relative to the case where the modules were preloaded.

Assuming that you have had 256M dedicated to the web server, if you didn't preload the modules, you could have:

 
  268435456 = X * 2572288 + 2134016

 
  X = (268435456 - 2134016) / 2572288 = 103 

103 servers.

Now let's calculate the same thing with modules preloaded:

 
  268435456 = X * 712704 + 3997696

 
  X = (268435456 - 3997696) / 712704 = 371

You can have almost 4 times more servers!!!

Remember that we have mentioned before that memory pages gets dirty and the size of the shared memory gets smaller with time? So we have presented the ideal case where the shared memory stays intact. Therefore the real numbers will be a little bit different, but not far from the numbers in our example.

Also it's obvious that in your case it's possible that the process size will be bigger and the shared memory will be smaller, since you will use different modules and a different code, so you won't get this fantastic ratio, but this example is certainly helps to feel the difference.

[ TOC ]


Preloading Registry Scripts at Server Startup

What happens if you find yourself stuck with Perl CGI scripts and you cannot or don't want to move most of the stuff into modules to benefit from modules preloading, so the code will be shared by the children. Luckily you can preload scripts as well. This time the Apache::RegistryLoader modules comes to aid. Apache::RegistryLoader compiles Apache::Registry scripts at server startup.

For example to preload the script /perl/test.pl which is in fact the file /home/httpd/perl/test.pl you would do the following:

 
  use Apache::RegistryLoader ();
  Apache::RegistryLoader->new->handler("/perl/test.pl",
                            "/home/httpd/perl/test.pl");

You should put this code either into <Perl> sections or into a startup script.

But what if you have a bunch of scripts located under the same directory and you don't want to list them one by one. Take the benefit of Perl modules and put them to a good use. The File::Find module will do most of the work for you.

The following code walks the directory tree under which all Apache::Registry scripts are located. For each encountered file with extension .pl, it calls the Apache::RegistryLoader::handler() method to preload the script in the parent server, before pre-forking the child processes:

 
  use File::Find qw(finddepth);
  use Apache::RegistryLoader ();
  {
    my $scripts_root_dir = "/home/httpd/perl/";
    my $rl = Apache::RegistryLoader->new;
    finddepth
      (
       sub {
         return unless /\.pl$/;
         my $url = "$File::Find::dir/$_";
         $url =~ s|$scripts_root_dir/?|/|;
         warn "pre-loading $url\n";
           # preload $url
         my $status = $rl->handler($url);
         unless($status == 200) {
           warn "pre-load of `$url' failed, status=$status\n";
         }
       },
       $scripts_root_dir);
  }

Note that we didn't use the second argument to handler() here, as in the first example. To make the loader smarter about the URI to filename translation, you might need to provide a trans() function to translate the URI to filename. URI to filename translation normally doesn't happen until HTTP request time, so the module is forced to roll its own translation. If filename is omitted and a trans() function was not defined, the loader will try using the URI relative to ServerRoot.

A simple trans() function can be something like that:

 
  sub mytrans {
    my $uri = shift;
    $uri =~ s|^/perl/|/home/httpd/perl/|;
    return $uri;
  }

You can easily derive the right translation by looking at the Alias directive. The above mytrans() function is matching our Alias:

 
  Alias /perl/ /home/httpd/perl/

After defining the URI to filename translation function you should pass it during the creation of the Apache::RegistryLoader object:

 
  my $rl = Apache::RegistryLoader->new(trans => \&mytrans);

I won't show any benchmarks here, since the effect is absolutely the same as with preloading modules.

See also BEGIN blocks

[ TOC ]


Modules Initializing at Server Startup

We have just learned that it's important to preload the modules and scripts at the server startup. It turns out that it's not enough for some modules and you have to prerun their initialization code to get more memory pages shared. Basically you will find an information about specific modules in their respective manpages. We will present a few examples of widely used modules where the code can be initialized.

[ TOC ]


Initializing DBI.pm

The first example is the DBI module. As you know DBI works with many database drivers falling into the DBD:: category, e.g. DBD::mysql. It's not enough to preload DBI, you should initialize DBI with driver(s) that you are going to use (usually a single driver is used), if you want to minimize memory use after forking the child processes. Note that you want to do this under mod_perl and other environments where the shared memory is very important. Otherwise you shouldn't initialize drivers.

You probably know already that under mod_perl you should use the Apache::DBI module to get the connection persistence, unless you open a separate connection for each user--in this case you should not use this module. Apache::DBI automatically loads DBI and overrides some of its methods, so you should continue coding like there is only a DBI module.

Just as with modules preloading our goal is to find the startup environment that will lead to the smallest "difference" between the shared and normal memory reported, therefore a smaller total memory usage.

And again in order to have an easy measurement we will use only one child process, therefore we will use this setting in httpd.conf:

 
  MinSpareServers 1
  MaxSpareServers 1
  StartServers 1
  MaxClients 1
  MaxRequestsPerChild 100

We are going to run memory benchmarks on five different versions of the startup.pl file. We always preload these modules:

 
  use Gtop();
  use Apache::DBI(); # preloads DBI as well

option 1

Leave the file unmodified.

option 2

Install MySQL driver (we will use MySQL RDBMS for our test):

 
  DBI->install_driver("mysql");

It's safe to use this method, since just like with use(), if it can't be installed it'll die().

option 3

Preload MySQL driver module:

 
  use DBD::mysql;

option 4

Tell Apache::DBI to connect to the database when the child process starts (ChildInitHandler), no driver is preload before the child gets spawned!

 
  Apache::DBI->connect_on_init('DBI:mysql:test::localhost',
                             "",
                             "",
                             {
                              PrintError => 1, # warn() on errors
                              RaiseError => 0, # don't die on error
                              AutoCommit => 1, # commit executes
                              # immediately
                             }
                            )
  or die "Cannot connect to database: $DBI::errstr";

Here is the Apache::Registry test script that we have used:

 
  preload_dbi.pl
  --------------
  use strict;
  use GTop ();
  use DBI ();
    
  my $dbh = DBI->connect("DBI:mysql:test::localhost",
                         "",
                         "",
                         {
                          PrintError => 1, # warn() on errors
                          RaiseError => 0, # don't die on error
                          AutoCommit => 1, # commit executes
                                           # immediately
                         }
                        )
    or die "Cannot connect to database: $DBI::errstr";
  
  my $r = shift;
  $r->send_http_header('text/plain');
  
  my $do_sql = "show tables";
  my $sth = $dbh->prepare($do_sql);
  $sth->execute();
  my @data = ();
  while (my @row = $sth->fetchrow_array){
    push @data, @row;
  }
  print "Data: @data\n";
  $dbh->disconnect(); # NOP under Apache::DBI
  
  my $proc_mem = GTop->new->proc_mem($$);
  my $size  = $proc_mem->size;
  my $share = $proc_mem->share;
  my $diff  = $size - $share;
  printf "%8s %8s %8s\n", qw(Size Shared Diff);
  printf "%8d %8d %8d (bytes)\n",$size,$share,$diff;

What it does is opening a connection to the database 'test' and issues a query to learn what tables the databases has. When the data is collected and printed the connection would be closed in the regular case, but Apache::DBI overrides it with empty method. When the data is processed a familiar to you already code to print the memory usage follows.

The server was restarted before each new test.

So here are the results of the five tests that were conducted, sorted by the Diff column:

  1. After the first request:

     
      Version     Size   Shared     Diff        Test type
      --------------------------------------------------------------------
            1  3465216  2621440   843776  install_driver
            2  3461120  2609152   851968  install_driver & connect_on_init
            3  3465216  2605056   860160  preload driver
            4  3461120  2494464   966656  nothing added
            5  3461120  2482176   978944  connect_on_init

  2. After the second request (all the subsequent request showed the same results):

     
      Version     Size   Shared    Diff         Test type
      --------------------------------------------------------------------
            1  3469312  2609152   860160  install_driver
            2  3481600  2605056   876544  install_driver & connect_on_init
            3  3469312  2588672   880640  preload driver
            4  3477504  2482176   995328  nothing added
            5  3481600  2469888  1011712  connect_on_init

Now what do we conclude from looking at these numbers. First we see that only after a second reload we get the final memory footprint for a specific request in question (if you pass different arguments the memory usage might and will be different).

But both tables show the same pattern of memory usage. We can clearly see that the real winner is the startup.pl file's version where the MySQL driver was installed (1). Since we want to have a connection ready for the first request made to the freshly spawned child process, we generally use the second version (2) which uses somewhat more memory, but has almost the same number of shared memory pages. The third version only preloads the driver which results in smaller shared memory. The last two versions having nothing initialized (4) and having only the connect_on_init() method used (5). The former is a little bit better than the latter, but both significantly worse than the first two versions.

To remind you why do we look for the smallest value in the column diff, recall the real memory usage formula:

 
  RAM_dedicated_to_mod_perl = diff * number_of_processes
                            + the_processes_with_largest_shared_memory

Notice that the the smaller the diff is, the bigger the number of processes you can have using the same amount of RAM. Therefore every 100K difference counts, when you multiply it by the number of processes. If we take the number from the version version (1) vs (4) and assume that we have 256M of memory dedicated to mod_perl processes we will get the following numbers using the formula derived from the above formula:

 
               RAM - largest_shared_size
  N_of Procs = -------------------------
                        Diff

 
                268435456 - 2609152
  (ver 1)  N =  ------------------- = 309
                      860160

 
                268435456 - 2469888
  (ver 5)  N =  ------------------- = 262
                     1011712

So you can tell the difference (17% more child processes in the first version).

[ TOC ]


Initializing CGI.pm

CGI.pm is a big module that by default postpones the compilation of its methods until they are actually needed, thus making it possible to use it under a slow mod_cgi handler without adding a big overhead. That's not what we want under mod_perl and if you use CGI.pm you should precompile the methods that you are going to use at the server startup in addition to preloading the module. Use the compile method for that:

 
  use CGI;
  CGI->compile(':all');

where you should replace the tag group :all with the real tags and group tags that you are going to use if you want to optimize the memory usage.

We are going to compare the shared memory foot print by using the script which is back compatible with mod_cgi. You will see that you can improve performance of this kind of scripts as well, but if you really want a fast code think about porting it to use Apache::Request for CGI interface and some other module for HTML generation.

So here is the Apache::Registry script that we are going to use to make the comparison:

 
  preload_cgi_pm.pl
  -----------------
  use strict;
  use CGI ();
  use GTop ();

 
  my $q = new CGI;
  print $q->header('text/plain');
  print join "\n", map {"$_ => ".$q->param($_) } $q->param;
  print "\n";
  
  my $proc_mem = GTop->new->proc_mem($$);
  my $size  = $proc_mem->size;
  my $share = $proc_mem->share;
  my $diff  = $size - $share;
  printf "%8s %8s %8s\n", qw(Size Shared Diff);
  printf "%8d %8d %8d (bytes)\n",$size,$share,$diff;

The script initializes the CGI object, sends HTTP header and then print all the arguments and values that were passed to the script if at all. At the end as usual we print the memory usage.

As usual we are going to use a single child process, therefore we will use this setting in httpd.conf:

 
  MinSpareServers 1
  MaxSpareServers 1
  StartServers 1
  MaxClients 1
  MaxRequestsPerChild 100

We are going to run memory benchmarks on three different versions of the startup.pl file. We always preload this module:

 
  use Gtop();

option 1

Leave the file unmodified.

option 2

Preload CGI.pm:

 
  use CGI ();

option 3

Preload CGI.pm and pre-compile the methods that we are going to use in the script:

 
  use CGI ();
  CGI->compile(qw(header param));

The server was restarted before each new test.

So here are the results of the five tests that were conducted, sorted by the Diff column:

  1. After the first request:

     
      Version     Size   Shared     Diff        Test type
      --------------------------------------------------------------------
            1  3321856  2146304  1175552  not preloaded
            2  3321856  2326528   995328  preloaded
            3  3244032  2465792   778240  preloaded & methods+compiled

  2. After the second request (all the subsequent request showed the same results):

     
      Version     Size   Shared    Diff         Test type
      --------------------------------------------------------------------
            1  3325952  2134016  1191936 not preloaded
            2  3325952  2314240  1011712 preloaded
            3  3248128  2445312   802816 preloaded & methods+compiled

The first version shows the results of the script execution when CGI.pm wasn't preloaded. The second version with module preloaded. The third when it's both preloaded and the methods that are going to be used are precompiled at the server startup.

By looking at the version one of the second table we can conclude that, preloading adds about 20K of shared size. As we have mention at the beginning of this section that's how CGI.pm was implemented--to reduce the load overhead. Which means that preloading CGI is almost hardly change a thing. But if we compare the second and the third versions we will see a very significant difference of 207K (1011712-802816), and we have used only a few methods (the header method loads a few more method transparently for a user). Imagine how much memory we are going to save if we are going to precompile all the methods that we are using in other scripts that use CGI.pm and do a little bit more than the script that we have used in the test.

But even in our very simple case using the same formula, what do we see? (assuming that we have 256MB dedicated for mod_perl)

 
               RAM - largest_shared_size
  N_of Procs = -------------------------
                        Diff

 
                268435456 - 2134016
  (ver 1)  N =  ------------------- = 223
                      1191936

 
                268435456 - 2445312
  (ver 3)  N =  ------------------- = 331
                     802816

If we preload CGI.pm and precompile a few methods that we use in the test script, we can have 50% more child processes than when we don't preload and precompile the methods that we are going to use.

META: I've heard that the 3.x generation will be less bloated, so probably I'll have to rerun this using the new version.

[ TOC ]


Memory Swapping is Considered Bad

Swap memory is slow since it resides on the hard disc, which is much slower than the RAM.

Why and when the swapping happens? Your machine starts to swap when there is no more available RAM to use for the process currently using the CPU. To gain the required number of memory pages requested by the process the kernel has to page out (i.e. move to a hard disk's swap partition) exactly the same number of pages. The kernel pages out memory pages using the LRU (least recently used) or a similar algorithm to decide which pages to take out, in attempt to prevent the situation where the next process in queue for CPU will need exactly the pages that were paged out, and prevent the page fault (when the page is not found in RAM) occur, which in turn will require more pages to be swapped out in order to free the memory for the pages that get swapped in.

When the CPU has to page memory pages in and out things slow down, causing processing demands to go up, which in turn slows down the system even more as more memory is required. This ever worstening spiral will lead the machine to halt, unless the resource demand suddenly drops down and allows the processes to catch up with their tasks and go back to normal memory usage.

When tuning the performance of your box, you must configure all the runnning software components in such a way that no memory swapping will occur: even during peak hours. Swap space is an emergency pool, not a resource to be used routinely. If you are low on memory and you badly need it, buy it. Memory is cheap.

For swapping monitoring techniques see the section 'Apache::VMonitor -- Visual System and Apache Server Monitor'.

For the mod_perl specific swapping prevention guideliness see the section 'Choosing MaxClients'.

[ TOC ]


Increasing Shared Memory With mergemem

mergemem is an experimental utility for linux, which looks very interesting for us mod_perl users: http://www.complang.tuwien.ac.at/ulrich/mergemem/

It looks like it could be run periodically on your server to find and merge duplicate pages. There are caveats: it would halt your httpds during the merge (it appears to be very fast, but still ...).

This software comes with a utility called memcmp to tell you how much you might save.

[ReaderMeta]: If you have tried this utility, please let us know what do you think about it! Thanks

[ TOC ]


Forking and Executing Subprocesses from mod_perl

In general you should not fork from your mod_perl scripts, since when you do, you are forking the entire Apache Web server, lock, stock and barrel. Not only is your Perl code and Perl interpreter being duplicated, but so is mod_ssl, mod_rewrite, mod_log, mod_proxy, mod_speling (it's not a typo!) or whatever modules you have used in your server, all the core routines...

[ TOC ]


Spawning a Detachable Sub-Process

A much better approach would be to spawn a sub-process, hand it the information it needs to do the task, and have it detach (close STDIN, STDOUT and STDERR + execute setsid()). This is wise only if the parent which spawns this process immediately continues, and does not wait for the sub-process to complete. This approach is suitable for a situation when you want to use the Web interface to trigger a process which takes a long time, such as processing lots of data or sending email to thousands of registered users (no SPAM please!). Otherwise, you should convert the code into a module, and call its functions and methods from a Perl handler or a CGI script.

Just like with fork(), using the system call system() defeats the whole idea behind mod_perl. The Perl interpreter and modules would be loaded again for this external program to run if it's a Perl program. Remember that the backticks (`program`) and qx(program) variants of system() behave in the same way.

If you really have to use system() then the approach to take is this:

 
  spawn_process.pl
  ----------------
  use FreezeThaw ();
  $params=FreezeThaw::freeze(
        [all data to pass to the other process]
        );
  system("detach_program.pl", $params);

Notice that we do a system() call with arguments separated by commas, rather than passing them all as a single argument. This calling style prevents the unwanted replacement of the shell metacharacters (like *,?), if some happened to be in $params. We will talk about this in a moment.

And now the source of detach_program.pl :

 
  detach_program.pl
  -----------------
  use POSIX qw(setsid);
  use FreezeThaw ();
  @params=FreezeThaw::thaw(shift @ARGV);
  # check that @params is ok
  close STDIN;
  close STDOUT;
  close STDERR;
  # you might need to reopen the STDERR, i.e.
  # open STDERR, ">/dev/null";
  setsid(); # to detach
  # now do something time consuming

At this point, detach_program.pl is running in the "background" while the call system() returns and permits Apache to get on with things.

This has obvious problems:

However, you might be trying to do the "wrong thing". If what you really want is to send information to the browser and then do some post-processing, look into the PerlCleanupHandler directive. The latter allows you to tell the child process after request has been processed and user has received the response.

[META: example of PerlCleanupHandler code?]

[ TOC ]


Gory Details About fork()

Here is what actually happens when you fork() and call system(). Let's take a simple fragment of code:

 
  system("echo","Hi"),CORE::exit(0) unless fork();

The above code which might be more familiar in this form:

 
  if (fork){
    #do nothing
  } else {
    system("echo","Hi");
    CORE::exit(0);
  }

Notice that we use CORE::exit() and not exit(), which would be automatically overriden by Apache::exit() if used in conjunction with Apache::Registry and similar modules.

In our example script the fork() call creates two execution paths, one for the parent and the other for the child. The fork call returns the PID of the spawned child to the parent process and the child process receives 0. Therefore we can rewrite the example as:

 
  if ($pid = fork){
    # I'm parent
    print "parent $pid\n";
    #do nothing
  } else {
    # I'm child
    print "child $pid\n";
    system("echo","Hi");
    CORE::exit(0);
  }

For the sake of completeness, when you write a real code, you should check the return value from fork(), since if it returns undef it means that the call was unsuccessful and no process was spawned. Something that can happen when the system is running too many processes and cannot spawn new ones.

The child shares with parent its memory pages until it has to modify some of them, which triggers a copy-on-write process which copies these pages to the child's domain before the child is allowed to modify them. But this all happens afterwards. At the moment the fork() call executed, the only work to be done before the child process goes on its separate way is setting up the page tables for the virtual memory, which imposes almost no delay at all.

Unfortunately mod_perl is a big process, with a big memory pages table. When the child processes executes system(), lots of things get loaded which creates a big overhead and tremendously slows down the request processing. The whole point of mod_perl is to avoid having to fork() on every request. Perl can do just about anything by itself.

Let's complete the explanation of our example: the parent will immediately continue with the code that comes after the fork (none in our case), while the forked (child) process will execute system("echo","Hi") and then quit.

[ TOC ]


Executing system() in the Right Way

Let's reuse the last example and concentrate on this call:

 
  system("echo","Hi");

Perl will the first argument as a program to execute, find /bin/echo along the search path, and invoke it directly.

Perl's system() is not the system(3) call [C-library]. How the arguments to system() get interpreted? When there is a single argument to system(), it'll be checked for for having shell metacharacters first(like *,?), and if there are any--perl interpreter invokes a real shell program (/bin/sh -c on Unix platforms) which adds another delay. If you pass a list of arguments to system(), they will be not checked, but split into words if required and passed directly to execvp(), which is more efficient. That's a very nice optimization. In other words, only if you do:

 
  system "sh -c 'echo *'"

will the operating system actually exec() a copy of /bin/sh to parse your command. But since one is almost certainly already running somewhere, the system will notice that (via the disk inode reference) and replace your virtual memory page table with one pointing to the existing program code plus your data space.

[ TOC ]


Avoiding Zombie Processes

Now let's talk about zombie processes.

Normally, every process has its parent. Many processes are children of the init process, whose PID is 1. When you fork a process you must wait() or waitpid() for it to finish. If you don't wait() for it, it becomes a zombie.

A zombie is a process that doesn't have a parent. When the child quits, it reports the termination to its parent. If no parent wait()s to collect the exit status of the child, it gets "confused" and becomes a ghost process, that can be seen as a process, but not killed. It will be killed only when you stop the parent process that spawned it!

Generally the ps(1) utility displays these processes with the <defunc> tag, and you will see the zombies counter increment when doing top(). These zombie processes can take up system resources and are generally undesirable.

So the proper way to do a fork is:

 
  my $r = shift;
  $r->send_http_header('text/plain');
  
  defined (my $kid = fork) or die "Cannot fork: $!";
  if ($kid) {
    waitpid($kid,0);
    print "Parent has finished\n";
  } else {
      # do something
      CORE::exit(0);
  }

In most cases the only reason you would want to fork is when you need to spawn a process that will take a long time to complete. So if the Apache process that spawns this new child process has to wait for it to finish, you have gained nothing. You can neither wait for its completion (because you don't have the time to), nor continue because you will get yet another zombie process. This is called a blocking call, since the process is blocked to do anything else before this call gets completed.

The simplest solution is to ignore your dead children. Just add this line before the fork() call:

 
  $SIG{CHLD} = IGNORE;

When you set the CHLD (SIGCHLD in C) signal handler to IGNORE, all the processes will be collected by the init process and are therefore prevented from becoming zombies. This doesn't work everywhere, however. It proved to work at least on Linux OS.

Note that you cannot localize this setting with local(). If you do, it won't have the desired effect.

[META: Anyone like to explain why it doesn't work?]

The other thing that you must do is to close all the pipes to the connection socket that were opened by the parent process (i.e. STDIN and STDOUT) and inherited by the child, so the parent will be able to complete the request and free itself for serving other requests. You may need to close and reopen the STDERR filehandle. It's opened to append to the error_log file as inherited from its parent, so chances are that you will want to leave it untouched.

Of course if your child needs any of the STDIN, STDOUT or STDERR streams you should reopen them. But you must untie the parent process, so you should close them first.

So now the code would look like this:

 
  my $r = shift;
  $r->send_http_header('text/plain');
  
  $SIG{CHLD} = IGNORE;
  
  defined (my $kid = fork) or die "Cannot fork: $!\n";
  if ($kid) {
    print "Parent has finished\n";
  } else {
      close STDIN;
      close STDOUT;
      close STDERR;
      # do something time-consuming
      CORE::exit(0);
  }

Note that waitpid() call has gone. The $SIG{CHLD} = IGNORE; statement protects us from zombies, as explained above.

Another, more portable, but slightly more expensive solution is to use a double fork approach.

 
  my $r = shift;
  $r->send_http_header('text/plain');
  
  defined (my $kid = fork) or die "Cannot fork: $!\n";
  if ($kid) {
    waitpid($kid,0);
  } else {
    defined (my $grandkid = fork) or die "Kid cannot fork: $!\n";
    if ($grandkid) {
      CORE::exit(0);
  
    } else {
      # code here
      close STDIN;
      close STDOUT;
      close STDERR;
      # do something long lasting
      CORE::exit(0);
    }
  }

Grandkid becomes a "child of init", i.e. the child of the process whose PID is 1.

Note that the previous two solutions do allow you to know the exit status of the process, but in our example we didn't care about it.

Another solution is to use a different SIGCHLD handler:

 
  use POSIX 'WNOHANG';
  $SIG{CHLD} = sub { while( waitpid(-1,WNOHANG)>0 ) {} };

Which is useful when you fork() more than one process. The handler could call wait() as well, but for a variety of reasons involving the handling of stopped processes and the rare event in which two children exit at nearly the same moment, the best technique is to call waitpid() in a tight loop with a first argument of -1 and a second argument of WNOHANG. Together these arguments tell waitpid() to reap the next child that's available, and prevent the call from blocking if there happens to be no child ready for reaping. The handler will loop until waitpid() returns a negative number or zero, indicating that no more reapable children remain.

While you test and debug your code that uses one of the above examples, You might want to write some debug information to the error_log file so you know what happens.

Read perlipc manpage for more information about signal handlers.

Check also Apache::SubProcess for better system() and exec() implementations for mod_perl.

META: why is it a better thing to use?

[ TOC ]


OS Specific Parameters for Proxying

Most of the mod_perl enabled servers use a proxy front-end server. This is done in order to avoid serving static objects, and also so that generated output which might be received by slow clients does not cause the heavy but very fast mod_perl servers from idly waiting.

There are very important OS parameters that you might want to change in order to improve the server performance. This topic is discussed in the section: Setting the Buffering Limits on Various OSes

[ TOC ]


Performance Tuning by Tweaking Apache Configuration

Correct configuration of the MinSpareServers, MaxSpareServers, StartServers, MaxClients, and MaxRequestsPerChild parameters is very important. There are no defaults. If they are too low, you will under-use the system's capabilities. If they are too high, the chances are that the server will bring the machine to its knees.

All the above parameters should be specified on the basis of the resources you have. With a plain apache server, it's no big deal if you run many servers since the processes are about 1Mb and don't eat a lot of your RAM. Generally the numbers are even smaller with memory sharing. The situation is different with mod_perl. I have seen mod_perl processes of 20Mb and more. Now if you have MaxClients set to 50: 50x20Mb = 1Gb. Do you have 1Gb of RAM? Maybe not. So how do you tune the parameters? Generally by trying different combinations and benchmarking the server. Again mod_perl processes can be of much smaller size with memory sharing.

Before you start this task you should be armed with the proper weapon. You need the crashme utility, which will load your server with the mod_perl scripts you possess. You need it to have the ability to emulate a multiuser environment and to emulate the behavior of multiple clients calling the mod_perl scripts on your server simultaneously. While there are commercial solutions, you can get away with free ones which do the same job. You can use the ApacheBench ab utility which comes with the Apache distribution, the crashme script which uses LWP::Parallel::UserAgent or httperf (see the Download page).

It is important to make sure that you run the load generator (the client which generates the test requests) on a system that is more powerful than the system being tested. After all we are trying to simulate Internet users, where many users are trying to reach your service at once. Since the number of concurrent users can be quite large, your testing machine must be very powerful and capable of generating a heavy load. Of course you should not run the clients and the server on the same machine. If you do, your test results would be invalid. Clients will eat CPU and memory that should be dedicated to the server, and vice versa.

See also two tools for benchmarking: ApacheBench and crashme test

[ TOC ]


Tuning with ab - ApacheBench

ApacheBench (ab) is a tool for benchmarking your Apache HTTP server. It is designed to give you an idea of the performance that your current Apache installation can give. In particular, it shows you how many requests per second your Apache server is capable of serving. The ab tool comes bundled with the Apache source distribution, and it's free. :)

Let's try it. We will simulate 10 users concurrently requesting a very light script at www.example.com:81/test/test.pl. Each simulated user makes 10 requests.

 
  % ./ab -n 100 -c 10 www.example.com:81/test/test.pl

The results are:

 
  Concurrency Level:      10
  Time taken for tests:   0.715 seconds
  Complete requests:      100
  Failed requests:        0
  Non-2xx responses:      100
  Total transferred:      60700 bytes
  HTML transferred:       31900 bytes
  Requests per second:    139.86
  Transfer rate:          84.90 kb/s received
  
  Connection Times (ms)
                min   avg   max
  Connect:        0     0     3
  Processing:    13    67    71
  Total:         13    67    74

The only numbers we really care about are:

 
  Complete requests:      100
  Failed requests:        0
  Requests per second:    139.86

Let's raise the request load to 100 x 10 (10 users, each makes 100 requests):

 
  % ./ab -n 1000 -c 10 www.example.com:81/perl/access/access.cgi
  Concurrency Level:      10
  Complete requests:      1000
  Failed requests:        0
  Requests per second:    139.76

As expected, nothing changes -- we have the same 10 concurrent users. Now let's raise the number of concurrent users to 50:

 
  % ./ab -n 1000 -c 50 www.example.com:81/perl/access/access.cgi
  Complete requests:      1000
  Failed requests:        0
  Requests per second:    133.01

We see that the server is capable of serving 50 concurrent users at an amazing 133 requests per second! Let's find the upper limit. Using -n 10000 -c 1000 failed to get results (Broken Pipe?). Using -n 10000 -c 500 resulted in 94.82 requests per second. The server's performance went down with the high load.

The above tests were performed with the following configuration:

 
  MinSpareServers 8
  MaxSpareServers 6
  StartServers 10
  MaxClients 50
  MaxRequestsPerChild 1500

Now let's kill each child after it serves a single request. We will use the following configuration:

 
  MinSpareServers 8
  MaxSpareServers 6
  StartServers 10
  MaxClients 100
  MaxRequestsPerChild 1

Simulate 50 users each generating a total of 20 requests:

 
  % ./ab -n 1000 -c 50 www.example.com:81/perl/access/access.cgi

The benchmark timed out with the above configuration.... I watched the output of ps as I ran it, the parent process just wasn't capable of respawning the killed children at that rate. When I raised the MaxRequestsPerChild to 10, I got 8.34 requests per second. Very bad - 18 times slower! You can't benchmark the importance of the MinSpareServers, MaxSpareServers and StartServers with this kind of test.

Now let's reset MaxRequestsPerChild to 1500, but reduce MaxClients to 10 and run the same test:

 
  MinSpareServers 8
  MaxSpareServers 6
  StartServers 10
  MaxClients 10
  MaxRequestsPerChild 1500

I got 27.12 requests per second, which is better but still 4-5 times slower. (I got 133 with MaxClients set to 50.)

Summary: I have tested a few combinations of the server configuration variables (MinSpareServers, MaxSpareServers, StartServers, MaxClients and MaxRequestsPerChild). The results I got are as follows:

MinSpareServers, MaxSpareServers and StartServers are only important for user response times. Sometimes users will have to wait a bit.

The important parameters are MaxClients and MaxRequestsPerChild. MaxClients should be not too big, so it will not abuse your machine's memory resources, and not too small, for if it is your users will be forced to wait for the children to become free to serve them. MaxRequestsPerChild should be as large as possible, to get the full benefit of mod_perl, but watch your server at the beginning to make sure your scripts are not leaking memory, thereby causing your server (and your service) to die very fast.

Also it is important to understand that we didn't test the response times in the tests above, but the ability of the server to respond under a heavy load of requests. If the test script was heavier, the numbers would be different but the conclusions very similar.

The benchmarks were run with:

 
  HW: RS6000, 1Gb RAM
  SW: AIX 4.1.5 . mod_perl 1.16, apache 1.3.3
  Machine running only mysql, httpd docs and mod_perl servers.
  Machine was _completely_ unloaded during the benchmarking.

After each server restart when I changed the server's configuration, I made sure that the scripts were preloaded by fetching a script at least once for every child.

It is important to notice that none of the requests timed out, even if it was kept in the server's queue for more than a minute! That is the way ab works, which is OK for testing purposes but will be unacceptable in the real world - users will not wait for more than five to ten seconds for a request to complete, and the client (i.e. the browser) will time out in a few minutes.

Now let's take a look at some real code whose execution time is more than a few milliseconds. We will do some real testing and collect the data into tables for easier viewing.

I will use the following abbreviations:

 
  NR    = Total Number of Request
  NC    = Concurrency
  MC    = MaxClients
  MRPC  = MaxRequestsPerChild
  RPS   = Requests per second

Running a mod_perl script with lots of mysql queries (the script under test is mysqld limited) (http://www.example.com:81/perl/access/access.cgi?do_sub=query_form), with the configuration:

 
  MinSpareServers        8
  MaxSpareServers       16
  StartServers          10
  MaxClients            50
  MaxRequestsPerChild 5000

gives us:

 
     NR   NC    RPS     comment
  ------------------------------------------------
     10   10    3.33    # not a reliable figure
    100   10    3.94    
   1000   10    4.62    
   1000   50    4.09    

Conclusions: Here I wanted to show that when the application is slow (not due to perl loading, code compilation and execution, but limited by some external operation) it almost does not matter what load we place on the server. The RPS (Requests per second) is almost the same. Given that all the requests have been served, you have the ability to queue the clients, but be aware that anything that goes into the queue means a waiting client and a client (browser) that might time out!

Now we will benchmark the same script without using the mysql (code limited by perl only): (http://www.example.com:81/perl/access/access.cgi), it's the same script but it just returns the HTML form, without making SQL queries.

 
  MinSpareServers        8
  MaxSpareServers       16
  StartServers          10
  MaxClients            50
  MaxRequestsPerChild 5000

 
     NR   NC      RPS   comment
  ------------------------------------------------
     10   10    26.95   # not a reliable figure
    100   10    30.88   
   1000   10    29.31
   1000   50    28.01
   1000  100    29.74
  10000  200    24.92
 100000  400    24.95

Conclusions: This time the script we executed was pure perl (not limited by I/O or mysql), so we see that the server serves the requests much faster. You can see the number of requests per second is almost the same for any load, but goes lower when the number of concurrent clients goes beyond MaxClients. With 25 RPS, the machine simulating a load of 400 concurrent clients will be served in 16 seconds. To be more realistic, assuming a maximum of 100 concurrent clients and 30 requests per second, the client will be served in 3.5 seconds. Pretty good for a highly loaded server.

Now we will use the server to its full capacity, by keeping all MaxClients clients alive all the time and having a big MaxRequestsPerChild, so that no child will be killed during the benchmarking.

 
  MinSpareServers       50
  MaxSpareServers       50
  StartServers          50
  MaxClients            50
  MaxRequestsPerChild 5000
  
     NR   NC      RPS   comment
  ------------------------------------------------
    100   10    32.05
   1000   10    33.14
   1000   50    33.17
   1000  100    31.72
  10000  200    31.60

Conclusion: In this scenario there is no overhead involving the parent server loading new children, all the servers are available, and the only bottleneck is contention for the CPU.

Now we will change MaxClients and watch the results: Let's reduce MaxClients to 10.

 
  MinSpareServers        8
  MaxSpareServers       10
  StartServers          10
  MaxClients            10
  MaxRequestsPerChild 5000
  
     NR   NC      RPS   comment
  ------------------------------------------------
     10   10    23.87   # not a reliable figure
    100   10    32.64 
   1000   10    32.82
   1000   50    30.43
   1000  100    25.68
   1000  500    26.95
   2000  500    32.53

Conclusions: Very little difference! Ten servers were able to serve almost with the same throughput as 50 servers. Why? My guess is because of CPU throttling. It seems that 10 servers were serving requests 5 times faster than when we worked with 50 servers. In that case, each child received its CPU time slice five times less frequently. So having a big value for MaxClients, doesn't mean that the performance will be better. You have just seen the numbers!

Now we will start drastically to reduce MaxRequestsPerChild:

 
  MinSpareServers        8
  MaxSpareServers       16
  StartServers          10
  MaxClients            50

 
     NR   NC    MRPC     RPS    comment
  ------------------------------------------------
    100   10      10    5.77 
    100   10       5    3.32
   1000   50      20    8.92
   1000   50      10    5.47
   1000   50       5    2.83
   1000  100      10    6.51

Conclusions: When we drastically reduce MaxRequestsPerChild, the performance starts to become closer to plain mod_cgi.

Here are the numbers of this run with mod_cgi, for comparison:

 
  MinSpareServers        8
  MaxSpareServers       16
  StartServers          10
  MaxClients            50
  
     NR   NC    RPS     comment
  ------------------------------------------------
    100   10    1.12
   1000   50    1.14
   1000  100    1.13

Conclusion: mod_cgi is much slower. :) In the first test, when NR/NC was 100/10, mod_cgi was capable of 1.12 requests per second. In the same circumstances, mod_perl was capable of 32 requests per second, nearly 30 times faster! In the first test each client waited about 100 seconds to be served. In the second and third tests they waited 1000 seconds!

[ TOC ]


Tuning with httperf

httperf is a utility written by David Mosberger. Just like ApacheBench, it measures the performance of the webserver.

A sample command line is shown below:

 
  httperf --server hostname --port 80 --uri /test.html \
   --rate 150 --num-conn 27000 --num-call 1 --timeout 5

This command causes httperf to use the web server on the host with IP name hostname, running at port 80. The web page being retrieved is /test.html and, in this simple test, the same page is retrieved repeatedly. The rate at which requests are issued is 150 per second. The test involves initiating a total of 27,000 TCP connections and on each connection one HTTP call is performed. A call consists of sending a request and receiving a reply.

The timeout option defines the number of seconds that the client is willing to wait to hear back from the server. If this timeout expires, the tool considers the corresponding call to have failed. Note that with a total of 27,000 connections and a rate of 150 per second, the total test duration will be approximately 180 seconds (27,000/150), independently of what load the server can actually sustain. Here is a result that one might get:

 
     Total: connections 27000 requests 26701 replies 26701 test-duration 179.996 s
    
     Connection rate: 150.0 conn/s (6.7 ms/conn, <=47 concurrent connections)
     Connection time [ms]: min 1.1 avg 5.0 max 315.0 median 2.5 stddev 13.0
     Connection time [ms]: connect 0.3
     
     Request rate: 148.3 req/s (6.7 ms/req)
     Request size [B]: 72.0
     
     Reply rate [replies/s]: min 139.8 avg 148.3 max 150.3 stddev 2.7 (36 samples)
     Reply time [ms]: response 4.6 transfer 0.0
     Reply size [B]: header 222.0 content 1024.0 footer 0.0 (total 1246.0)
     Reply status: 1xx=0 2xx=26701 3xx=0 4xx=0 5xx=0
     
     CPU time [s]: user 55.31 system 124.41 (user 30.7% system 69.1% total 99.8%)
     Net I/O: 190.9 KB/s (1.6*10^6 bps)
     
     Errors: total 299 client-timo 299 socket-timo 0 connrefused 0 connreset 0
     Errors: fd-unavail 0 addrunavail 0 ftab-full 0 other 0

[ TOC ]


Tuning with the crashme Script

This is another crashme suite originally written by Michael Schilli and located at http://www.linux-magazin.de/ausgabe.1998.08/Pounder/pounder.html . I made a few modifications, mostly adding my() operators. I also allowed it to accept more than one url to test, since sometimes you want to test more than one script.

The tool provides the same results as ab above but it also allows you to set the timeout value, so requests will fail if not served within the time out period. You also get values for Latency (seconds per request) and Throughput (requests per second). It can do a complete simulation of your favorite Netscape browser :) a