If your DansGuardian runs under NetBSD/FreeBSD/OpenBSD and operation seems less reliable than it should be, here are some possible fixes.
For years there has been a low level but naggingly persistent series of reports that DansGuardian doesn't run as reliably under BSD-derived kernels (NetBSD/FreeBSD/OpenBSD) as it does under Linux. Most users never see any problem at all …but a few unlucky ones do. Occasional failure of a DansGuardian child process may be tolerable, as recovery is automatic and the jerky operation is visible to only a single user. However frequent failure of all (or at least most) DansGuardian child processes, or failure of the DansGuardian parent process, will not be tolerable. (Also see questions Installation#26 and Installation#26b in the Wiki FAQ.)
To put it as briefly as possible (perhaps oversimplifying), BSD-derived kernels may need to be tuned in order to obtain stable DansGuardian operation. If there are kernel issues, DansGuardian is likely to start up and run for a while but then fail/crash with a segmentation fault (SEGV, SIGSEGV), usually in an “impossible” location. The kernel should be tuned for “peak” conditions; if the kernel's been closely tuned for “average” conditions (or worse, tuned to “minimize” kernel size), unstable DansGuardian operation is almost inevitable! Performance monitoring tools such as `top` may be misleading, as the spikes of activity that affect DansGuardian are far shorter than even the tools' minimum sample time.
Fortunately even though the problem has not yet been completely pinned down, it's fairly well understood. The current OpenBSD kernel doesn't handle a couple of conditions as well as a typical Linux kernel. One of those problem conditions is sustained high load; the other condition is long-lived processes whose memory address space gets very fragmented (usually because they handle lots and lots of different small requests). The three common applications that are most likely to expose these kernel limitations are Apache (the web server), Squid, and DansGuardian.
The programmers behind OpenBSD are very aware of these problems, and keep fiddling with the problematic parts of their kernel. As a result, the exact failure symptoms can change considerably from one kernel version to another. The kernel itself might panic (thus shutting down the entire system), or an application may just hang, or an application may disappear without proper warning or notice, or an application may shut itself down after receiving more failure return codes than it can handle.
The chief problems seem to be i) an apparent shortage of memory because of massive address space fragmentation and ii) a lack of socket structures. It may also be the case that iii) there are not enough “file descriptor” structures. Recoding applications to better handle the known OpenBSD limitations does not seem to be a reasonable option, as it would probably both a) require the tremendous effort of a complete rewrite and b) just trade stability under OpenBSD for instability under Linux.
Such problems are performance-related; the faster an application runs, the less time the kernel has to cover over these flaws before they grow large enough to become visible. Since performance is continually improved in most applications, later versions of most applications tend to expose the problems more readily.
Problems are often noticed right after an application upgrade. Administrators focus more attention on the application right after an upgrade. The system load level has likely risen over time, but slowly enough that nobody noticed. Newer application versions typically provide somewhat better performance, making it more likely the kernel will exhibit problems. And other applications and services may have been changed at the same time. As a result of all these things, it's easy to wrongly conclude the problems have something to do with a “bug” recently introduced into the application.
If you're one of the unlucky ones, you could of course either switch away from OpenBSD or learn to live with the occasional problem. But it's quite likely neither of these options is desirable. So what else can you do? Here's a thorough list of suggestions; most likely the first thing you should try is tuning the kernel, as that alone may completely resolve the problem. (Some suggestions mainly address high load, and probably won't help the memory fragmentation problem very much. Some suggestions mainly address process memory, and probably won't help very much if you suffer from frequent overloads.) Find the suggestions that fit your situation, and pursue them.
- Upgrade your kernel
Each kernel version seems to be an improvement over the previous one. (The problems may not be completely fixed yet though; problems have been reported on kernels at least as late as version 4.3 and perhaps later.)
- Add RAM
This helps the problem in three different ways. First, newer OpenBSD kernels (but not older ones) reconfigure themselves every boot depending on how much RAM they see; if there's more RAM, all the kernel configuration options are increased. Second, more memory allows applications to spread out a little more so problems don't become visible quite so soon. And third, more memory makes everything run a little faster, including the kernel which has a bit more time to repair small problems before they grow too large.
- Purposely de-tune DansGuardian
If DansGuardian doesn't handle requests quite as quickly, the kernel will have more time to cover its errors before they get out of hand. Whatever you did to improve DansGuardian performance, undo parts of it.
- Downgrade Squid
There's been a report that downgrading from a Squid 3.x version to a Squid 2.x version made DansGuardian operation stable. Most likely this is because earlier versions of Squid either have a different pattern of memory usage or use fewer shared memory structures for IPC and so don't tickle the underlying problem. Regardless of exactly why it sometimes works, this relatively quick and easy possible solution is worth trying.
- Cause DansGuardian child tasks to stop and restart more frequently (or alternatively less frequently)
The idea behind stopping and restarting tasks more frequently is to reduce memory fragmentation and its subsequent problems. But restarting processes may not reduce memory fragmentation after all! So try this and see what happens. If there's no improvement, return the variables to their original values. Reduce minsparechildren, maxsparechildren, and preforkchildren. And reduce maxagechildren, perhaps to 300 or even 200. (Doing this will almost certainly have the side effect of reducing performance, perhaps noticeably.)
Sometimes the opposite change of stopping and restarting child tasks less frequently will improve stability. So also experiment with increasing minsparechildren, maxsparechildren, and preforkchildren, and greatly increasing maxagechildren, perhaps to 10000 or even 50000.
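As a concrete sketch, the relevant dansguardian.conf settings for the “restart more frequently” experiment might look like this; the values shown are illustrative starting points, not recommendations:

```
# dansguardian.conf -- illustrative values for the "restart more often" experiment
minsparechildren = 4
maxsparechildren = 8
preforkchildren = 4
maxagechildren = 300    # retire each child after roughly 300 connections
```

For the opposite experiment, raise the same three child-count settings and greatly increase maxagechildren (e.g. to 10000), then compare stability over a few days.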
- Reduce the average load
Maybe you can tune other things or provide other capabilities so your users don't hit the web quite so hard. Or maybe you'll just have to change your users' behavior …if you can; if you're not sure, it may be better not to even try. Maybe users can be persuaded to drop their load just a few percent. But then again they may not change at all, until one day they suddenly drop out altogether (and, even worse, don't come back).
- Cap the peak system load
Use the DansGuardian maxips parameter to set a hard limit on how many computers can access the web at the same time. Set the number slightly lower than current peaks: high enough to not overly inconvenience users, but low enough to provide the desired reliability.
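For instance, if you typically see peaks of around 60 distinct client machines, a hypothetical setting might be:

```
# dansguardian.conf -- cap the number of concurrent client IPs (example value only)
maxips = 50
```

Clients beyond the limit are turned away, so pick a value high enough that legitimate users rarely hit it during normal peaks.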
- Manually tune the kernel (this is frequently required for BSD-derived kernels)
Perhaps all you need to do is increase maxusers (kern.maxusers) beyond 256, as most other parameters are derived from it. (Skip even this step if you have a kernel that automatically adjusts maxusers at boot depending on how much RAM it finds.) If you want to get more detailed, consider increasing OPEN_MAX, BUFCACHEPERCENT, MAX_KMAPENT, NKMEMPAGES, and NKMEMPAGES_MAX, and decreasing NMBCLUSTERS.
If just tailoring maxusers doesn't produce satisfactory results, you may need to go further and tune individual items. Pay attention to the number of network socket structures, the number of file descriptors, and the number of tasks. Especially (and perhaps surprisingly) pay attention to the number of shared memory structures. DansGuardian/Squid uses significantly more “shared memory” than most other server applications, including the IPC communication between DansGuardian and its backend proxy and the IPC communication between the DansGuardian parent and child processes.
Remember you're tuning for peak conditions (not average conditions). Even a performance monitoring tool that displays every second won't show conditions that last less than 100 milliseconds. Yet these short load spikes on an inadequately tuned kernel may be the cause of DansGuardian crashing.
Only change runtime values; do not rebuild a kernel (except as a last resort, and only if you really know what you're doing). Rebuilding OpenBSD kernels is no longer recommended (or even acceptable in most cases). Currently, manually re-tuning a BSD-derived kernel usually involves either the sysctl command or modifying the file /etc/sysctl.conf.
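As a sketch of what such runtime tuning might look like, here is a hypothetical /etc/sysctl.conf fragment; the parameter names are OpenBSD-style, the values are purely illustrative, and you should verify both against your own platform's sysctl(8) documentation before copying anything:

```
# /etc/sysctl.conf -- illustrative values only; verify names and defaults first
kern.maxfiles=16384            # system-wide file descriptor limit
kern.maxproc=4096              # maximum number of processes
kern.somaxconn=1024            # listen-queue depth for incoming connections
kern.shminfo.shmmax=67108864   # largest shared memory segment, in bytes
kern.shminfo.shmseg=128        # shared memory segments per process
```

The same names can be changed immediately (until reboot) with the sysctl command, e.g. `sysctl kern.maxfiles=16384`.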
- Use an older version of DansGuardian
If you have a borderline case (DansGuardian doesn't fail very often, and just a very small improvement in reliability would be enough), an easy way to slightly de-tune DansGuardian may be to run an older version which does not include recent performance improvements.
(Note this may not be reasonably possible. All 2.10.x.x versions of DansGuardian provide such similar performance that changing would not be worthwhile. And the 2.8.x.x versions of DansGuardian are now several years old.)
- Add an auto-restart capability (for example a 'cron' auto-restart script)
Have a script wake up every few minutes and check if DansGuardian is still responsive. If not, stop both DansGuardian and Squid then start them fresh (Squid first then DansGuardian).
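A check-and-restart script along those lines could be sketched in Python roughly as follows; the port number, host, and rc.d script paths are assumptions, so adjust them to your installation:

```python
#!/usr/bin/env python3
"""Watchdog sketch: run from cron every few minutes.

If DansGuardian's listening port does not accept a TCP connection,
stop both daemons and start them fresh (Squid first, then DansGuardian).
All paths and the port below are assumptions -- adjust to your install.
"""
import socket
import subprocess

DG_HOST, DG_PORT = "127.0.0.1", 8080   # assumed DansGuardian filter port

# Assumed rc.d script locations; substitute your platform's service commands.
RESTART_SEQUENCE = [
    ["/etc/rc.d/dansguardian", "stop"],
    ["/etc/rc.d/squid", "stop"],
    ["/etc/rc.d/squid", "start"],
    ["/etc/rc.d/dansguardian", "start"],
]


def is_responsive(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def restart_pair() -> None:
    """Run the stop/start sequence, continuing even if one step fails."""
    for cmd in RESTART_SEQUENCE:
        try:
            subprocess.run(cmd, check=False)
        except OSError:
            pass  # rc.d script missing on this host; skip that step


if __name__ == "__main__":
    if not is_responsive(DG_HOST, DG_PORT):
        restart_pair()
```

Install it with a crontab line such as `*/5 * * * * root /usr/local/sbin/dg-watchdog.py` (path and interval are examples).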
- Add a 'cron' periodic restart script
Have a script wake up once in a while (say every day or every few hours), and forcibly stop both DansGuardian and Squid then start them again (Squid first, then DansGuardian) no matter what. This will ensure that there's a properly functioning DansGuardian most of the time even if the server is unattended. It may however dramatically inconvenience users by aborting their web connection once in a while.
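A hypothetical crontab entry for a forced daily restart might look like this; the 04:00 time and the rc.d paths are assumptions, so pick a low-traffic hour for your site and use your platform's service commands:

```
# /etc/crontab -- forced daily restart at 04:00 (illustrative time and paths)
0 4 * * * root /etc/rc.d/dansguardian stop; /etc/rc.d/squid stop; /etc/rc.d/squid start; /etc/rc.d/dansguardian start
```

The steps are joined with `;` rather than `&&` so that a failed stop (e.g. a daemon that has already died) does not prevent the fresh start.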