The “access.log” written by DansGuardian contains a wealth of information. There's usually a separate entry for every single request (whether allowed or denied). Each entry includes lots of information about the request, including a detailed explanation of why it was handled the way it was.
These logs are useful for many different things, including:
- troubleshooting (why is a particular request being handled the way it is?)
- providing information to users retrospectively (answering the user who asks “one of my requests was denied about 80 minutes ago, why?”)
- summarizing system usage (via a log analysis program)
- searching for patterns of abuse (for example requests from the same terminal at about the same time almost every day, usually starting with a request to a particular forum)
- detailed reporting of the previous day's system usage (what sites were visited? what searches were performed? what kinds of sites are being most frequently blocked?)
- locating uses of open proxies
- finding out how users are locating open proxies
- providing an “audit trail” in support of disciplinary (or even legal) action
Several DansGuardian option settings change the contents of each log entry a little. Only a few of them cause more or fewer entries; most just make changes inside each entry. Few of them change the order of the datums in the entry; the rest only add or remove some datums or occasionally change the exact content of one.
The options that can have a massive effect are 'loglevel', 'logexceptionhits', and 'logfileformat'. 'loglevel' can omit all the “exception” entries or even eliminate the log altogether (although this isn't recommended). 'logfileformat=3' will eliminate many datums from a log entry, add a few, and rearrange the result so it looks like a “Squid” log. This might be useful for using Squid log analysis tools (but it's probably better to simply point the log analysis tool at the Squid stub logs); on the other hand it omits most DansGuardian-specific datums from the logs and so severely reduces their usefulness.
Blocks that result from a list with #listcategory “ADs” (note well the upper case A and D and lower case s, which are all required) are not included in the log (assuming the default “logadblocks = off” in dansguardian.conf). This allows the size of the logs to be (considerably) reduced even though some advertisements are being blocked.
If you might publicize the logs and are worried about exposing personally identifiable information, you can set an option (“anonymizelogs = on”) to suppress a few of these datums from ever getting into the logs in the first place. Note though that doing so makes the logs significantly less useful for several purposes (including troubleshooting). Decide what you want (or what's legally required); there's no one “right” answer worldwide. Even consider flipping the option back and forth for different purposes: 'off' during troubleshooting and 'on' during production.
Log rotation (including compression and deletion) usually happens automatically using whatever mechanism your distribution supplies. If you identify a particular log as containing information you might need in the future (for example evidence of an internal attempt to bypass DansGuardian), copy that file to another location. Don't expect the original to still be there when you look again tomorrow or next week.
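Setting a log aside before rotation removes it can be scripted. The following is only a minimal sketch: the function name `preserve_log` and the timestamp prefix on the copied file are illustrative choices, not anything DansGuardian provides, and the paths you pass in depend entirely on your installation.

```python
import shutil
import time
from pathlib import Path


def preserve_log(log_path, archive_dir):
    """Copy one log file into archive_dir so log rotation cannot delete it.

    Hypothetical helper for illustration; both arguments are whatever
    paths apply on your system.
    """
    src = Path(log_path)
    dest_dir = Path(archive_dir)
    # Create the archive directory (and any parents) if it is missing.
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Prefix the copy with the current time so repeated copies of
    # identically named logs do not overwrite each other.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"{stamp}-{src.name}"
    # copy2 copies the contents and also keeps the modification time,
    # which can matter if the file is later used as evidence.
    shutil.copy2(src, dest)
    return dest
```

Calling `preserve_log(path_to_rotated_log, path_to_safe_directory)` returns the path of the copy; the original file is left in place for the normal rotation mechanism to handle.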
A log entry looks more daunting than it really is, partly because there are no legends, partly because unknown or irrelevant fields are often just skipped (so you can't simply “get the 11th entry”), and partly because a single field sometimes contains several bits of data. But it's really not so bad: the bits of information (if present) are always in the same order, and several of them are obvious markers simply by their form. For example, the HTTP “return code” is always exactly a positive three-digit number, which sticks out clearly.
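Recognizing fields by their form rather than by their position can be sketched in a few lines. This is only an illustration under stated assumptions: it assumes fields are separated by whitespace, the sample line is invented rather than copied from a real log, and `spot_by_form` is a hypothetical helper name. It looks for three shapes that are hard to mistake: a dotted IPv4 address, a token containing `://`, and one of the HTTP request verbs.

```python
import re

# Verbs that typically appear in the request-verb field.
HTTP_VERBS = {"GET", "POST", "HEAD"}
# A token consisting of exactly four dot-separated groups of 1-3 digits.
IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")

# Invented sample entry (NOT taken from a real DansGuardian log).
SAMPLE = "- 192.168.1.20 http://example.com/search?q=test *DENIED* GET 0 403 text/html"


def spot_by_form(line):
    """Pick out a few fields by shape instead of by position.

    Assumes whitespace-separated fields; only the first token of each
    shape is kept, since later look-alikes may belong to other fields.
    """
    found = {"ip": None, "url": None, "verb": None}
    for token in line.split():
        if found["ip"] is None and IPV4.match(token):
            found["ip"] = token
        elif found["url"] is None and "://" in token:
            found["url"] = token
        elif found["verb"] is None and token in HTTP_VERBS:
            found["verb"] = token
    return found
```

Because skipped fields shift every later position, anchoring on shapes like these tends to be more robust than counting fields, although a real log analysis tool would need to handle many more cases.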
The datums in one log entry (in order) are:
- requesting user or computer
  - if an “authplugin” has identified a user or computer, otherwise just a dash
- requesting IP address
  - (watch out for DHCP networks where computers sometimes change IP addresses)
- complete requested URL
  - often much of this is hidden from the user
  - typically includes search terms
- items like *URLMOD*, *CONTENTMOD*, *SCANNED*, *INFECTED*, ending with either *DENIED* or *EXCEPTED*
  - *URLMOD* means urlregexplist tweaked the outgoing request (often used to force “safesearch” on)
  - *CONTENTMOD* means contentregexplist tweaked the incoming content (sometimes used to replace offensive words with less offensive ones, but its use probably interferes with downloads, thus precluding them)
- an elaboration on the action
- more details about the action, for example the actual regular expressions
- the HTTP request verb, usually either GET or POST (or HEAD)
- the size in bytes of the document (if it was fetched)
- the sum of all the weighted phrase scores, which is the calculated naughtyness value
- contents of the #listcategory tag (if any) in the list that's most relevant to the action
- filter group number
  - the filter group (1 ⇒ f1, 2 ⇒ f2, etc.) the request was assigned to
- HTTP return code
  - always a three-digit number, usually 200 if everything went okay
- the MIME type of the document according to the website, usually “text/html” for webpages
- if configured, the result of performing a reverse DNS lookup on the requestor's IP address
  - highly network dependent, meaningful on only some networks
- filter group name
  - a more convenient presentation of the same information as the filter group number
  - only present if “groupname = …” is specified in each dansguardianfN.conf file
- browser “user agent” string
  - sometimes interesting and useful information
  - (note though that because this is so easily spoofed, it should not be used for any sort of security)
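The asterisk-wrapped action markers in the list above are easy to pull out mechanically, which gives a quick way to count how requests were handled. The sketch below is hedged accordingly: the helper names are hypothetical, the pattern assumes markers are upper-case letters between two asterisks, and labeling an entry with neither *DENIED* nor *EXCEPTED* as "other" is an assumption rather than documented behavior.

```python
import re
from collections import Counter

# An upper-case word wrapped in asterisks, e.g. *DENIED* or *URLMOD*.
# Assumption: nothing else in the entry (such as the URL) has this shape.
MARKER = re.compile(r"\*([A-Z]+)\*")


def action_markers(line):
    """Return the marker names found in one log entry, in order."""
    return MARKER.findall(line)


def disposition(line):
    """Classify an entry by the marker that ends the action field."""
    markers = action_markers(line)
    if "DENIED" in markers:
        return "denied"
    if "EXCEPTED" in markers:
        return "excepted"
    # Neither final marker present; what that means is an assumption here.
    return "other"


def tally(lines):
    """Count entries per disposition across an iterable of log lines."""
    return Counter(disposition(line) for line in lines)
```

Passing an open log file to `tally` (a file object iterates over its lines) yields counts keyed by "denied", "excepted", and "other", a rough starting point for the kind of usage summary a full log analysis program would produce.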