Like most *nix software, DansGuardian uses "regular expressions" for identifying bits of text that are of interest, both in the URL itself and in the returned content body. During basic configuration of DansGuardian you won't come in contact with regular expressions at all. But configuring many of the more esoteric DansGuardian features involves regular expressions.
“Regular expressions” originated in the field of neurophysiology and were originally mathematically codified as “regular sets”. But that interesting factoid is disjoint from the actual computer use of “regular expressions” (in fact most computer “regular expressions” don't even meet the mathematical requirements of “regular sets” any more). Linux (and hence DansGuardian) “regular expressions” are simply an extremely compact (some would say cryptic:-) pattern matching language used for identification and extraction of interesting bits of text.
Regular expressions generally provide more than one way to say the same thing (or at least very similar things). Different DansGuardian users will typically come up with slightly (or even significantly) different regular expressions to perform the same function. Regular expressions have several considerations beyond just working correctly in the typical case. Do they continue to work correctly (or at least fail gracefully) even when presented with unexpected and wildly incorrect text to match against? Do they provide good performance in all cases (matching input, non-matching input, incorrect input)? Are they easily understood so they can be maintained?
Some folks thrive on understanding and tweaking regular expressions; but many others do not! It's fairly common to see regular expressions treated gingerly. A few folks, when faced with a regular expression that doesn't work right, simply throw it away and start all over. Although this is not recommended, it's certainly understandable.
Because they're so compact (and hence hard to understand), including copious comments with regular expressions is a very good idea. PCRE even offers a way to embed comments right within the regular expressions themselves rather than segregating the comments into separate lines (this capability for embedding comments within regular expressions isn't supported until DansGuardian version 188.8.131.52 though).
So, DansGuardian regular expressions behave a little differently depending on which regular expression library it's using. A non-PCRE version of DansGuardian may treat regular expressions slightly differently on different machines (although most libraries follow the POSIX Extended Regular Expression definition closely and don't actually differ much). A PCRE version of DansGuardian switches into a different (yet upward-compatible) family of regular expressions. It may behave a little differently though depending on which exact version of the PCRE libraries DansGuardian uses at runtime. This variability between systems is why there is generally no one right answer to DansGuardian regular expression-related questions.
If a regular expression includes the sequence (?, using that regular expression will require PCRE. Regular expressions without the sequence (? can be understood and used equally well by both non-PCRE and PCRE DansGuardians.
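The rule of thumb above can be sketched as a one-line check. This is only an illustration in Python; the function name needs_pcre is hypothetical and is not part of DansGuardian.

```python
def needs_pcre(pattern: str) -> bool:
    """Apply the rule of thumb: the sequence '(?' signals a PCRE-only construct."""
    # Caveat: an escaped literal parenthesis followed by '?' also contains
    # the two characters "(?", so this substring check can give a false positive.
    return "(?" in pattern


if __name__ == "__main__":
    print(needs_pcre(r"(?:/)foobar"))  # uses a non-capturing group
    print(needs_pcre(r"(/)foobar"))    # plain capturing group only
```

A check like this is only a quick triage aid when sorting through an existing list of expressions; it says nothing about whether an expression is otherwise valid.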
If given a choice, it's generally better to build a PCRE version of DansGuardian. Everything will be there so any features that might require PCRE regular expressions will work. Your own regular expressions could initially artificially restrict themselves to the simpler POSIX extended regular expressions, then later grow to use the additional constructs provided by PCRE without requiring any change to DansGuardian.
One use of regular expressions in DansGuardian is to pattern match against the URLs typically seen in HTTP GET operations. Below we'll focus on this use of regular expressions, and we'll assume that the full PCRE regular expression capability is available.
When requests first come in, DansGuardian parses the URLs into their component parts, then stores each part separately and references the different parts as necessary. While some parts are stored as strings, other parts (for example, the port) are stored as numbers. Preparatory to pattern matching, DansGuardian reconstitutes a URL-like string from the components. This parse-and-reconstitute behavior canonicalizes (standardizes) URLs, but as a result the reconstituted URL-like string isn't always exactly the same as the original URL string.
The most noticeable difference when comparing what appears in the DansGuardian logs to what appears in the browser's address bar involves the port number. DansGuardian follows the usual convention of omitting the standard port. So if the stored number is 80, :80 does not appear in the URL-like string that regular expressions try to match against. Attempts in DansGuardian to match :80 with a regular expression will always fail (even if the user explicitly typed :80 in the browser's address bar).
The URL-like text DansGuardian matches your regular expressions against is similar to that seen in access.log, but it's not identical. The URL-like text against which patterns are matched differs from the original in important ways, all of which simplify the regular expressions that attempt to match against it. Be aware of these differences from what's in the log:
- The “protocol” is not present at the beginning of the URL. Rather, the URL begins with the actual host/site name.
- Hex character encodings are already interpreted (HTML entities are another matter). So regular expressions can simply try to match against actual characters without worrying too much about alternate representations of them.
- The URL text is all by itself, not embedded in a longer log entry. The beginning of the URL is also the very beginning of text, and the end of the URL is also the very end of text.
- Everything is treated as lower case. There's never any reason to allow for an upper case alternative, not even if the URL as presented in access.log contains upper case characters.
what you see in the log:
... http://Site.or.Host.name.tld%2FLong%20Path%2Fsubpath/filename%2EEXT ...
what the regular expression tries to match:
(Usually if there's interception it's only port 80, and DansGuardian always omits :80 because it's standard. So the net result is if you have a “transparent-intercepting” environment (see Two Configuration Families), you will never see a port number [http://host-or-site.name.tld:nn/…] in URLs you're matching against.)
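To get a feel for the differences described above, here is a rough Python sketch that turns a logged URL into something resembling the URL-like string. It is an approximation for experimenting with patterns, not DansGuardian's actual parsing code; the function name url_like_string is hypothetical.

```python
from urllib.parse import unquote


def url_like_string(logged_url: str) -> str:
    """Approximate the text that patterns are matched against (illustration only)."""
    text = logged_url.strip()
    # 1. Drop the protocol, so the text begins with the host/site name.
    if "://" in text:
        text = text.split("://", 1)[1]
    # 2. Interpret hex (percent) encodings such as %2F and %20.
    text = unquote(text)
    # 3. Treat everything as lower case.
    text = text.lower()
    # 4. Omit the standard port: ":80" at the end of the host part is dropped.
    host, slash, rest = text.partition("/")
    if host.endswith(":80"):
        host = host[:-3]
    return host + slash + rest


if __name__ == "__main__":
    print(url_like_string(
        "http://Site.or.Host.name.tld%2FLong%20Path%2Fsubpath/filename%2EEXT"))
    print(url_like_string("http://Example.com:80/Index.HTML"))
```

Running candidate regular expressions against the output of a helper like this, rather than against raw log lines, avoids a whole class of "why doesn't it match" surprises.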
We'll not try to completely explain regular expressions here, as there are a number of good tutorials on the net. We will though show you the form of some frequently used constructs in DansGuardian. (If you're interested in digging deeply into all the gory details, see the book “Mastering Regular Expressions”, by Jeffrey E.F. Friedl, published by O'Reilly, ISBN 0-596-52812-4, ISBN13 978-0-596-52812-6.)
[If the very first character of a regular expression is slash /, to prevent a common misinterpretation we enclose the slash in parentheses (/) or (?:/).]
[For slightly better performance, we use non-capturing parentheses, which only provide grouping but don't do anything else. They are written as (?:x) rather than just (x). This is only possible with PCRE; in non-PCRE versions of DansGuardian just use regular capturing parentheses (x) instead.]
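The difference between the two kinds of parentheses can be seen in a small Python sketch. Python's re module is not PCRE, but it accepts the same (?:x) syntax; the sample text is made up for illustration.

```python
import re

text = "example.com/foobar"

# Capturing parentheses: the slash is grouped and also remembered as group 1.
capturing = re.search(r"(/)foobar", text)
# Non-capturing parentheses: the slash is grouped but nothing is remembered.
non_capturing = re.search(r"(?:/)foobar", text)

if __name__ == "__main__":
    print(capturing.group(0), capturing.groups())
    print(non_capturing.group(0), non_capturing.groups())
```

Both forms match exactly the same text; only the bookkeeping differs, which is why the non-capturing form can be swapped in wherever the captured piece is never used.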
To match “foobar” anywhere in the entire URL:
To match “foobar” only if it appears as the beginning or ending of a word anywhere in the entire URL:
To match “foobar” only if it's a whole word anywhere in the entire URL:
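As a hedged sketch of the three whole-URL cases above, the following Python snippet uses \b (word boundary) patterns. The patterns are our own illustrations and may not be the exact forms recommended for DansGuardian; Python's re module stands in for PCRE, and the sample URLs are made up.

```python
import re

ANYWHERE = re.compile(r"foobar")               # anywhere in the URL
WORD_EDGE = re.compile(r"\bfoobar|foobar\b")   # at the beginning or ending of a word
WHOLE_WORD = re.compile(r"\bfoobar\b")         # only as a whole word


def classify(url: str) -> tuple:
    """Return (anywhere, word_edge, whole_word) match results as booleans."""
    return (
        ANYWHERE.search(url) is not None,
        WORD_EDGE.search(url) is not None,
        WHOLE_WORD.search(url) is not None,
    )


if __name__ == "__main__":
    for sample in ("example.com/myfoobars", "example.com/foobars", "example.com/foobar/x"):
        print(sample, classify(sample))
```

Note that \b treats letters, digits, and the underscore as word characters, so "foo_foobar" does not contain "foobar" at a word boundary.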
To match “foobar” anywhere in the host/site name:
To match “foobar” only if it appears as the beginning or ending of a part of the host/site name:
To match “foobar” only if it's a whole part of the host/site name:
To match “foobar” anywhere in the path name only:
To match “foobar” only if it appears as the beginning or ending of a part of the path name:
To match “foobar” only if it's a whole part of the path name:
To match “foobar” anywhere in either the host/site name or the path name:
To match “foobar” only if it appears as the beginning or ending of a part of either the host/site name or the path name:
To match “foobar” only if it's a whole part of either the host/site name or the path name:
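One way to restrict a match to the host/site name or to the path name is to anchor at the very beginning of the text and count on the first slash as the dividing line. The Python sketch below shows that idea for the "anywhere" and "whole part" cases only. The patterns are our own assumptions, not necessarily the forms DansGuardian users typically write, and Python's re module stands in for PCRE.

```python
import re

# Host/site name: everything before the first slash (or question mark).
HOST_ANYWHERE = re.compile(r"^[^/?]*foobar")
HOST_WHOLE_PART = re.compile(r"^(?:[^/?]*\.)?foobar(?:[.:/?]|$)")

# Path name: after the first slash, but before any question mark.
PATH_ANYWHERE = re.compile(r"^[^/?]*/[^?]*foobar")
PATH_WHOLE_PART = re.compile(r"^[^/?]*/(?:[^?]*/)?foobar(?:[/?]|$)")


def matches(pattern: re.Pattern, url_like: str) -> bool:
    """True if the pattern is found in the URL-like string."""
    return pattern.search(url_like) is not None


if __name__ == "__main__":
    for sample in ("www.foobar.com/index.html", "example.com/a/foobar/b.html"):
        print(sample,
              matches(HOST_ANYWHERE, sample), matches(HOST_WHOLE_PART, sample),
              matches(PATH_ANYWHERE, sample), matches(PATH_WHOLE_PART, sample))
```

Here a "whole part" of the host is taken to mean one dot-separated label, and a "whole part" of the path one slash-separated segment, so "foobar.html" does not count as the whole part "foobar".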
Matches against the “query” part of the URL (the part after the ?) are also possible, but are not included here because they're much less frequently used, can be a little more complex, and are subject to misinterpretation if a URL has been embedded as a query parameter inside another URL.
(The \b construct is specific to PCRE. Non-PCRE regular expressions can do the same things, but they will be more complex.)
Another use of regular expressions in DansGuardian is to provide intelligent replacement of text, either in the URL (urlregexplist) or in the returned page content (contentregexplist). Note that for this use, the “protocol” part is present at the beginning of the URL (unlike for bannedregexp… or exceptionregexp… matching).
Here matched pieces of the original text are available for use as parts of the replacement: the contents of the first set of parentheses is available as \1, the contents of the second set of parentheses is available as \2, and so forth, and the content of the entire match is available as &. The PCRE library is hosted inside DansGuardian rather than inside Perl itself, so even though PCRE stands for Perl-Compatible…, the Perlish replacement parameters $1, $2, $&, etc. are not required (nor is $& even possible).
For these uses the pattern matching regular expression is enclosed in double quotes " to the left of ->, and the replacement pattern (probably using parts captured by the match to the left) is enclosed in double quotes " to the right of -> (don't insert any white space anywhere).
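To experiment with the two-sided form outside DansGuardian, a rule line can be split into its pattern and replacement halves and tried out in Python. This is only a sketch: parse_rule and apply_rule are hypothetical helpers, not DansGuardian's parser, the example.com rule is made up, and only the numbered references \1, \2, and so on are covered (Python's re.sub does not treat & as the whole match).

```python
import re
from typing import Tuple


def parse_rule(line: str) -> Tuple[str, str]:
    """Split a line of the form "pattern"->"replacement" into its two halves."""
    left, arrow, right = line.strip().partition('"->"')
    if not arrow or not left.startswith('"') or not right.endswith('"'):
        raise ValueError('expected "pattern"->"replacement"')
    # Strip the outer double quotes that remain on each half.
    return left[1:], right[:-1]


def apply_rule(line: str, text: str) -> str:
    """Apply one replacement rule to the given text."""
    pattern, replacement = parse_rule(line)
    return re.sub(pattern, replacement, text)


if __name__ == "__main__":
    # Hypothetical rule: drop a leading "www." from example.com URLs.
    rule = r'"(^http://)(www\.)?(example\.com/.*$)"->"\1\3"'
    print(apply_rule(rule, "http://www.example.com/page"))
```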
Here's an example of tweaking a URL by replacing some bits of it in such a way as to force Google “safe search” to be on:
# first remove any possible existing safe=… parameter (which might say safe=off)
# then in any case add safe=on as the first query parameter
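The two steps described in the comments above can be tried out in Python as follows. This is a sketch under our own assumptions: the two patterns are illustrative rather than the exact DansGuardian rules, the sample URLs are made up, and Python's re.sub stands in for DansGuardian's replacement engine.

```python
import re

# Shared prefix: a Google host, any path, up to and including the "?".
GOOGLE_QUERY = r"(^http://[^/]*\.google\.[^?]*\?)"

# Step 1: remove any existing safe=... parameter (which might say safe=off).
REMOVE_SAFE = re.compile(GOOGLE_QUERY + r"(.*&)?safe=[^&]*&?(.*$)")
# Step 2: in any case add safe=on as the first query parameter.
ADD_SAFE = re.compile(GOOGLE_QUERY + r"(.*$)")


def force_safe_search(url: str) -> str:
    """Apply both steps in order; URLs that are not Google queries pass through."""
    url = REMOVE_SAFE.sub(r"\1\2\3", url)
    return ADD_SAFE.sub(r"\1safe=on&\2", url)


if __name__ == "__main__":
    print(force_safe_search("http://www.google.com/search?q=foo&safe=off&hl=en"))
    print(force_safe_search("http://www.google.com/search?q=foo"))
```

Neither pattern needs non-greedy matching or any other PCRE-only construct. One limitation of this sketch: a single pass of step 1 removes only one safe= parameter, so a URL that repeats the parameter would need the step applied more than once.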
(The above two-step process to tweak the URL is straightforward and works even without PCRE. If PCRE features are available, non-greedy matching can be used to perform the whole operation in one step as follows, although the increased complexity of and risk of errors in the regular expression may not be warranted:)
# remove any existing safe=… parameter, and replace it with safe=on
"(^http://[^/]*\.google\.[^?]*\?)(.+?&)?(safe=[^&]*)?(.*$)"->"\1\2safe=on\4"
Modifications to returned page content use a similar syntax. For example:
# play with users' minds a bit