This document explains in some detail several of the interesting mysteries of how Download Managers operate. In some areas its coverage is quite thorough, yet other areas aren't covered at all.
If you simply want to configure your system in one of the typical ways, consult Configuring Download Managers instead of this document.
For every file transferred from any webserver to any browser, a download manager is chosen and made ready to activate itself. During typical web browsing though you never notice this, as the download managers never actually become active.
(To repeat: A download manager will always be selected if the file type and configuration match and the file must be transferred [the file won't be transferred under various circumstances, mostly “not modified”]. But you'll only see a download manager for “large” files, and in most cases only when an anti-virus scanner is active. The download manager doesn't do anything until initialtrickledelay expires while the file is not completely handled yet; if DansGuardian completes transferring the file within initialtrickledelay seconds, the download manager will never do anything [and the user probably won't even realize it was ready].)
When a large file is being transferred and it must be anti-virus scanned first, there is a concern about the browser “timing out”, and there is a concern about keeping the user informed so they don't mistakenly conclude the browser is “hung”. These issues can be handled in more than one way, so more than one download manager is available. In fact there are three standard ones.
- The fancy download manager presents a progress bar to the user and updates it periodically. Then when the file has been completely transferred as far as DansGuardian and scanned as necessary, the fancy download manager presents to the user a hyperlink that will fetch the file from the disk where DansGuardian has temporarily stored it. This both keeps the user informed and presents the browser with regular bits of activity so it doesn't “time out”. The apparent superiority of this scheme is counteracted by some potential drawbacks:
  - This may (but probably won't) cause an unacceptably high level of local network traffic.
  - For unattended downloads such as software updates, the progress bar information is not only useless but may actually corrupt the downloaded data. (A simple scheme that prevents such corruption by activating the “fancy” download manager only for interactive use is provided and is recommended. However, it may require a bit of configuration attention to activate; it doesn't always “just work”.)
  - The requirement to click on an additional hyperlink may be seen by some users not as user friendliness but rather as a nuisance requiring extra training.
- The trickle download manager doesn't display anything to the user. It sends the next byte of data to the client every so often, thus hopefully preventing the client from “timing out”.
  - This may be needed by some update programs, but is not as good as the “default” download manager with known interactive browsers.
  - After even just one byte of data has been sent, it's not possible to display an error message page later. So with this Download Manager, if scanning finds a virus, the download will be terminated but there will be no indication at all to the user that there was a problem.
- The default download manager also doesn't display anything at all to the user during normal operation. It sends a fake HTTP header “X-DGKeepAlive: on” to the browser every so often, thus in a different way preventing the browser from “timing out”.
  - If scanning finds a virus, the download will be terminated and the user's browser will be directed to the “Access Denied” page so the user will be aware there was a problem.
For each transaction, DansGuardian goes through the list of available download managers in order, asking each one whether it should handle the request. The first affirmative response identifies the download manager that will be assigned to handle that request. If this process gets all the way to the last (bottom) download manager, the last download manager is forced to accept the request. (If it were allowed to respond negatively, there could be no download manager at all for the request, so the file transfer couldn't take place. This situation would not be acceptable and so isn't allowed to happen.) At least one download manager must be configured as being active.
First each download manager checks the “user agent string” that identifies the requester. Generally this string can be used to separate interactive browsers (IE, Firefox, etc.) from batch programs (wget, Curl, etc.). Most often it allows the “fancy” download manager to accept only interactive requests and pass batch requests on to the next available download manager.
Second each download manager checks the requested file extension and/or mimetype. If neither of these lists is configured for this download manager, everything is considered to match. The default arrangement is for all download managers to use the same lists/downloadmanagers/managedextensionlist and/or lists/downloadmanagers/managedmimetypelist (or none at all) so they all make similar decisions. (The similarity isn't required though. Each download manager could have its own control lists, although such a configuration would probably be difficult to maintain and isn't necessary.)
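The selection process described above can be sketched roughly as follows. This is only an illustration in Python, not DansGuardian's actual code: the `DownloadManager` class, its field names, and `select_manager` are hypothetical, and treating a missing useragentregexp as "matches everything" is an assumption.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DownloadManager:
    """Hypothetical stand-in for one configured download manager."""
    name: str
    useragentregexp: Optional[str] = None  # None = left commented out
    managed_extensions: set = field(default_factory=set)
    managed_mimetypes: set = field(default_factory=set)

    def accepts(self, user_agent: str, extension: str, mimetype: str) -> bool:
        # First check: the user agent string (case-insensitive search).
        # Assumption: no regexp configured means every client matches.
        if self.useragentregexp is not None:
            if not re.search(self.useragentregexp, user_agent, re.IGNORECASE):
                return False
        # Second check: extension and/or mimetype.
        # If neither list is configured, everything is considered to match.
        if not self.managed_extensions and not self.managed_mimetypes:
            return True
        return (extension in self.managed_extensions
                or mimetype in self.managed_mimetypes)


def select_manager(managers, user_agent, extension, mimetype):
    """Return the first manager that accepts; the last one is forced to."""
    if not managers:
        raise ValueError("at least one download manager must be active")
    for manager in managers[:-1]:
        if manager.accepts(user_agent, extension, mimetype):
            return manager
    # The last (bottom) manager is never allowed to refuse.
    return managers[-1]


managers = [
    DownloadManager("fancy", useragentregexp="mozilla"),
    DownloadManager("default"),
]
```

With this list, a request whose user agent contains "Mozilla" is assigned to "fancy", while a batch client such as wget falls through to "default".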
Typically the “default” download manager is last and is enabled. In its configuration typically useragentregexp, managedmimetypelist, and managedextensionlist are all left commented out.
If downloads are encouraged and anti-virus scanning is performed and user friendliness is paramount, the “fancy” download manager is also enabled. In its configuration typically useragentregexp is specified as mozilla (a value that matches most browsers and doesn't match most batch programs). Again managedmimetypelist and managedextensionlist may both be commented out (or if it's desired to use the “fancy” download manager only for certain types of files, one or the other of managedmimetypelist and managedextensionlist may be activated.)
Deciding what's an “upload” and what's not is pretty straightforward. If the HTTP operation is a POST, and its content length is at least 14, and its MIME type is not application/x-www-form-urlencoded, then the operation is taken to be a file upload. DansGuardian can then apply any restrictions specified by maxuploadsize in 'dansguardian.conf'.
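The three conditions above translate into a very small predicate. The following Python sketch is illustrative only; the function name `is_file_upload` is hypothetical, and stripping parameters such as "; boundary=..." from the Content-Type value before comparing is an assumption.

```python
def is_file_upload(method: str, content_length: int, content_type: str) -> bool:
    """Apply the three-part upload test described in the text."""
    # Assumption: ignore any parameters after the MIME type itself.
    mime = content_type.split(";", 1)[0].strip().lower()
    return (method.upper() == "POST"
            and content_length >= 14
            and mime != "application/x-www-form-urlencoded")
```

An ordinary form submission (application/x-www-form-urlencoded) is therefore never treated as an upload, no matter how large it is.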
There's no reliable way for a proxy (such as DansGuardian/Squid) to distinguish a “download” from a “non-download”. The proxy just sees the HTTP protocol, which indicates only that a file is being transferred. Whether the browser is going to store that file on disk or display it after it arrives is known only to the browser itself; it's not something DansGuardian can know. So why does DansGuardian have several “…download…” options? The options are named that way in order to provide a more intuitive interface to the user, who in many cases doesn't even realize that DansGuardian is “faking it”.
A crude heuristic does a reasonable enough job of separating “download” from “non-download”, and users often don't realize that no more exact separation is actually taking place. The two factors DansGuardian really uses to separate files are simply: i) file size, and ii) file type. For example if the file is fairly small (up to a few hundred kilobytes) and is HTML, it's a pretty good guess the operation is “non-download”. And if the file is quite large (at least a megabyte) and is .ZIP, it's a pretty good guess the operation is “download”.
(Note whether or not a file is anti-virus scanned has nothing to do with whether or not it's marked as a “download”.)
So, are lists/exceptionsitelist and lists/exceptionfilesitelist (or lists/exceptionextensionlist and lists/exceptionregexpurllist) really different, or are they just different ways of specifying the same thing?
It turns out they really are different, in the following two subtle ways:
- First, lists/exceptionsitelist applies to all transactions, whereas lists/exceptionfilesitelist applies only to files that are actually being transferred. So if you request a file that's already in the browser's cache, lists/exceptionsitelist will vet the transaction right away and allow it if the site is listed. Assuming the webserver then responds with “304 Not Modified” (or any other 3xx return code), the response will be processed without ever checking lists/exceptionfilesitelist. Only if the return code is “200 OK” (or any other successful return code) will lists/exceptionfilesitelist then be checked.
- And second, lists/exceptionregexpurllist skips all checks including “phrase scanning” checks, whereas lists/exceptionextensionlist skips only the first set of checks (so the file is downloaded), but not the second set of checks (“phrase scanning” occurs anyway). (You might think of lists/exceptionextensionlist as being more analogous to lists/grey… than lists/exception….) So there's a difference between “.css” in lists/exceptionextensionlist and “^[^?]*\.css(?:$|\?)” in lists/exceptionregexpurllist. The entry in lists/exceptionextensionlist ensures the file will always be downloaded, but does not skip the “phrase scanning” checks (which typically apply to all files whose mimetype begins “text/…”, including “text/css”). The entry in lists/exceptionregexpurllist skips all checks including “phrase scanning” (which in this case is likely necessary).
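To see what the regular expression “^[^?]*\.css(?:$|\?)” actually matches, it can be tried out in isolation. The sketch below uses Python's `re` module, which handles this particular pattern the same way a Perl-compatible engine would; the sample URLs and the helper name `is_css_url` are made up for illustration, and the exact form of the URL that DansGuardian compares against is not shown here.

```python
import re

# The pattern quoted in the text: ".css" must end the path portion,
# either at the end of the URL or immediately before the query string.
CSS_URL = re.compile(r"^[^?]*\.css(?:$|\?)")


def is_css_url(url: str) -> bool:
    """True if the URL's path (not its query string) ends in .css."""
    return CSS_URL.search(url) is not None
```

The `[^?]*` part keeps the match inside the path, so a ".css" that appears only in the query string (for example "page?file=a.css") is not treated as a stylesheet request.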
There are three steps where an …extensionlist/…mimetypelist set of files can have an effect. Some files go through all three stages.
Files lists/bannedextensionlist, lists/bannedmimetypelist, lists/exceptionextensionlist, and lists/exceptionmimetypelist work together to decide whether a request will be allowed further or will be immediately denied. (Much of the content of these files is “fixed” by common sense. If you for example “banned” text/html files [and didn't except them], you could make web browsing completely impossible.)
Any type of file banned at this stage is banned completely: the ban applies to all transfers, not just downloads. Banning types of files that can only be found in a download can be very useful. But banning types of files that can sometimes be part of normal web browsing can have wide and unpleasant results (even though you have no anti-virus scanning and no user does anything they think of as a “download”).
Note these exception… files allow the request to proceed but do not disable phrase scanning. They're more analogous to 'greysitelist' and 'greyurllist' than to 'exceptionsitelist' and 'exceptionurllist'.
Every request is assigned a Download Manager (if the file is short the assigned Download Manager will never actually do anything and you'll never see it). The process of deciding which files will be handled by which Download Manager may be partly controlled by lists/downloadmanagers/managedextensionlist and lists/downloadmanagers/managedmimetypelist. (Sometimes only one file or the other will be used. And if neither file is activated, all requests are considered to match that Download Manager.)
If the file being transferred may be subject to anti-virus scanning, the files lists/contentscanners/exceptionvirusextensionlist, lists/contentscanners/exceptionvirusmimetypelist, lists/contentscanners/exceptionvirussitelist, and lists/contentscanners/exceptionvirusurllist (modified by the setting of blockdownloads in dansguardianfN.conf) are consulted to see if the file should actually be anti-virus scanned or should be passed without scanning.
The initialtrickledelay option controls when the selected download manager will first make its presence known (perhaps by displaying a progress bar, or perhaps by sending a byte of data, or perhaps by sending a fake HTTP header). The value's effect is most obvious if the “fancy” download manager is being used, although it has less visible effects with the other download managers too. If the value is too short, users will sometimes see evidence of a download manager even during regular non-download web browsing. If the value is too long, users may mistakenly conclude their browser is “hung” when they don't see anything happening.
The trickledelay kicks in as a periodic timer after initialtrickledelay. With the “fancy” download manager, it controls how often the progress bar is updated and hence is quite visible to users. With the “trickle” download manager, it controls how often the next byte of data is sent to the browser. With the “default” download manager, it controls how often a fake “X-DGKeepAlive: on” HTTP header is sent to the browser. Measure the time taken for a typical download, divide by ten, use that for the starting value of trickledelay, then tune the value as necessary. If the value is too short, local network bandwidth will be wasted with extraneous updates. If the value is too long, the progress bar will be overly jerky and you may experience problems with browser timeouts.
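The interaction of the two timers, and the "divide by ten" rule of thumb, can be illustrated with a little arithmetic. This Python sketch is a simplified model, not DansGuardian's implementation: both function names are hypothetical, rounding to whole seconds with a minimum of 1 is an assumption, and so is the idea that activity occurs exactly at initialtrickledelay and then every trickledelay seconds.

```python
def starting_trickledelay(typical_download_seconds: float) -> int:
    """Rule of thumb from the text: typical download time divided by ten."""
    # Assumption: use whole seconds and never go below 1.
    return max(1, round(typical_download_seconds / 10))


def activity_times(initialtrickledelay: int, trickledelay: int,
                   transfer_seconds: float) -> list:
    """Seconds after the start at which the download manager would act."""
    if trickledelay <= 0:
        raise ValueError("trickledelay must be positive")
    times = []
    t = initialtrickledelay
    # Nothing happens at all if the transfer finishes first.
    while t < transfer_seconds:
        times.append(t)
        t += trickledelay
    return times
```

For a typical download of 100 seconds the starting trickledelay would be 10. With initialtrickledelay at 20 and trickledelay at 10, a 55-second transfer would see activity at 20, 30, 40, and 50 seconds, while a 15-second transfer would see none.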
Almost all …extensionlist and …mimetypelist files effectively come in pairs. (The main exception is downloadmanagers/managedextensionlist and downloadmanagers/managedmimetypelist. Often only one of these two files will exist and be used.)
As a general rule, if either the extension or the mimetype match, the whole transaction is considered to be a match. There's no requirement that both extension and mimetype match, and there's no complex handling of conflicting specifications. Just the single simple logical connective “or” completely describes the behavior of these pairs of list files.
Often only one or the other piece of information is available; DansGuardian often has no choice but to check only the one that exists. For requests from the browser, only the extension (in the path from the HTTP request) is available for checking. For responses from the website, often only the mimetype (which is included in the HTTP header) is available for checking. (In some cases a response will include an HTTP Content-Disposition: header which includes a filename.) After a request and its corresponding response are matched up, both an extension (from the response Content-Disposition: if available, otherwise from the request path in most cases) and a mimetype are available for checking.
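The "or" rule for list pairs, including the case where only one piece of information is available, fits in a few lines. This is a hedged Python sketch; the function `pair_matches` is hypothetical, and lower-casing the values before comparison is an assumption about how the lists are written.

```python
from typing import Optional


def pair_matches(extension: Optional[str], mimetype: Optional[str],
                 extension_list: set, mimetype_list: set) -> bool:
    """Match if EITHER the extension OR the mimetype is listed."""
    # A piece of information that isn't available simply can't match.
    ext_hit = extension is not None and extension.lower() in extension_list
    mime_hit = mimetype is not None and mimetype.lower() in mimetype_list
    return ext_hit or mime_hit
```

Note that a ".zip" extension matches even when the server labels the response "text/html"; there is no attempt to resolve the conflict.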
The request file name extension does not always accurately describe what will be returned, especially if the “filename” isn't really the name of a file but rather of a program. Sometimes there's no extension at all. An example is Google, which appears to request a file named simply “search” with no extension. Sometimes the extension doesn't exactly match what will actually be returned. An example is foobar.php, which identifies a program that typically returns HTML.
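The following Python sketch shows one way an extension could be derived, preferring a Content-Disposition filename and otherwise falling back to the request path. It uses only the standard library; the function names and example URLs are hypothetical, and DansGuardian's own parsing may differ in detail.

```python
import posixpath
from email.message import Message
from typing import Optional
from urllib.parse import urlsplit


def request_extension(url: str) -> str:
    """Extension of the request path, ignoring any query string."""
    path = urlsplit(url).path
    return posixpath.splitext(path)[1].lower()


def effective_extension(url: str,
                        content_disposition: Optional[str] = None) -> str:
    """Prefer the Content-Disposition filename, else use the request path."""
    if content_disposition:
        msg = Message()
        msg["Content-Disposition"] = content_disposition
        filename = msg.get_filename()
        if filename:
            return posixpath.splitext(filename)[1].lower()
    return request_extension(url)
```

A search URL whose path is just "/search" yields an empty extension, and "foobar.php" yields ".php" even though the response will typically be HTML.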
To minimize confusion, it's usually best to keep pairs of …extensionlist and …mimetypelist files pretty much in sync. (You might be able to handle some situations specially with intentional differences between …extensionlist files and …mimetypelist files. But on the other hand it's very easy to shoot yourself in the foot doing this.)
(Keeping list pairs in sync can be easier said than done. There's not necessarily a one-to-one correspondence between extensions and mimetypes. For example the extensions “.htm”, “.html”, and “.shtml” all correspond to the same mimetype: “text/html”. And for an example the other way, the mimetypes “application/zip” and “application/x-zip” (and perhaps others) all correspond to the same extension: “.zip”. Sometimes the correspondence between an extension and a mimetype is not obvious. For example, given the mimetype “application/x-compressed”, would you know the corresponding extension was probably “.tgz”? Sometimes different websites use different mimetypes for similar files. For example some websites will use “application/pdf” for the same file that other websites label with the mimetype “text/pdf”. And websites occasionally don't assign any mimetype at all to some files.)
If the same thing is included in both a banned… extension or mimetype file and an exception… extension or mimetype file, the exception… listing takes precedence and the transaction is allowed. There doesn't seem to be any good reason for including something in both files though; in fact even though such dual listing doesn't confuse DansGuardian, it may very well confuse you.
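The precedence rule can be expressed as a small predicate. This Python sketch only models the behavior described in this section (exception beats banned, with extension and mimetype combined by "or"); the function name and parameters are hypothetical.

```python
def is_denied(extension: str, mimetype: str,
              banned_ext: set, banned_mime: set,
              exception_ext: set, exception_mime: set) -> bool:
    """Deny only if banned and not also listed as an exception."""
    banned = extension in banned_ext or mimetype in banned_mime
    excepted = extension in exception_ext or mimetype in exception_mime
    # The exception listing takes precedence over the banned listing.
    return banned and not excepted
```

So an ".exe" that appears in both the banned and exception extension lists is allowed, which is exactly the kind of dual listing that is more likely to confuse the administrator than DansGuardian.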
Each Download Manager allows you to specify a useragentregexp = '…' to choose which browsers or client programs might use that particular Download Manager. Every web client program –including all browsers– has a “user agent string”; if the specified string occurs anywhere (the test is not case sensitive) within the user agent string, it's considered to be a match. It happens for odd reasons that the string “mozilla” will match almost all browsers. So specifying useragentregexp = 'mozilla' is likely a simple way to select all interactive client programs (i.e. browsers).
In some cases it may be necessary to specify a useragentregexp that matches several different client programs, which is pretty easy to do. For each client program, find out its “user agent string”, then select one fairly unique word out of it. Combine all the words (in all lower case) with the “pipe” (also called “or” and “vertical bar”) symbol (but no spaces), like this: useragentregexp = 'foobar|anotherword|thirdword'
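A useragentregexp value can be checked against real user agent strings before it goes into the configuration. The Python sketch below mimics the described behavior (a case-insensitive search anywhere in the string); the helper name `useragent_matches` and the sample user agent strings are illustrative only.

```python
import re


def useragent_matches(useragentregexp: str, user_agent: str) -> bool:
    """Case-insensitive search for the pattern anywhere in the string."""
    return re.search(useragentregexp, user_agent, re.IGNORECASE) is not None
```

For example, 'mozilla' matches "Mozilla/5.0 (Windows NT 10.0; Win64; x64)" but not "Wget/1.21.3", and 'foobar|anotherword|thirdword' matches any user agent string containing at least one of the three words.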