DansGuardian Documentation Wiki


Words calculated to catch everyone may catch no one.
Adlai E. Stevenson Jr.

Language and Encoding Effects on Phrase Matching

Note: This page describes DansGuardian as it actually is (not necessarily as it could be). After reading it, you might conclude that:

  • DansGuardian and its defaults and phraselists tend to focus on the Roman alphabet, on very regularly spelled languages such as English, and on simple 8-bit Latin1-style character encodings
  • nevertheless DansGuardian can also be made to work with virtually any alphabet and language and character encoding (especially if one changes some of the configuration defaults and phraselists)
  • configuring DansGuardian to handle 16-bit encodings and non-Roman alphabets well can sometimes be a bit awkward

Overview of How It Works

Web Page Encoding

Sometimes a web page uses the HTTP content-language: header to advertise what language it's in. Sometimes the web page uses the equivalent HTML <meta> tag to advertise what language it's in. Sometimes the agent (browser) can figure out what language a web page is in by examining it closely. And all too often it's just plain not known what language the web page is in. Because nothing is ever certain (except death and taxes :-), DansGuardian doesn't even try to find out what language the web page is in. In effect DansGuardian never knows what language a web page is in, and instead behaves in a way that makes the language irrelevant.

Sometimes a web page uses the charset option of the HTTP content-type: header to advertise what character encoding it's in. Sometimes the web page uses the equivalent HTML <meta> tag to advertise what character encoding it's in. Sometimes the agent (browser) can figure out what character encoding a web page is in by examining it closely. And all too often it's just plain not known what character encoding the web page is in. Because nothing is ever certain (except death and taxes :-), DansGuardian doesn't even try to find out what character encoding the web page is in. In effect DansGuardian never knows what character encoding a web page is in, and instead behaves in a way that makes the character encoding irrelevant.

Phrase List Encoding

Each phrase list file is potentially stored using a different character encoding. To put the same thing another way, there's no requirement that all phrase list files be stored in the same character encoding.

DansGuardian does not know what character encoding has been used to store a phrase list file. In fact, you may not know either (unless you've added an accurate comment). There are no required marks in the file that identify which character encoding it uses. Two files can look identical on screen and yet use different character encodings (for example, after one of them has been run through a command that converts it to a different character encoding).

And the method of finding out what character encoding a file uses is different on different systems. Your system might provide either an encoding or a charset command. Or you may be expected to misuse either the iconv or the jv-convert command and glean the character encoding information from its error messages. Or various text editors may tell you the character encoding used by a file. Or you may try to interpret the low-level output of the od command. Or there may be some other method for you to determine the character encoding used by a file.
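As one more "other method", the following is a minimal Python sketch that tries a handful of candidate encodings and reports which ones decode the bytes without error. The function name plausible_encodings and the candidate list are assumptions made for illustration. The result only narrows the possibilities: several encodings can accept the same bytes, and a Latin1-style encoding accepts any byte sequence at all, which is why it is left out of the list.

```python
# Candidate encodings to try. This list is an assumption for illustration;
# adjust it to the encodings you actually expect to meet.
CANDIDATES = ["ascii", "utf-8", "shift_jis", "euc_jp", "big5"]


def plausible_encodings(data, candidates=CANDIDATES):
    """Return the candidate encodings under which data decodes without error."""
    accepted = []
    for name in candidates:
        try:
            data.decode(name)  # strict decoding raises on invalid byte sequences
        except UnicodeDecodeError:
            continue
        accepted.append(name)
    return accepted


# Three sample phrase lines; "caf\u00e9" is the word "cafe" with an accented e.
sample_ascii = b"< bad >,< phrase ><50>\n"
sample_utf8 = "<caf\u00e9><50>\n".encode("utf-8")
sample_latin1 = "<caf\u00e9><50>\n".encode("latin-1")

for label, sample in [("ascii", sample_ascii),
                      ("utf-8", sample_utf8),
                      ("latin-1", sample_latin1)]:
    print(label, "sample could be:", plausible_encodings(sample))
```

A file that is rejected by the UTF-8 decoder is very unlikely to be UTF-8, but a file that is accepted by several decoders still needs a human to look at it.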

(Phrases containing only those characters with a value less than 128 –in other words characters that are part of the ASCII standard– will match both Latin1-style and UTF-8 encodings without needing to be specified twice. This is because the first 128 characters generate exactly the same raw byte sequence in both encodings.)
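The point about the first 128 characters can be checked directly. This small Python sketch encodes one plain phrase and one accented phrase both ways and compares the raw bytes; the two sample words are arbitrary choices for illustration.

```python
ascii_phrase = "foobar"
accented_phrase = "caf\u00e9"  # "cafe" with an accented final e

# A phrase made only of ASCII characters: identical bytes in both encodings.
same_for_ascii = (ascii_phrase.encode("latin-1")
                  == ascii_phrase.encode("utf-8"))

# A phrase with an accented character: the byte sequences differ.
same_for_accented = (accented_phrase.encode("latin-1")
                     == accented_phrase.encode("utf-8"))

print(ascii_phrase.encode("latin-1"), ascii_phrase.encode("utf-8"))
print(accented_phrase.encode("latin-1"), accented_phrase.encode("utf-8"))
print(same_for_ascii, same_for_accented)
```

The accented letter is one byte in Latin1 and two bytes in UTF-8, so a phrase containing it cannot match pages in both encodings unless it is listed once in each.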

Comparison

As DansGuardian doesn't know either the language or the character encoding of either the web page or the phrase list file, it simply tests everything against everything as sequences of raw bytes (octets). The results of this brute force procedure turn out to be pretty reasonable. Although nothing forces it to turn out this way, experience is that usually only 8-bit encoded phrases match 8-bit encoded web pages and only 16-bit encoded phrases match 16-bit encoded web pages. There are occasional false matches, generally of either some 8-bit encoded phrase against a 16-bit encoded web page, or a 16-bit encoded phrase against an 8-bit encoded web page. But such false matches only happen once in a while and are usually not much of a problem.
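To see how an occasional false match can arise, here is a contrived Python sketch. It builds a "page" of two CJK characters in big-endian UTF-16 whose raw bytes happen to spell an ASCII phrase. The phrase abcd and the two characters were picked purely to force the coincidence; real false matches are much rarer than this example suggests.

```python
# An 8-bit (ASCII) phrase, as the raw bytes a phrase list would contain.
phrase = b"abcd"

# Two CJK characters that have nothing to do with the phrase.
page_text = "\u6162\u6364"

# Stored as big-endian UTF-16, each character becomes two bytes:
# U+6162 -> 0x61 0x62 and U+6364 -> 0x63 0x64.
page_bytes = page_text.encode("utf-16-be")

# A raw byte comparison cannot tell the difference.
false_match = phrase in page_bytes
print(page_bytes, false_match)
```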

Perhaps the most important corollary of how comparisons are done is that a phrase in one family of encodings will probably not match anywhere in a web page using a different encoding family. For example <foobar><50> in the ISO-8859-1 encoding will not match anything in a web page using the UTF-16 encoding, even though the words look the same when the web page is viewed in a browser. If you want a phrase to match regardless of which encoding a web page may use, you will need to specify the phrase more than once.
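The foobar case can be reproduced with a few lines of Python. This is only a sketch of raw byte comparison, not DansGuardian's code, and it assumes the little-endian flavour of UTF-16 for the page.

```python
html = "<p>foobar</p>"

# The same visible page, stored two different ways.
page_latin1 = html.encode("iso-8859-1")
page_utf16 = html.encode("utf-16-le")

# The same visible phrase, stored two different ways.
phrase_latin1 = "foobar".encode("iso-8859-1")
phrase_utf16 = "foobar".encode("utf-16-le")

# Raw byte comparison: a phrase only matches a page in its own encoding family.
results = {
    "latin1 phrase / latin1 page": phrase_latin1 in page_latin1,
    "latin1 phrase / utf16 page": phrase_latin1 in page_utf16,
    "utf16 phrase / utf16 page": phrase_utf16 in page_utf16,
    "utf16 phrase / latin1 page": phrase_utf16 in page_latin1,
}
for case, matched in results.items():
    print(case, "->", matched)
```

In UTF-16 every letter of foobar is followed by a zero byte, so the six consecutive bytes of the ISO-8859-1 phrase never appear in the page.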

DansGuardian's Encoding

If you've followed all the above, you will have understood that DansGuardian has neither a native nor a preferred character encoding.

Repeating an important point of this document one more time: If you want the same phrase to match web pages in different character encodings (especially if the phrase contains accented/special characters), you may need to specify the phrase more than once. Specifying a phrase just once will not necessarily match all web pages no matter what character encoding each page uses.

Editing Phraselists

To Edit Or Not To Edit (apologies to Hamlet)

While adding your own words isn't terribly difficult, it can be quite error-prone and time-consuming and incomplete (not to mention an unnecessary duplication of effort) and is usually considered an “advanced” administrative function. It's probably better to make changes simply by commenting out (insert a '#' in the first column) or uncommenting (delete the existing '#' in the first column) existing lines in 'weightedphraselists'. Those lines refer to pre-set collections of words distributed as "phraselists" files.

At least for starters, you're probably better off simply turning whole existing categories of restricted words on or off rather than trying to add your own individual words. If you really want to edit your phraselists, read and understand Phraselist configuration and Editing as well as this document.

Controlling Phrase List File Encoding

Some text editors let you specify what character encoding to use when you “save” the file. If your text editor is one of these, this may be the simplest way to control the character encoding of phrase list files.

Other text editors insist on storing a revised file using the same character encoding as the original file used. In this case the easiest way to control the character encoding of a file may be to start with a copy of some other file that already uses the desired character encoding and, if necessary, replace all its contents.

A third way to control the character encoding of a file is to first edit the file in your system's preferred character encoding, then “convert” it to a different encoding using a tool such as iconv.
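If iconv is not at hand, the same edit-then-convert step can be sketched in Python. The helper name reencode_file and the file names below are made up for illustration; the sketch works on a temporary directory so it touches no real phrase lists.

```python
import tempfile
from pathlib import Path


def reencode_file(src, dst, src_encoding, dst_encoding):
    """Read src as src_encoding and write the same text to dst as dst_encoding."""
    # Bytes are read and written directly so that line endings are left alone.
    text = src.read_bytes().decode(src_encoding)
    dst.write_bytes(text.encode(dst_encoding))


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "weighted_example_utf-8"         # hypothetical file name
    dst = Path(tmp) / "weighted_example_iso-8859-1"    # hypothetical file name

    # A phrase line edited in UTF-8; "caf\u00e9" has an accented final e.
    src.write_bytes("<caf\u00e9><50>\n".encode("utf-8"))

    reencode_file(src, dst, "utf-8", "iso-8859-1")
    original = src.read_bytes()
    converted = dst.read_bytes()

print(original)
print(converted)
```

The conversion raises an error if the text contains a character the target encoding cannot represent, which is usually what you want: a silently damaged phrase would simply never match.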

(It's even possible to create different lines with different character encodings and then use Linux text processing applications to splice them together into a single file. If doing this, especially beware of “cut & paste”, as sometimes a GUI in an attempt to be helpful will silently convert the clipboard to a different character encoding.

Distributed phraselists sometimes use the splicing technique. But doing so is not recommended for your own use, both because it can be even more confusing and because it can be quite difficult to edit further, so it effectively assumes the phrase list file will never change. Things will work better for you if –as many software tools assume– each complete file uses just one character encoding.)

Why Multiple Encodings

If you want the same phrase to match web pages in different character encodings, you may need to specify the phrase more than once.

In many cases this never comes up, because in many cases all web pages in the particular language of interest are also in just one character encoding. There are some exceptions though. For example Japanese web pages may use various character encodings (Shift_JIS, EUC-JP, UTF-8, etc.) even though they're all in one language. For another example occasionally an English language web page will use the UTF-16 (Unicode with mostly 16-bit characters) character encoding.

A simplification that helps a lot is that in many cases Unicode is a superset of other encodings. So a phrase encoded once in UTF-8 and a second time in UTF-16 may match almost all web pages.

How Multiple Encodings

It's best to have only one character encoding in a file. So if for example you want to specify your Japanese phrases in three different encodings (say for purposes of this example Shift_JIS, UTF-8, and EUC-JP), you may wish to create some additional files and link them together with .Include's as follows:

# general/weighted_japanese

.Include</etc/dansguardian/lists/phraselists/general/weighted_japanese_shift_jis>
.Include</etc/dansguardian/lists/phraselists/general/weighted_japanese_utf-8>
.Include</etc/dansguardian/lists/phraselists/general/weighted_japanese_euc-jp>

# general/weighted_japanese_shift_jis

< bad >,< phrase ><50>
< another >,< baddie ><50>

# general/weighted_japanese_utf-8

< bad >,< phrase ><50>
< another >,< baddie ><50>

# general/weighted_japanese_euc-jp

< bad >,< phrase ><50>
< another >,< baddie ><50>
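Keeping three hand-edited copies in step is tedious, so one option is to maintain a single master list and generate the per-encoding files from it. The Python sketch below does this in a temporary directory. The function write_variants, the placeholder phrase and the idea of a master list are assumptions for illustration, not something DansGuardian provides.

```python
import tempfile
from pathlib import Path

# Suffix used in the file name -> Python codec name.
ENCODINGS = {"shift_jis": "shift_jis", "utf-8": "utf-8", "euc-jp": "euc_jp"}


def write_variants(master_text, out_dir, base_name, install_dir):
    """Write one phrase list per encoding, plus an index file of .Include lines."""
    include_lines = []
    for suffix, codec in ENCODINGS.items():
        name = f"{base_name}_{suffix}"
        (out_dir / name).write_bytes(master_text.encode(codec))
        include_lines.append(f".Include<{install_dir}/{name}>")
    index = "\n".join(include_lines) + "\n"
    (out_dir / base_name).write_bytes(index.encode("ascii"))
    return include_lines


# Placeholder phrase: the katakana word for "test", written with escapes.
master = "<\u30c6\u30b9\u30c8><50>\n"

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    includes = write_variants(master, out, "weighted_japanese",
                              "/etc/dansguardian/lists/phraselists/general")
    variants = {suffix: (out / f"weighted_japanese_{suffix}").read_bytes()
                for suffix in ENCODINGS}
    index_text = (out / "weighted_japanese").read_text(encoding="ascii")

print(index_text)
for suffix, data in variants.items():
    print(suffix, data)
```

Each generated file holds exactly one encoding, which matches the advice above, and the three byte sequences for the same visible phrase are all different.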

Upper Or Lower Case

Web Page Case

In order to implement case-insensitive comparisons, web pages are generally translated to all lower case letters before comparisons are performed. Rather than subcontracting this translation to the tolower() function in whatever version of the dynamic libc is installed on your system, DansGuardian does this conversion itself. The conversion is currently done the same way regardless of the web page being processed and regardless of the locale of the DansGuardian system, and is most appropriate for a Latin1-like encoding.

(If the system tolower() function were used instead, the behavior of different DansGuardian systems might be different. Specifically, the behavior might be affected by i) the version of libc or its equivalent, ii) the settings of environment variables such as LANG and LC_ALL [and maybe also LANGUAGE] and iii) the global system language and encoding preferences. Even if the system tolower() function were used for this purpose, the behavior would not be affected by the character encoding a web page actually used.)

Results for web pages with encodings very different from Latin1-style may in some cases be somewhat dubious. Web pages in languages that do not have a clear definition of upper and lower case (Hebrew and Arabic come to mind), and web pages in 16-bit encodings (Chinese comes to mind) are the most likely to be handled incorrectly.
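Why a Latin1-style conversion can go wrong on a multi-byte encoding is easiest to see at the byte level. The Python sketch below uses bytes.lower(), which changes only the ASCII letters A to Z, as a stand-in for a byte-wise case conversion; it is an assumption for illustration and not DansGuardian's actual routine. The two byte pairs lie in the range Big5 uses for two-byte characters, where the second byte may fall among the ASCII letters.

```python
def ascii_lower(data):
    """Stand-in for a byte-wise case conversion (assumption, see text above)."""
    # bytes.lower() changes only the bytes 0x41-0x5A ('A'-'Z').
    return data.lower()


# Two different two-byte sequences in the Big5 range.
char_a = bytes([0xA4, 0x41])  # second byte happens to equal ASCII 'A'
char_b = bytes([0xA4, 0x61])  # second byte happens to equal ASCII 'a'

page = b"<p>" + char_a + b"</p>"
lowered_page = ascii_lower(page)

# 1. Two distinct characters collapse into the same bytes.
collapsed = ascii_lower(char_a) == char_b

# 2. A phrase kept unconverted no longer matches the converted page.
unconverted_phrase_matches = char_a in lowered_page

# 3. A phrase for the *other* character now matches by mistake.
wrong_character_matches = char_b in lowered_page

print(collapsed, unconverted_phrase_matches, wrong_character_matches)
```

The conversion has no way of knowing that the byte it changed was the second half of a character rather than a capital letter.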

If case conversion of web pages is causing problems for you, you can control it with the preservecase option in dansguardian.conf. The default of preservecase=0 works correctly in most cases. But if you experience problems and suspect case conversion because for example most of your web pages use the Big5 character encoding, try setting preservecase=1 to completely skip web page case conversion. If you have adequate CPU power and are not overly plagued by false phrase matches, try setting preservecase=2 to perform all comparisons both ways, once before doing any case conversion on the web page and a second time after.
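The three preservecase settings can be modelled in a few lines of Python. This is a sketch of the behaviour described above, with bytes.lower() standing in for the case conversion and with the function names invented for illustration; it is not taken from the DansGuardian source.

```python
def page_variants(page, preservecase):
    """Return the versions of the page that phrases are compared against."""
    if preservecase == 0:
        return [page.lower()]           # convert, then compare (the default)
    if preservecase == 1:
        return [page]                   # skip conversion entirely
    if preservecase == 2:
        return [page, page.lower()]     # compare both ways
    raise ValueError("preservecase must be 0, 1 or 2")


def phrase_matches(phrase, page, preservecase):
    """True if the raw phrase bytes occur in any compared version of the page."""
    return any(phrase in variant
               for variant in page_variants(page, preservecase))


page = b"<p>FOOBAR</p>"
lower_phrase = b"foobar"    # a phrase as stored after conversion to lower case
upper_phrase = b"FOOBAR"    # a phrase kept as typed, for example with #noconvert

for mode in (0, 1, 2):
    print(mode,
          phrase_matches(lower_phrase, page, mode),
          phrase_matches(upper_phrase, page, mode))
```

Setting 2 matches in every case the other two settings do, which is also why it costs more CPU time and can raise the number of false matches.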

Phrase List Case

Usually the contents of all phrase list files are converted to all lower case during DansGuardian initialization. Even though it thus isn't strictly necessary in most cases, you should generally use only lower case letters to specify phrases. (This conversion to lower case during initialization applies specifically to weighted, banned, and exception phrases. It does not apply to regular expressions, which handle case insensitivity a different way.)

DansGuardian may either use the same fixed conversion as is used for web pages, or it may subcontract this translation to the tolower() function in whatever version of the dynamic libc is installed on your system (as affected by whatever encoding is specified in the environment of the DansGuardian process). To obtain consistent and fully correct results even when this conversion is subcontracted to the tolower() function, some Latin1-style encoding (ISO8859···, etc.) should be specified in the environment of the DansGuardian process.

If this case conversion of phrases is causing problems for you (for example you're storing your phrases in Big5 encoding and they are not being converted to lower case correctly), you can turn it off for each phrase file independently. Add #noconvert (exactly one sharp [not zero, not two] in the first column with no following space) near the top of each phrase list file you do not wish case conversion to be applied to.

Phrase Matching

The handling of web page case and the handling of phrase list case generally work together correctly. The phrase <Foobar> will be converted to <foobar>, and a web page containing foobar Foobar FOOBAR will be converted to foobar foobar foobar. This will result in all three words matching; the net result of these case conversions is the desired case-insensitive phrase comparisons.

Phrases In The Log

Matching phrases are typically included in the DansGuardian access.log. Matching phrases are passed to a system print function without any indication of what character encoding they use. Depending on what character encoding the matching phrase is in and how the system is configured and how well the system print function guesses, the phrase words may not be interpreted sensibly when they are entered into or extracted from access.log. In other words, sometimes some of the matching phrases in access.log will appear to be nothing more than gibberish (and furthermore may appear differently on different systems). This can be inconvenient for the system administrator. But fortunately problems in logging do not affect the behavior of DansGuardian at all. Even though some log entries may appear to be partially nonsense, DansGuardian nevertheless did the right thing in those cases.
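The gibberish is the familiar effect of reading bytes under the wrong encoding. A short Python sketch, using an arbitrary accented word as the logged phrase:

```python
# The bytes of a matching phrase as they might be written to the log (UTF-8).
logged_bytes = "caf\u00e9".encode("utf-8")

# Viewed by a tool that assumes UTF-8: the phrase reads correctly.
shown_as_utf8 = logged_bytes.decode("utf-8")

# Viewed by a tool that assumes a Latin1-style encoding: two odd characters
# appear where the single accented letter should be.
shown_as_latin1 = logged_bytes.decode("latin-1")

print(shown_as_utf8)
print(shown_as_latin1)
```

The bytes in the log are the same in both cases; only the viewer's assumption differs, which is why the filtering itself is unaffected.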

Particular Problems

In theory DansGuardian can filter any language. It does not do any sort of language or encoding detection; it simply compares raw bytes. But in practice some languages are more difficult to filter than others. Known problematic cases include:

  • Phrase filtering of languages that typically use a 16-bit encoding (such as Chinese in “Big-5”) sometimes causes a lot of false matches against English content. Several techniques are typically used to minimize these problems:
    • In networks where users can only read languages written in the Roman alphabet, consider not enabling phrase filtering of other languages that typically use a 16-bit encoding. (Note though that this may allow web pages containing naughty images embedded in incomprehensible text to slip through the phrase filter. One way to avoid this is to ensure naughty sites that use other languages are listed in bannedsitelist.)
    • In any case, try to enable phrase filtering for no more than one language that typically uses 16-bit encodings.
  • Phrases in a language that typically uses a 16-bit encoding (such as Chinese in “Big-5”) are sometimes not recognized at all even though they have been entered correctly. These techniques –perhaps used in combination– should improve these sorts of problems:
    • Add #noconvert (exactly one sharp [not zero, not two] in the first column with no following space) near the top of each phrase list file containing phrases that are being mis-identified (or not being identified at all). You might also need to then specify the phrases multiple times with different explicit capitalizations, for example <foobar><50> / <Foobar><50> / <FooBar><50> / <FOOBAR><50>.
    • In dansguardian.conf specify preservecase = 2. (This may degrade performance and/or increase the false positives rate.)
    • Specify a Latin1-style encoding in the environment where DansGuardian runs. For example modify /etc/init.d/dansguardian so the program to be started is specified as something like LANG=en_US.iso885915 /sbin/dansguardian.
  • Phrase filtering of languages that do not have simple upper-case/lower-case conventions can be particularly difficult. You may be able to get it to work anyway by manipulating the preservecase setting in dansguardian.conf. On the other hand you may conclude that phrase filtering of languages such as Arabic is so difficult that it isn't worth attempting.
  • The current (Dec 2009) phraselists arguably do not cover Hindi words thoroughly, so DansGuardian may require extensive additional configuration for use in India. Rather than expending considerable effort to build Indian phraselists pretty much from scratch, try to use as your starting point phraselists from some other Indian installation of DansGuardian.
  • Phrases containing accented/special characters may need to be specified twice to match against both web pages using a Latin1-style encoding and web pages using the UTF-8 encoding. Specify the phrase the first time in a phrase file using the first encoding, and the second time in a phrase file using the second encoding. (Phrases containing only simple characters need only be specified once. This is because the first 128 characters in a Latin1-style encoding and the UTF-8 encoding generate the same raw byte sequence.)
  • Even after you do everything right, phrases containing accented/special characters may fail to match pages that use the UTF-8 encoding. Either of these workarounds may solve the problem:
    • Specify a Latin1-style encoding in the environment where DansGuardian runs. (Don't worry that you'll be restricting filtering to only web pages in a Latin1-style encoding - you won't - there's no direct relation between the encoding used by DansGuardian and the encoding used by a web page.) For example modify /etc/init.d/dansguardian so the program to be started is specified as something like LANG=en_US.iso885915 /sbin/dansguardian.
    • In dansguardian.conf specify preservecase = 2. (In addition to solving the problem, doing this may degrade performance and/or increase the false positives rate.)