DansGuardian Documentation Wiki

Phraselists

Phraselists are very easy to edit. Just open the phrase list files in your favourite editor and follow the examples below. (Note though that creating phrase lists for foreign languages can be somewhat arcane. Before attempting to create phrase lists for foreign languages, you should thoroughly understand the material in Language and Encoding Effects on Phrase Matching.)

Even if you're not a programmer, you can help the DansGuardian project by contributing changes to existing phraselists and/or new lists of phrases. Send new lists or changes by email to: phrasemaster {at} dansguardian {dot} org

How the Phrases Work

There are two different types of phrase lists that can be used with DansGuardian: banned phrases and weighted phrases. In many cases you should restrict all your activity to 'weightedphraselist' (and the files it .Include's) and not touch 'bannedphraselist' (and the files it .Include's) at all.

Banned Phrases

These phrases cause a web page to be blocked immediately with no possibility of mitigation whenever any banned phrase is found on the page. For example if you put the word sex in the banned phrase file, then pages containing that word will be blocked even if the page contains other legitimate information. For this reason, banned phrases should be used very sparingly.

Changes here are quite risky, because even a very small change can easily have an unintended consequence that poisons your entire system.

Weighted Phrases

These phrases are each assigned a point score, and DansGuardian adds up the scores of all matching phrases to give a total score for the web page being accessed. The total score required for DansGuardian to block the page is configured in dansguardian.conf (DG 2.8 and earlier) or in dansguardianf(x).conf (DG 2.9 and later). This system allows pages to be blocked much more accurately than with banned phrases alone.

For example, a web site containing the phrases sex and education can be allowed through the filter, because the phrase education is assigned a negative score that offsets the score added by the phrase sex.

Weighted Phrase Examples

Here are some examples of phrases and what the effect will be.

<slut><10> - Adds 10 to the count when the string 'slut' appears anywhere, including inside other words, e.g. sluts, slut!, abslutxyz.

< slut ><10> - Adds 10 to the count only when 'slut' appears as a separate word, e.g. Sally is a slut that smells.

<slut>,<horny><50> - Adds 50 to the count when the strings 'slut' and 'horny' are found on the same page.

<breast>,<medical><-30> - Subtracts 30 from the count when 'breast' and 'medical' are on the one page.

<education><-25> - Subtracts 25 from the count when 'education' is on the page.
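
The examples above can be read as a small scoring procedure. The following Python sketch illustrates the idea only; it is not DansGuardian's actual implementation. The names `parse_weighted` and `score_page` are hypothetical, each entry is counted at most once per page, and matching is a plain case-insensitive substring test. All of these are simplifying assumptions.

```python
import re

# One or more comma-separated <phrase> groups followed directly by a <weight>.
ENTRY_RE = re.compile(r"^((?:<[^<>]*>,)*<[^<>]*>)<(-?\d+)>$")


def parse_weighted(line):
    """Return (phrases, weight) for a weighted entry, or None if it does not parse."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    phrases = re.findall(r"<([^<>]*)>", match.group(1))
    return phrases, int(match.group(2))


def score_page(text, entries):
    """Sum the weights of every entry whose phrases all occur in the text."""
    haystack = text.lower()
    total = 0
    for line in entries:
        parsed = parse_weighted(line)
        if parsed is None:
            continue  # skip comments, blank lines, and malformed entries
        phrases, weight = parsed
        # Assumption: an entry fires once if all of its phrases are present.
        if all(phrase.lower() in haystack for phrase in phrases):
            total += weight
    return total


entries = [
    "<slut><10>",
    "< slut ><10>",
    "<slut>,<horny><50>",
    "<breast>,<medical><-30>",
    "<education><-25>",
]

print(score_page("Sally is a slut that smells.", entries))                # 20
print(score_page("Breast cancer: medical education resources", entries))  # -55
```

In the first call both the substring entry and the whole-word entry match, giving 10 + 10. In the second, the combination entry and the single negative entry both apply, giving -30 + -25.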

Enabling and Adding Phraselists

You can enable new or existing phraselists by adding the phraselist's path to your weightedphraselist file. You can disable one of the lists by adding a # character to the front of the line. See below for an example.

## Extra weighted-list files to include
.Include</etc/dansguardian/weightedphraselist.topic1>
#.Include</etc/dansguardian/weightedphraselist.topic2>
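
To see at a glance which lists are switched on, the .Include lines can be read with a short script. This is a minimal sketch under stated assumptions: `list_includes` is a hypothetical helper, and it treats any .Include line prefixed with one or more # characters as disabled.

```python
import re

# Optional leading '#' characters, then .Include<path>.
INCLUDE_RE = re.compile(r"^(#*)\s*\.Include<([^<>]+)>\s*$")


def list_includes(config_text):
    """Return (path, enabled) pairs for every .Include line in the text."""
    result = []
    for line in config_text.splitlines():
        match = INCLUDE_RE.match(line.strip())
        if match:
            enabled = match.group(1) == ""  # no '#' prefix means enabled
            result.append((match.group(2), enabled))
    return result


sample = """## Extra weighted-list files to include
.Include</etc/dansguardian/weightedphraselist.topic1>
#.Include</etc/dansguardian/weightedphraselist.topic2>
"""

for path, enabled in list_includes(sample):
    print("enabled " if enabled else "disabled", path)
```

Ordinary comment lines, such as the first line of the sample, are ignored because they contain no .Include directive.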

To obtain the latest available phraselists, visit http://contentfilter.futuragts.com/phraselists/

The phraselists include portions for quite a few different languages. Since each enabled language inevitably adds to the number of “false positives”, enable only those languages you really need to be filtered.

Unexpected Results for Weighted Phraselists

If you edit the phraselist files incorrectly, the following may happen. Use this list as a guide if unexpected banned words show up when browsing the web.

<keyword><0> - score set to zero is treated as a banned word

<keyword> <50> - a space between the word and score is treated as a banned word

<keyword> this is misc text <50> - extra text between word and score treats <keyword> as a banned word

<keyword> - treated as a banned word

<keyword1>,<keyword2><keyword3><50> - missing a comma - treats <keyword1>,<keyword2> as a banned phrase, ignores <keyword3>
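
The mistakes listed above can be caught before a list is deployed. The following Python sketch flags lines of a weighted phraselist that fall into one of those cases. `check_weighted_line` is a hypothetical helper, and the regular expression is an assumption about what a well-formed entry looks like based on the examples on this page. It also flags anything that follows the score on the same line.

```python
import re

# Assumed well-formed shape: <phrase>[,<phrase>...]<score> with nothing else.
WELL_FORMED = re.compile(r"^((?:<[^<>]*>,)*<[^<>]*>)<(-?\d+)>$")


def check_weighted_line(line):
    """Return a warning string for a suspicious weighted entry, else None."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#") or stripped.startswith(".Include"):
        return None  # blank lines, comments, and includes are not entries
    match = WELL_FORMED.match(stripped)
    if match is None:
        return "not in <phrase>[,<phrase>...]<score> form"
    if int(match.group(2)) == 0:
        return "score is zero"
    return None


suspect_lines = [
    "<keyword><0>",
    "<keyword> <50>",
    "<keyword> this is misc text <50>",
    "<keyword>",
    "<keyword1>,<keyword2><keyword3><50>",
    "<education><-25>",
]

for line in suspect_lines:
    print(repr(line), "->", check_weighted_line(line))
```

Only the last sample line passes; the others are reported either as malformed or as having a zero score.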

Phraselist Format Conventions

There are several conventions that we use when building the phraselists - both for consistency and to make things work.

  • Use UNIX style line returns (LF Only)
  • Put one blank line at the end of the file
  • Use lower case for all phrases except encoded ones
  • Include the following information at the top of the file
    • Phraselist description (e.g. intended purpose)
    • Intended character encoding(s) (e.g. ISO8859-15, UTF-8)
    • Whether any accented letters, special symbols, etc. (i.e. character codes above 127) are used anywhere
    • Original creator
    • Sponsor if appropriate
    • Any additional reference information (such as names of very closely related files)
  • Below the header, put the #listcategory name, which identifies the phraselist file a phrase came from when a page is blocked.

For example:

#
# Phraselists to block Japanese pornographic and explicit sites.
# Originally Created by Fernand Jonker 
# Sponsored by Eric Duveau
#
# Some words and meanings from 
# http://en.wikipedia.org/wiki/List_of_Japanese_sex_terms
# http://en.wikipedia.org/wiki/Pornography_in_Japan
# 
# Note that all phrases are in UTF-8, EUC-JP and SHIFT_JIS character sets.

#noconvert
#listcategory: "Pornography (Japanese)"

#Block (very large weights) Un-filtered character sets
#<x-sjis><250>          #New character set - not used much
#<ISO-2022-JP><250>     #Used primarily for email
#<CP932><250>           #Unknown 

...
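
Some of the conventions above can be checked mechanically. The sketch below inspects the raw bytes of a phraselist for three of them: LF-only line endings, the presence of a #listcategory line, and whether any bytes above 127 occur (which the header is expected to mention). `inspect_phraselist` is a hypothetical helper, and the sample list is invented for illustration.

```python
def inspect_phraselist(data):
    """Report convention-related facts about raw phraselist bytes."""
    lines = data.split(b"\n")
    return {
        # No carriage returns means UNIX-style (LF only) line endings.
        "lf_only": b"\r" not in data,
        # The category line is expected below the header comments.
        "has_listcategory": any(l.startswith(b"#listcategory:") for l in lines),
        # Bytes above 127 indicate accented letters or multi-byte encodings.
        "has_non_ascii": any(byte > 127 for byte in data),
    }


sample = (
    b"#\n"
    b"# Example list (hypothetical)\n"
    b"#\n"
    b"\n"
    b'#listcategory: "Example"\n'
    b"\n"
    b"<education><-25>\n"
)

print(inspect_phraselist(sample))
```

The file is read as bytes rather than text so that the check does not depend on guessing the list's character encoding.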

Searching Phraselists

If a phrase is blocking pages you need to access, you can search the phraselists from the command line using the following command:

grep -R -i "word you are looking for" /etc/dansguardian/phraselists

(Depending on your system's capabilities and configuration, this search method might not find words stored in uncommon character encodings.) Some versions of the DansGuardian Webmin Module provide a similar phrase search capability, with the same character encoding limitations as this command.

This command will output something like:

# grep -R -i "homo" /etc/dansguardian/phraselists/
/etc/dansguardian/phraselists/pornography/weighted:<best homosexual><50>
/etc/dansguardian/phraselists/pornography/weighted:<homoerotic><20>
/etc/dansguardian/phraselists/pornography/weighted:<homosexual><5>
/etc/dansguardian/phraselists/pornography/weighted_german:<Homo><20>
/etc/dansguardian/phraselists/pornography/weighted_french:< homosexuel ><15>
/etc/dansguardian/phraselists/pornography/weighted_portuguese:< homosexual><10>
/etc/dansguardian/phraselists/pornography/weighted_portuguese:< homossexual><10>
/etc/dansguardian/phraselists/googlesearches/banned:<?q=homosexual+women&>
/etc/dansguardian/phraselists/googlesearches/banned:<?q=homosexual+girls&>
/etc/dansguardian/phraselists/googlesearches/banned:<?q=homo&>
/etc/dansguardian/phraselists/googlesearches/banned:<?q=+homosexual+women>
/etc/dansguardian/phraselists/googlesearches/banned:<?q=+homosexual+girls>
/etc/dansguardian/phraselists/googlesearches/banned:<?q=+homo>
/etc/dansguardian/phraselists/googlesearches/banned:<?q=+homosexual+women+>
/etc/dansguardian/phraselists/googlesearches/banned:<?q=+homosexual+girls+>
/etc/dansguardian/phraselists/googlesearches/banned:<?q=+homo+>
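
Where the encoding limitation mentioned above is a concern, one possible workaround is to search the raw bytes for the word encoded in each encoding you expect the lists to use. The following Python sketch does this. `search_phraselists` is a hypothetical helper, the default list of encodings is an assumption, and case is ignored for ASCII letters only.

```python
import os


def search_phraselists(root, word,
                       encodings=("utf-8", "iso-8859-15", "euc_jp", "shift_jis")):
    """Yield (path, line) for lines containing `word` in any candidate encoding."""
    needles = set()
    for encoding in encodings:
        try:
            # bytes.lower() only affects ASCII letters; it is applied to both
            # the needle and each line so the comparison stays consistent.
            needles.add(word.encode(encoding).lower())
        except UnicodeEncodeError:
            continue  # the word cannot be written in this encoding
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                for raw in handle:
                    if any(needle in raw.lower() for needle in needles):
                        yield path, raw.rstrip(b"\r\n")
```

For example, `list(search_phraselists("/etc/dansguardian/phraselists", "homo"))` would return the matching lines as bytes, paired with the file each one was found in. Matches are returned undecoded because the encoding of each file is not known in advance.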