DansGuardian Documentation Wiki

Differences

This shows you the differences between the selected revision and the current version of the page.

pattern_matching 2008/11/07 15:39 pattern_matching 2008/11/07 15:48 current
Line 71: Line 71:
The URL-like text against which patterns are matched
differs from the original in important ways,
-all of which simplify the regular expressions that attempt to match against it.
+all of which simplify the regular expressions that attempt to match against it. Be aware of these differences from what's in the log:
 + 
 +  - The "protocol" is not present at the beginning of the URL. Rather, the URL begins with the actual host/site name.  
 +  - Hex character encodings are already interpreted (HTML entities are another matter). So regular expressions can simply try to match against actual characters without worrying too much about alternate representations of them. 
 +  - The URL text is all by itself, not embedded in a longer log entry. The beginning of the URL is also the very beginning of text, and the end of the URL is also the very end of text. 
 +  - Everything is treated as lower case. There's never any reason to allow for an upper case alternative, not even if the URL as presented in access.log contains upper case characters. 
 + 
 +For example--\\  
 +what you see in the log:\\  
 +<html>&nbsp;&nbsp;</html>//<nowiki>... http://Site.or.Host.name.tld%2FLong%20Path%2Fsubpath/filename%2EEXT ...</nowiki>//\\  
 +what the regular expression tries to match:\\  
 +<html>&nbsp;&nbsp;</html>//site.or.host.name.tld/long path/subpath/filename.ext//
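
The four differences listed above can be sketched in a few lines of Python. This is only an illustration of the transformation described on this page, not DansGuardian's actual code: the function name `normalize_for_matching` is hypothetical, and Python's `re` module stands in for whatever regular expression engine DansGuardian is built with.

```python
import re
from urllib.parse import unquote


def normalize_for_matching(logged_url: str) -> str:
    """Approximate the URL-like text that patterns are matched against.

    Hypothetical helper for illustration only; not part of DansGuardian.
    """
    # Difference 1: the protocol is dropped, so the text starts at the host name.
    _, separator, rest = logged_url.partition("://")
    text = rest if separator else logged_url
    # Difference 2: hex (percent) encodings are already interpreted.
    text = unquote(text)
    # Difference 4: everything is treated as lower case.
    return text.lower()


logged = "http://Site.or.Host.name.tld%2FLong%20Path%2Fsubpath/filename%2EEXT"
matched_text = normalize_for_matching(logged)

# Difference 3: the URL stands alone, so ^ and $ anchor to the URL itself.
starts_at_host = re.search(r"^site\.or\.host", matched_text) is not None
ends_at_ext = re.search(r"\.ext$", matched_text) is not None
```

With the log example above as input, `matched_text` holds the same text shown in the "what the regular expression tries to match" line, which is why a pattern can be written in lower case, without a protocol prefix, and with plain `^` and `$` anchors.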
(Usually if there's interception it's only port 80,
Line 81: Line 92:
in URLs you're matching against.)
-Difference 1) The "protocol" is not present at the beginning of the URL. Rather, the URL begins with the actual host/site name.  
- 
-Difference 2) Hex character encodings are already interpreted (HTML entities are another matter). So regular expressions can simply try to match against actual characters without worrying too much about alternate representations of them. 
- 
-Difference 3) The URL text is all by itself, not embedded in a longer log entry. The beginning of the URL is also the very beginning of text, and the end of the URL is also the very end of text. 
- 
-Difference 4) Everything is treated as lower case. There's never any reason to allow for an upper case alternative, not even if the URL as presented in access.log contains upper case characters. 
- 
-For example--\\  
-what you see in the log:\\  
-<html>&nbsp;&nbsp;</html>//<nowiki>... http://Site.or.Host.name.tld%2FLong%20Path%2Fsubpath/filename%2EEXT ...</nowiki>//\\  
-what the regular expression tries to match:\\  
-<html>&nbsp;&nbsp;</html>//site.or.host.name.tld/long path/subpath/filename.ext// 
==== DansGuardian Regexp Tricks ====
We'll not try to completely explain regular expressions here, as there are a number of good tutorials on the net. We will, though, show you the form of some frequently used constructs in DansGuardian. (If you're interested in digging deeply into all the gory details, see the book "Mastering Regular Expressions", by Jeffrey E.F. Friedl, published by O'Reilly, ISBN 0-596-52812-4, ISBN13 978-0-596-52812-6.)