mod_webfilter: an Apache2 content filter module
Even though the free software community opposes censoring efforts on the Internet, corporations still prefer to restrict their employees' access to the Internet with technical measures. Flexible solutions for URL filtering have so far mostly been provided by commercial software vendors (e.g. WebSense); free software implementations were usually not versatile enough for commercial environments. Unfortunately, this also meant that commercial products were used for proxies and firewalls, and sadly for many other things as well (WebSense does not work with Squid or Apache).
We believe that mod_webfilter could improve this.
In the webfilter databases, each hostname or domain suffix is categorized
with one or more categories describing the content the site offers.
The administrator can configure the module with a whitelist and
a blacklist.
If the hostname requested by a user has a category specified in
the whitelist, the request is accepted, even if the subsequent blacklist
test would reject it.
If the requested hostname has a category specified in the blacklist,
it is rejected.
All other requests are accepted, but tag filtering, an additional
capability of the module, is applied to the content delivered
for this URL.
In principle, one database for both blacklist and whitelist would be enough. Nevertheless, the module allows the databases for whitelist and blacklist to be different. This makes sense, for example, in a setting where the blacklist is imported from some public database of `indecent' URLs and the whitelist is a locally maintained database of exceptions.
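The decision order described above (whitelist first, then blacklist, otherwise filter) can be sketched in C as follows. This is only an illustration under stated assumptions: the type wf_decision and the functions wf_categories_overlap and wf_decide are hypothetical names and are not taken from the mod_webfilter sources, and the database lookups that produce the host's categories are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Possible outcomes for a request (hypothetical names). */
typedef enum { WF_ACCEPT, WF_REJECT, WF_FILTER } wf_decision;

/* Return true if any of the n_a strings in a also occurs in b. */
static bool wf_categories_overlap(const char *const *a, size_t n_a,
                                  const char *const *b, size_t n_b)
{
    for (size_t i = 0; i < n_a; i++)
        for (size_t j = 0; j < n_b; j++)
            if (strcmp(a[i], b[j]) == 0)
                return true;
    return false;
}

/*
 * host_white / host_black: categories stored for the host in the
 * whitelist and blacklist databases (n == 0 if the host is unknown).
 * white_cats / black_cats: categories configured by the administrator.
 */
static wf_decision wf_decide(const char *const *host_white, size_t n_hw,
                             const char *const *host_black, size_t n_hb,
                             const char *const *white_cats, size_t n_wc,
                             const char *const *black_cats, size_t n_bc)
{
    /* The whitelist is tested first, so it always takes precedence. */
    if (wf_categories_overlap(host_white, n_hw, white_cats, n_wc))
        return WF_ACCEPT;
    if (wf_categories_overlap(host_black, n_hb, black_cats, n_bc))
        return WF_REJECT;
    /* Everything else is delivered, but with tag filtering applied. */
    return WF_FILTER;
}
```

With a single database used for both lists, host_white and host_black are simply the same category array, so a host categorized as both "ok" and "sex" is accepted.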
Here are some example scenarios of how mod_webfilter
can be used:
mod_webfilter
gives a solution that allows configuration
changes on the fly.
mod_webfilter can help solve this problem.
The mod_webfilter blacklist can do this.
mod_webfilter can do more: using the string matching
functionality, one can remove certain strings from the pages delivered
to users. Personally, I oppose any sort of censorship: I'm perfectly
capable of thinking for myself.
mod_webfilter can easily be circumvented by users who use an HTTPS-based
anonymizer, so if you are serious about censoring, you should also censor
all the anonymizers. And finally: Internet abuse is usually a leadership
problem or a psychological problem; technical measures do not
resolve the problem, but only mitigate a symptom.
With mod_webfilter, it is relatively
easy to strip script tags or ActiveX controls from the HTML pages sent
to users.
mod_webfilter allows one to match
strings in the HTML page sent to users and replace them with some other
text (functionality not yet implemented, but on the TODO list).
mod_webfilter
could then
download this database and merge it with their own databases.
mod_webfilter
was written and is currently maintained
by Andreas Müller.
It is distributed under the GNU
GPL and has its home at
http://software.othello.ch/mod_webfilter.
The current version is 0.6; it can be downloaded from
http://software.othello.ch/mod_webfilter/mod_webfilter-0.6.tar.gz.
Before installing mod_webfilter, make sure you have
GDBM installed, which you can get from any GNU mirror. Due to
naming clashes inside Apache, we have to use GDBM, which uses
a unique set of function names.
To install mod_webfilter, unpack the distribution in a
suitable place, configure it with the familiar
    ./configure --with-apache=/path/to/your/apache/dir
and run make, followed by make install.
A typical webfilter configuration could look like the following:
    WebfilterWhitelist /usr/local/apache2/conf/blackandwhitelist
    WebfilterWhitelistCategories ok
    WebfilterBlacklist /usr/local/apache2/conf/blackandwhitelist
    WebfilterBlacklistCategories sex
    WebfilterMutexname webfilter.mutex
    WebfilterTags script
    WebfilterRegexp <script[^>]*src=[^>]*>
    WebfilterRegexp <noscript[^>]*>
    WebfilterRegexp </noscript[^>]*>
    <Proxy *>
        Order deny,allow
        Deny from all
        Allow from all
        SetOutputFilter WEBFILTER
    </Proxy>
This configuration allows unrestricted access to sites with category ok in the database /usr/local/apache2/conf/blackandwhitelist, and completely blocks access to sites with category sex in the same database. All other sites are filtered: script tags are removed together with their contents, script tags that reference an external source using the SRC attribute are overwritten with blanks, and noscript tags are also overwritten (they would prevent content from being displayed on script-enabled browsers, and since the filter tries to turn the browser into something not script-aware, it should also remove noscript tags).
The module also sports a status page, which is delivered by the module's content handler. To access this information, add a location like
    <Location /webfilter>
        SetHandler webfilter
    </Location>
Note that public access to the location may reveal information about your filtering policy that you may not want to publicize.
The blacklist and whitelist databases are created and maintained with
the command line utilities webfilter_create, webfilter_add,
webfilter_delete, and webfilter_dump, which are documented
in their respective manual pages. For the type databases, there is a
single program, webfilter_type.
In addition, an admittedly simplistic PHP script, webfilter.php, is provided;
the install process puts it in the directory specified with the (required)
--with-htdocs configure option. Note that the paths to your
whitelist/blacklist databases must be configured in the script.
The paths to the binaries are edited automagically into the PHP
script.
At the moment, most problems arise from not setting the buffer size large enough. This usually leads to broken pages, incomplete content, or just empty documents. See the WebfilterBuffersize configuration directive below.
The module reads ahead about 20kB from each page before it starts to deliver filtered content to a client. Without the filter, the client could start retrieving images from the remote site as soon as it has seen the IMG tags. On a modem line, downloading images therefore starts about three seconds later than without the filter. On links typically found at corporate sites, the impact is negligible.
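The three-second figure can be checked with a rough calculation. The link speeds used here (a 56 kbit/s modem and a 10 Mbit/s corporate link) are assumptions chosen for illustration and are not stated by the module's documentation:

```latex
t_{\text{modem}} \approx \frac{20 \times 1024 \times 8\ \text{bit}}{56\,000\ \text{bit/s}} \approx 2.9\ \text{s},
\qquad
t_{\text{10 Mbit/s}} \approx \frac{20 \times 1024 \times 8\ \text{bit}}{10^{7}\ \text{bit/s}} \approx 16\ \text{ms}
```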
Note that the filter does not affect what is in the cache. Disabling the filter and reloading a page will most probably give the unaltered page from the cache.
Below you can find all the configuration directives introduced by the
mod_webfilter
module.
whitelistdb is the database of whitelisted hosts; responses for URLs matching this database will not be processed by the filter. The matching function includes a check for the category stored with the hostname in the database: a match occurs only if one of the categories stored with the hostname coincides with one of the categories specified with the WebfilterWhitelistCategories directive (see below).
Note that whitelistdb must not include file name extensions like
.pag or .dir usually present with NDBM databases.
blacklistdb is the database of blacklisted hosts; responses for URLs matching this database will be blocked, unless they happen to be members of the whitelist. The whitelist always takes precedence.
A list of category names to be whitelisted. A match occurs only if one of the categories stored for the host matches one of the categories given here.
A list of category names to be blacklisted. A match occurs only if one of the categories stored for the host matches one of the categories given here.
whitelistdb is the base name of the database of whitelisted types. Responses with types in this database are accepted, and will not be filtered.
blacklistdb is the base name of the database of blacklisted types. Responses with types in this database, and types not matched by the whitelist type database will be rejected.
By default, mod_webfilter just filters all hosts that
are neither blacklisted nor whitelisted. If this option is turned on,
mod_webfilter only accepts a URL if it is whitelisted,
and it turns on filtering for this host.
If only expr1 is given, replace every match by the same number of blank characters. If expr1 and expr2 are specified, replace everything between a match of expr1 and the first subsequent match of expr2 by the same number of blank characters. If more than one match is possible for expr1, the one with the longest regular expression is preferred.
Matches are limited in size to the buffer size specified with the WebfilterBuffersize directive. However, once expr1 has matched, there is no limit to how far away the match with expr2 may be. No matches for other expressions are accepted until expr2 matches.
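The single-expression case (overwrite every match with the same number of blanks) can be sketched with the POSIX regex API. This is a minimal illustration, not the module's code: the function name wf_blank_matches is hypothetical, the regular expression flavour the module actually uses is not stated here (POSIX basic syntax is assumed), and the sketch works on one NUL-terminated buffer rather than a stream.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Overwrite every match of pattern in the NUL-terminated buffer buf
 * with blanks, keeping the buffer length unchanged.
 * Returns the number of matches blanked, or -1 if the pattern
 * cannot be compiled.
 */
static int wf_blank_matches(char *buf, const char *pattern)
{
    regex_t re;
    regmatch_t m;
    int count = 0;
    char *p = buf;

    if (regcomp(&re, pattern, 0) != 0)
        return -1;

    /* Search repeatedly, restarting after the end of each match. */
    while (*p != '\0' &&
           regexec(&re, p, 1, &m, p == buf ? 0 : REG_NOTBOL) == 0) {
        if (m.rm_eo == m.rm_so) {   /* empty match: step over one char */
            p++;
            continue;
        }
        memset(p + m.rm_so, ' ', (size_t)(m.rm_eo - m.rm_so));
        count++;
        p += m.rm_eo;
    }
    regfree(&re);
    return count;
}
```

Because matches are replaced by blanks of equal length, the size of the document stays the same.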
The WebfilterTags directive implicitly generates pairs of regular expressions.
Specify a list of tags that should be replaced by blanks. E.g. if you specify
script, every occurrence of the script tag will be replaced by
blanks, including the content of the tag.
To do the filtering, mod_webfilter reads part of the HTML
document into an internal buffer and then matches on this buffer.
Matches that are larger than this buffer are not found and thus not
filtered. This directive allows the buffer size to be set to a larger
value.
Internally, the WebfilterTags directive works by creating regular expression patterns
for each tag, and processing them together with the expressions specified
by the WebfilterRegexp directive. As an example, specifying the script tag
as one to filter generates the expression <script[^>]*/>
and the pair of expressions <script[^>]*> and </script[^>]*>.
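How such patterns might be generated for an arbitrary tag name can be sketched as follows. The function wf_tag_patterns is a hypothetical helper, and the sketch assumes the generated patterns carry a `*` quantifier after the bracket expression, as the WebfilterRegexp lines in the sample configuration do; the module's real pattern generation may differ.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Write the three patterns for tag into the caller's buffers, each of
 * the given size: the self-closing form, the opening tag, and the
 * closing tag. Returns 0 on success, -1 if a buffer is too small.
 */
static int wf_tag_patterns(const char *tag, char *selfclosing,
                           char *open, char *close, size_t size)
{
    int a = snprintf(selfclosing, size, "<%s[^>]*/>", tag);
    int b = snprintf(open, size, "<%s[^>]*>", tag);
    int c = snprintf(close, size, "</%s[^>]*>", tag);

    if (a < 0 || b < 0 || c < 0)
        return -1;
    if ((size_t)a >= size || (size_t)b >= size || (size_t)c >= size)
        return -1;   /* output was truncated */
    return 0;
}
```

The opening and closing patterns would then be handled like an expr1/expr2 pair given to WebfilterRegexp, and the self-closing pattern like a single expression.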
Shared memory is used to keep common statistics across Apache processes, but the identification of shared memory segments depends on the operating system. Use this directive to provide a unique key for the shared memory to the operating system.
A process mutex is used to mediate access to the shared memory segment. As with shared memory, mutexes need some platform-dependent identification, often a file name. Use this directive to specify one.
This section is mainly intended for users building their own administrative front ends to the database.
Entries in a whitelist or blacklist host database are keyed by the hostname (or the domain name prefixed by a period if domain matching is intended). The key in the NDBM database includes the trailing '\0' character, because that makes it easier to use the strings returned from the database directly (if the trailing nul is not included, a string has to be copied to a zero-terminated string before it can be used).
The value part of the entry is just the concatenation of nul-terminated strings, including the final '\0' character. So to scan through the set of categories returned in a datum structure named value, the following piece of code is used throughout mod_webfilter:
    char *categories = (char *)value.dptr;
    /* each category is a nul-terminated string; step from one to the next */
    for (int off = 0; off < value.dsize; off += strlen(categories + off) + 1) {
        char *category = categories + off;
        /* do something with the value pointed to by category */
    }
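For front ends that write entries, the reverse operation (building a value from a list of category names) might look like the sketch below. The function wf_pack_categories is a hypothetical helper, not part of mod_webfilter; it only follows the layout described above, where every category is stored with its trailing '\0'. The resulting buffer and size would presumably be placed in the dptr and dsize fields of a datum before storing it.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Concatenate n nul-terminated category names, each including its
 * trailing '\0', into a newly allocated buffer. The total size is
 * stored in *size. Returns NULL if n is 0 or allocation fails.
 * The caller frees the returned buffer.
 */
static char *wf_pack_categories(const char *const *cats, size_t n,
                                size_t *size)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += strlen(cats[i]) + 1;
    if (total == 0)
        return NULL;

    char *buf = malloc(total);
    if (buf == NULL)
        return NULL;

    size_t off = 0;
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(cats[i]) + 1;   /* include the '\0' */
        memcpy(buf + off, cats[i], len);
        off += len;
    }
    *size = total;
    return buf;
}
```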
Comments are stored in the same database, but their key is the hostname followed by a hash mark, which is not a legal character in a DNS name. This only increases the size of the database by a small fraction, and only if there are comments.
The type database uses a simpler structure: the key and value are always identical, and are the name of the type as a C-string, including the trailing '\0'-character.