mod_webfilter: an Apache2 content filter module

Even though the free software community opposes censoring efforts on the Internet, corporations still prefer to restrict their employees' access to the Internet with technical measures. Flexible solutions for URL filtering have so far mostly been provided by commercial software vendors (e.g. WebSense); free software implementations usually were not versatile enough for commercial environments. Unfortunately, this also meant that commercial products were used for proxies and firewalls, and sadly for many other things as well (WebSense does not work with Squid or Apache).

We believe that mod_webfilter could improve this. In the webfilter databases, each hostname or domain suffix is assigned one or more categories describing the content the site offers. The administrator can configure the module with a whitelist and a blacklist. If the hostname requested by a user has a category listed in the whitelist, the request is accepted, even if the subsequent blacklist test would reject it. If the hostname requested has a category listed in the blacklist, it is rejected. All other requests are accepted, but tag filtering, an additional capability of the module, is applied to the content delivered for this URL.
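The decision order described above can be sketched in C as follows. This is only an illustration, not the module's actual code: the names wf_decision, wf_decide and categories_overlap are invented for this example, and it assumes that the categories stored for the requested host have already been looked up.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Possible outcomes for a request (names are hypothetical). */
typedef enum { WF_ACCEPT, WF_REJECT, WF_FILTER } wf_decision;

/* Return true if any entry of `have` also appears in `wanted`. */
static bool categories_overlap(const char *const *have, size_t nhave,
                               const char *const *wanted, size_t nwanted)
{
    for (size_t i = 0; i < nhave; i++)
        for (size_t j = 0; j < nwanted; j++)
            if (strcmp(have[i], wanted[j]) == 0)
                return true;
    return false;
}

/*
 * Whitelist test first, then blacklist test; everything else is
 * delivered, but with tag filtering applied.
 */
wf_decision wf_decide(const char *const *host_cats, size_t nhost,
                      const char *const *white, size_t nwhite,
                      const char *const *black, size_t nblack)
{
    if (categories_overlap(host_cats, nhost, white, nwhite))
        return WF_ACCEPT;   /* whitelisted: accepted even if blacklisted */
    if (categories_overlap(host_cats, nhost, black, nblack))
        return WF_REJECT;   /* blacklisted and not whitelisted */
    return WF_FILTER;       /* unknown host: accept, but filter tags */
}
```

A host that carries both a whitelisted and a blacklisted category is accepted in this sketch, because the whitelist test runs first.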

In principle, one database for blacklist and whitelist would be enough. Nevertheless, the module allows the databases for whitelist and blacklist to be different. This makes sense, for example, in a setting where the blacklist is imported from some public database of 'indecent' URLs and the whitelist is a locally maintained database of exceptions.

Scenarios

Here are some example scenarios of how mod_webfilter can be used.

The success of censoring endeavors critically depends on the contents of the blacklist. I thus propose that a web site be put up for people to contribute their blacklists: every proxy administrator probably has a list of sites to block, collected by matching certain patterns against the proxy logs. Users of mod_webfilter could then download this database and merge it with their own databases.

How to get

mod_webfilter was written and is currently maintained by Andreas Müller. It is distributed under the GNU GPL and has its home at http://software.othello.ch/mod_webfilter. The current version is 0.6; it can be downloaded from http://software.othello.ch/mod_webfilter/mod_webfilter-0.6.tar.gz.

Installation

Before installing mod_webfilter, make sure you have GDBM installed, which you can get from any GNU mirror. Because of naming clashes inside Apache, we have to use GDBM, which uses a unique set of function names.

To install mod_webfilter, unpack the distribution in a suitable place, configure it with the familiar ./configure --with-apache=/path/to/your/apache/dir and run make, followed by make install.

Configuration

A typical webfilter configuration could look like the following:

WebfilterWhitelist     /usr/local/apache2/conf/blackandwhitelist
WebfilterWhitelistCategories   ok
WebfilterBlacklist     /usr/local/apache2/conf/blackandwhitelist
WebfilterBlacklistCategories   sex
WebfilterMutexname      webfilter.mutex

WebfilterTags           script
WebfilterRegexp         <script[^>]*src=[^>]*>
WebfilterRegexp         <noscript[^>]*>
WebfilterRegexp         </noscript[^>]*>
<Proxy *>
    Order deny,allow
    Deny from all
    Allow from all
    SetOutputFilter WEBFILTER
</Proxy>

This configuration allows unrestricted access to sites with category ok in the database /usr/local/apache2/conf/blackandwhitelist, and completely blocks access to sites with category sex in the same database. All other sites are filtered: script tags are removed together with their contents, script tags that reference an external source using the SRC attribute are overwritten with blanks, and noscript tags are also overwritten. (The noscript tags would prevent content from being displayed on script-enabled browsers, and since the filter tries to turn the browser into something that is not script aware, it should also remove the noscript tags.)

The module also sports a status page which is delivered by the module's content handler. To access this information, add a location like

<Location /webfilter>
        SetHandler webfilter
</Location>

Note that public access to this location may reveal information about your filtering policy that you may not want to publicize.

Database Management

The blacklist and whitelist databases are created and maintained with the command line utilities webfilter_create, webfilter_add, webfilter_delete and webfilter_dump, which are documented in their respective manual pages. For the type databases, there is a single program, webfilter_type.

In addition, an admittedly simplistic PHP script, webfilter.php, is provided. The install process puts it in the directory specified with the (required) --with-htdocs option of configure. Note that the paths to your whitelist and blacklist databases must be configured in the script by hand. The paths to the binaries are edited into the PHP script automatically.

Troubleshooting

At the moment, most problems arise from not setting the buffer size large enough. This usually leads to broken pages, incomplete content, or just empty documents. See the WebfilterBuffersize configuration directive below.

Performance Impact

The module reads ahead about 20 kB from each page before it starts to deliver filtered content to a client. Without the filter, the client could start retrieving images from the remote site as soon as it has seen the IMG tags. On a modem line, downloading images starts about three seconds later than without the filter. On links typically found at corporate sites, the impact is negligible.

Note that the filter does not affect what is in the cache. Disabling the filter and reloading a page will most probably give the unaltered page from the cache.

Reference

Below you can find all the configuration directives introduced by the mod_webfilter module.

WebfilterWhitelist

Syntax: WebfilterWhitelist whitelistdb
Default: none
Context: server config, virtual host, directory

whitelistdb is the database of whitelisted hosts; any responses for URLs matching this database will not be processed by the filter. The matching function includes a check for the category stored with the hostname in the database: a match occurs only if one of the categories stored with the hostname coincides with one of the categories specified with the WebfilterWhitelistCategories directive (see below).

Note that whitelistdb must not include file name extensions like .pag or .dir usually present with NDBM databases.

WebfilterBlacklist

Syntax: WebfilterBlacklist blacklistdb
Default: none
Context: server config, virtual host, directory

blacklistdb is the database of blacklisted hosts; any responses for URLs matching this database will be blocked unless they are also members of the whitelist. The whitelist always takes precedence.

WebfilterWhitelistCategories

Syntax: WebfilterWhitelistCategories categories ...
Default: empty
Context: server config, virtual host, directory

A list of category names to be whitelisted. A match occurs only if one of the categories stored for the host matches one of the categories given here.

WebfilterBlacklistCategories

Syntax: WebfilterBlacklistCategories categories ...
Default: empty
Context: server config, virtual host, directory

A list of category names to be blacklisted. A match occurs only if one of the categories stored for the host matches one of the categories given here.

WebfilterWhitetypes

Syntax: WebfilterWhitetypes whitelistdb
Default: none
Context: server config, virtual host, directory

whitelistdb is the base name of the database of whitelisted types. Responses with types in this database are accepted, and will not be filtered.

WebfilterBlacktypes

Syntax: WebfilterBlacktypes blacklistdb
Default: none
Context: server config, virtual host, directory

blacklistdb is the base name of the database of blacklisted types. Responses with types in this database, and responses with types not matched by the whitelist type database, will be rejected.

WebfilterStrict

Syntax: WebfilterStrict on | off
Default: off
Context: server config, virtual host, directory

By default, mod_webfilter just filters all hosts that are neither blacklisted nor whitelisted. If this option is turned on, mod_webfilter only accepts a URL if it is whitelisted, and it turns on filtering for this host.

WebfilterRegexp

Syntax: WebfilterRegexp expr1 [ expr2 ]
Default: none
Context: server config, virtual host, directory

If only expr1 is given, replace every match by the same number of blank characters. If expr1 and expr2 are specified, replace everything between a match of expr1 and the first subsequent match of expr2 by the same number of blank characters. If more than one match is possible for expr1, the one with the longest regular expression is preferred.

Matches are limited in size to the buffer size specified with the WebfilterBuffersize directive. Once expr1 has matched, however, there is no limit to how far away the match for expr2 may be. No matches for other expressions are accepted until expr2 matches.

The WebfilterTags directive implicitly generates pairs of regular expressions.
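To illustrate the single-expression case, here is a small sketch that overwrites every match with the same number of blanks, so that the length of the document does not change. It is not taken from the module: the function name wf_blank_matches is invented, POSIX extended regular expressions from <regex.h> are assumed, matching is assumed to be case insensitive, and the whole text is assumed to fit into one nul-terminated buffer. The paired expr1/expr2 form is not covered.

```c
#include <regex.h>
#include <string.h>

/*
 * Overwrite every match of `pattern` in the nul-terminated text `buf`
 * with blanks, in place. Returns the number of matches blanked, or -1
 * if the pattern does not compile.
 */
int wf_blank_matches(char *buf, const char *pattern)
{
    regex_t re;
    regmatch_t m;
    char *p = buf;
    int count = 0;
    int eflags = 0;

    /* REG_ICASE is an assumption: HTML tag names are case insensitive */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE) != 0)
        return -1;

    while (*p != '\0' && regexec(&re, p, 1, &m, eflags) == 0) {
        if (m.rm_eo == m.rm_so) {
            /* empty match: step past one character to guarantee progress */
            if (p[m.rm_so] == '\0')
                break;
            p += m.rm_so + 1;
        } else {
            /* same number of blanks as matched characters */
            memset(p + m.rm_so, ' ', (size_t)(m.rm_eo - m.rm_so));
            p += m.rm_eo;
            count++;
        }
        eflags = REG_NOTBOL;  /* p no longer points at the start of the text */
    }
    regfree(&re);
    return count;
}
```

Called with the pattern <script[^>]*src=[^>]*> from the configuration example, this blanks each script tag that references an external source and leaves the rest of the page untouched.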

WebfilterTags

Syntax: WebfilterTags tagname ...
Default: none
Context: server config, virtual host, directory

Specify a list of tags that should be replaced by blanks. For example, if you specify script, every occurrence of the script tag will be replaced by blanks, including the content of the tag.

WebfilterBuffersize

Syntax: WebfilterBuffersize size
Default: 20000
Context: server config, virtual host, directory

To do the filtering, mod_webfilter reads part of the HTML document into an internal buffer and then matches on this buffer. Matches that are larger than this buffer are not found and thus not filtered. This directive allows the buffer size to be set to a larger value.

Internally, the WebfilterTags directive works by creating regular expression patterns for each tag and processing them together with the expressions specified by the WebfilterRegexp directive. As an example, specifying the script tag as one to filter generates the expression <script[^>]*/> and the pair of expressions <script[^>]*> and </script[^>]*>.
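How such patterns could be derived from a tag name is sketched below. The structure wf_tag_patterns and the function wf_tag_patterns_init are invented for this illustration, and the sketch assumes that the character class [^>] is repeated with *, as in the WebfilterRegexp lines of the configuration example. The module's own code may build its patterns differently.

```c
#include <stdio.h>
#include <string.h>

/* The three patterns derived from one tag name (hypothetical layout). */
struct wf_tag_patterns {
    char single[80];  /* tag that is opened and closed in one element */
    char open[80];    /* opening tag, first half of the pair */
    char close[80];   /* closing tag, second half of the pair */
};

/*
 * Fill `out` with the patterns for `tag`. Returns 0 on success and -1
 * if the tag name is empty or a pattern does not fit into its buffer.
 */
int wf_tag_patterns_init(struct wf_tag_patterns *out, const char *tag)
{
    if (strlen(tag) == 0)
        return -1;

    int a = snprintf(out->single, sizeof out->single, "<%s[^>]*/>", tag);
    int b = snprintf(out->open, sizeof out->open, "<%s[^>]*>", tag);
    int c = snprintf(out->close, sizeof out->close, "</%s[^>]*>", tag);

    /* snprintf reports the untruncated length, so compare with the size */
    if (a < 0 || (size_t)a >= sizeof out->single)
        return -1;
    if (b < 0 || (size_t)b >= sizeof out->open)
        return -1;
    if (c < 0 || (size_t)c >= sizeof out->close)
        return -1;
    return 0;
}
```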

WebfilterShmname

Syntax: WebfilterShmname filename
Default: none
Context: server config

Shared memory is used to keep common statistics across Apache processes, but the identification of shared memory segments depends on the operating system. Use this directive to provide a unique key for the shared memory to the operating system.

WebfilterMutexname

Syntax: WebfilterMutexname filename
Default: none
Context: server config

A process mutex is used to mediate access to the shared memory segment. As with shared memory, mutexes need some platform-dependent identification, often a file name. Use this directive to specify one.

Database Structure

This section is mainly intended for users building their own administrative front ends to the database.

Hosts

Entries in a whitelist or blacklist host database are keyed by the hostname (or by the domain name prefixed by a period if domain matching is intended). The key in the NDBM database includes the trailing '\0' character, because that makes it easier to use the strings returned from the database directly (if the trailing nul is not included, a string has to be copied to a zero-terminated string before it can be used).
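A front end that wants to find the entry responsible for a given host could walk from the full hostname to ever shorter domain keys. The following sketch shows one way to do this. It is an assumption about the lookup order, not a description of the module's matching code, and the function names wf_next_key and wf_key_size are invented here.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return the next, less specific key after `key`, or NULL if there is
 * none. For "www.example.com" the sequence is ".example.com", ".com",
 * NULL. The result points into `key`, so nothing is copied.
 */
const char *wf_next_key(const char *key)
{
    /* skip the leading period of a domain key before searching */
    const char *start = (key[0] == '.') ? key + 1 : key;
    return strchr(start, '.');
}

/* Size of a key as stored in the database: the trailing nul counts. */
size_t wf_key_size(const char *key)
{
    return strlen(key) + 1;
}
```

Each key produced this way would be looked up in turn, with wf_key_size giving the size of the key including the trailing '\0' character.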

The value part of the entry is just the concatenation of nul-terminated strings, including the final '\0' character. So to scan through the set of categories returned in a datum structure named value, the following piece of code is used throughout mod_webfilter:

char	*categories = (char *)value.dptr;
/* step from one nul-terminated category string to the next */
for (int off = 0; off < value.dsize; off += strlen(categories + off) + 1) {
	char	*category = categories + off;
	/* do something with the value pointed to by category */
}
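Going the other way, the value part for a new entry can be assembled by laying the category names end to end, each with its terminating nul. The helper below is a sketch for front-end authors. The name wf_pack_categories is invented, and the module's own utilities may build the value differently.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Pack `n` category names into one malloc'ed buffer of nul-terminated
 * strings laid end to end. The total size, counting every '\0', is
 * stored in *size. Returns NULL if n is 0 or memory runs out; the
 * caller frees the buffer.
 */
char *wf_pack_categories(const char *const *cats, size_t n, size_t *size)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += strlen(cats[i]) + 1;
    if (total == 0)
        return NULL;

    char *buf = malloc(total);
    if (buf == NULL)
        return NULL;

    size_t off = 0;
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(cats[i]) + 1;   /* include the '\0' */
        memcpy(buf + off, cats[i], len);
        off += len;
    }
    *size = total;
    return buf;
}
```

The returned buffer and size would then be assigned to the dptr and dsize members of the datum that is stored in the database.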

Comments are stored in the same database, but their key is the hostname followed by a hash mark, which is not a legal character in a DNS name. This only increases the size of the database by a small fraction, and only if there are comments.
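Building the key for a comment entry then amounts to appending the hash mark to the hostname. In the sketch below, the function name wf_comment_key is invented, and it is assumed, by analogy with the host keys, that the trailing '\0' is part of the key.

```c
#include <stdio.h>

/*
 * Write the comment key for `host` into `out`. Returns the key size
 * including the trailing '\0' (an assumption, by analogy with host
 * keys), or -1 if the buffer is too small.
 */
int wf_comment_key(char *out, size_t outsize, const char *host)
{
    int n = snprintf(out, outsize, "%s#", host);
    if (n < 0 || (size_t)n >= outsize)
        return -1;
    return n + 1;
}
```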

Types

The type database uses a simpler structure: the key and value are always identical, and both are the name of the type as a C string, including the trailing '\0' character.


© Dr. Andreas Müller, Beratung und Entwicklung
$Id: mod_webfilter.html.in,v 1.9 2003/06/29 09:52:24 afm Exp $