Repel http:BL

by Terje Norderhaug

Repel http:BL is used with the Apache webserver to identify friendly search engines and detect malicious web bots such as email address harvesters and comment spammers. It accesses the DNS blacklist registry compiled by Project Honeypot to reliably determine the type and threat level of robots visiting your server. Repel can for example be used to:

Eliminating the flood of requests often made by malicious bots may improve the performance of web servers. Preventing harvesting of email addresses may lead to less spam. Blocking comment spammers from posting may reduce clutter on blogs and message boards.

Repel is free, open source software licensed under the LGPL. It requires Python 2.3.5 or later.

Getting Started

Download the latest version of Repel. You can activate it for Apache webservers using rewrite rules:

  1. Open the configuration file for Apache.
  2. Enable the rewrite engine by placing the following in the configuration file:

    RewriteEngine on

  3. Declare a rewrite lock to synchronize communication with mapping programs:

    RewriteLock /path/to/Apache/rewritelock.lock

  4. Declare a rewrite map using Repel as mapping function:

    RewriteMap REPEL "prg:/path/to/scripts/Repel/repel.py --key=honeypotkey"

    Use your own key from Project Honeypot in place of honeypotkey. Alternatively, you can set the key in the Repel options file.

  5. Define a condition for when to apply the rewrite rule, for example, when the IP address of the request matches any suspicious or malicious bot:

    RewriteCond ${REPEL:%{REMOTE_ADDR}|OK} Suspicious|Malicious

  6. Define a rule that applies to clients matching the condition, for example to refuse access to all locations:

    RewriteRule ^.* - [F]

    Place this rule on the line immediately following the rewrite condition.

    You are encouraged to install your own honeypot and redirect harvesters and comment spammers to it. For example, you can define a rewrite condition as above and use a rewrite rule like:

    RewriteCond ${REPEL:%{REMOTE_ADDR}|OK} Harvester|CommentSpammer
    RewriteRule ^.* /cgi-bin/honeypot.py [L]

Note that each virtual host definition needs its own RewriteEngine on directive to enable rewriting. Optionally place a RewriteOptions inherit directive in the Apache virtual host definition to apply the rewrite rules of the main server. Keep in mind that the main server's rewrite rules are applied after the rewrite rules of the virtual host, regardless of their order in the configuration file.

For questions about configuration or other concerns, please visit the Repel support forum.

Response Format

Repel is technically a filter: it reads IP addresses from its input, looks each one up with a DNSBL query to a DNS server, and emits the results in the same order as the input, in a format suitable for regular expression matching. Start Repel with the --log option to log responses to a file so you can examine the format:

RewriteMap REPEL "prg:/path/to/scripts/Repel/repel.py --key=honeypotkey --log=httpbl.log"
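
The filter behaviour described in this section can be sketched in a few lines of Python. This is an illustration only, not Repel's actual source, and it is written for current Python 3 rather than the Python 2 releases Repel supports. The names query_name and run_filter are hypothetical, honeypotkey is a placeholder, and the query layout (access key, then the reversed IPv4 octets, then the DNSBL domain) is assumed from the http:BL convention documented by Project Honeypot. The lookup is passed in as a function so the sketch runs without network access.

```python
import io

DNSBL_DOMAIN = "dnsbl.httpbl.org"  # assumed default DNSBL host


def query_name(key, ip, domain=DNSBL_DOMAIN):
    """Build the DNS name to resolve for an IPv4 address (hypothetical helper)."""
    octets = ip.strip().split(".")
    if len(octets) != 4 or not all(o.isdigit() and int(o) <= 255 for o in octets):
        raise ValueError("not an IPv4 address: %r" % ip)
    # Key first, then the address octets in reverse order, then the domain.
    return ".".join([key] + octets[::-1] + [domain])


def run_filter(source, sink, lookup):
    """Read one address per line and write one result per line, in order."""
    for line in source:
        address = line.strip()
        if not address:
            continue  # ignore empty lines
        sink.write(lookup(address) + "\n")
        sink.flush()  # the caller waits for each answer, so never buffer


# Stand-in lookup that only shows the name a real query would resolve.
out = io.StringIO()
run_filter(io.StringIO("127.1.1.6\n"), out,
           lambda ip: query_name("honeypotkey", ip))
print(out.getvalue(), end="")  # honeypotkey.6.1.1.127.dnsbl.httpbl.org
```

Flushing after every line matters for a mapping program, because Apache sends one key and then waits for exactly one line in reply.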

When Repel identifies an IP address as a bot, it responds with a code consisting of four pairs of hexadecimal digits, separated by colons (e.g. "7F:01:01:06"). The meaning of these numbers (as decimals) is described in the Project Honeypot API. For your convenience, the code is followed by a combination of descriptive keywords that decode the result, so you can match these in the Apache rewrite conditions:

SearchEngine
An innocent, legit search engine. This keyword might be followed by an equal sign and a label identifying the specific engine, e.g. "SearchEngine=Google".
Suspicious
Has engaged in behavior that is consistent with a malicious bot, but malicious behavior has not yet been observed.
Malicious
Definitely a malicious bot.
Harvester
A bot harvesting email addresses.
CommentSpammer
A bot that automatically posts comments on blogs and web message boards.
Dormant=xx
How many days since the bot last visited a honeypot, as two hex digits larger than zero. Not included if the bot has been active the past day.
Threat=xx
A rough hexadecimal measure of the threat the bot may pose to your site. Not included if the bot is not known to pose any threat.
Expired=mm:ss:nn
The DNSBL query timed out after minutes, seconds, milliseconds.
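
As a rough illustration of how the hexadecimal code relates to these keywords, the following Python sketch converts a code into decimal fields. It is not taken from Repel: decode_code and TYPE_BITS are hypothetical names, and the field order (marker, days since last activity, threat score, visitor type bits) is an assumption based on the Project Honeypot http:BL API, so check that documentation before relying on it.

```python
# Visitor-type bits as assumed from the Project Honeypot http:BL API.
TYPE_BITS = {1: "Suspicious", 2: "Harvester", 4: "CommentSpammer"}


def decode_code(code):
    """Split a code such as "7F:01:01:06" into its decimal fields."""
    parts = code.strip().split(":")
    if len(parts) != 4:
        raise ValueError("expected four hex pairs: %r" % code)
    marker, days, threat, visitor = (int(p, 16) for p in parts)
    if visitor == 0:
        # Assumption: for search engines the third field identifies the
        # engine rather than giving a threat score.
        kinds = ["SearchEngine"]
    else:
        kinds = [name for bit, name in sorted(TYPE_BITS.items()) if visitor & bit]
    return {"marker": marker, "days": days, "threat": threat, "kinds": kinds}


print(decode_code("7F:01:01:06"))
# {'marker': 127, 'days': 1, 'threat': 1, 'kinds': ['Harvester', 'CommentSpammer']}
```

In a rewrite condition it is usually simpler to match the keywords directly; decoding the numbers is mainly useful when post-processing a log.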

Search Engine Labels

These labels identify friendly search engines in the response:

Command Options

The command line can take the following options:

-k or --key
Access key provided by the DNSBL host.
-d or --domain
The domain name of the DNSBL host.
-f or --format
File or directory pathname defining the result format.
-l or --log
File pathname to log the DNSBL responses in a format usable by Apache text rewrite maps.
-t or --timeout
A decimal number denoting the time in seconds before an unsuccessful lookup is considered to have expired.
-c or --cache
A whole number signifying the number of recent requests kept in the cache to reduce redundant DNSBL queries.

To get a list with other options, start the application with -h.
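
To show what the --cache option is meant to achieve, here is a minimal Python sketch of a bounded cache of recent lookups. It is an assumption about the general technique, not a description of how Repel implements its cache; the class name RecentCache and its methods are hypothetical.

```python
from collections import OrderedDict


class RecentCache:
    """Remember the results for the most recently seen addresses."""

    def __init__(self, size):
        self.size = size
        self.entries = OrderedDict()

    def get(self, address, lookup):
        """Return the cached result, calling lookup(address) only on a miss."""
        if address in self.entries:
            self.entries.move_to_end(address)  # mark as recently used
            return self.entries[address]
        result = lookup(address)
        self.entries[address] = result
        if len(self.entries) > self.size:
            self.entries.popitem(last=False)  # drop the least recently used
        return result


# Usage with a stand-in lookup that records which addresses were queried.
queried = []


def fake_lookup(address):
    queried.append(address)
    return "NONE"


cache = RecentCache(2)
cache.get("192.0.2.10", fake_lookup)
cache.get("192.0.2.10", fake_lookup)  # served from the cache
print(queried)  # ['192.0.2.10']
```

A larger cache avoids more repeated queries from busy clients at the cost of memory and of serving slightly older results.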

Batch processing

You can run Repel as a filter from a terminal/shell, for testing or batch processing. By default, it reads lines of IP addresses from standard input and outputs the decoded DNSBL response.

Alternatively, list one or more filenames on the command line as sources for IP addresses to look up. The output is in Apache rewrite map format, with the original address first on each line.
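
Output in this layout is easy to post-process. The sketch below, in Python, reads text in the Apache text rewrite map layout (a key, whitespace, then a value, with # starting a comment line) into a dictionary. The function name parse_rewrite_map is hypothetical and the sample lines are made up for illustration; actual Repel output may carry more detail per line.

```python
def parse_rewrite_map(text):
    """Return {address: response} from text in Apache txt rewrite map layout."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) >= 2:
            # The first token is the key, the second token is the value.
            table[fields[0]] = fields[1]
    return table


# Made-up sample data, not captured from a real run.
sample = """\
# address      response
192.0.2.10     NONE
192.0.2.20     7F:01:01:06
"""
print(parse_rewrite_map(sample))
# {'192.0.2.10': 'NONE', '192.0.2.20': '7F:01:01:06'}
```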

Optimization

The tradeoff of using Repel is a slight increase in latency for the requests that require HTTP:BL verification, but this can be minimized by skipping the test for apparent human visitors.

When you have the basic configuration working, you may add additional rewrite conditions to bypass Repel for requests that almost certainly are made by humans, such as when the visitors have authenticated with a password, come from within your own domain, or have a cookie that proves earlier access.

You can speed up DNSBL queries by running a local DNS server. A local server can answer repeated lookups for the same address without contacting the DNSBL host again, even when there is a time lag between the requests.

Administrators of demanding web servers may consider the mod_httpbl Apache module as an alternative.

Troubleshooting

All responses are NONE
If all responses are NONE, you likely forgot to provide a DNSBL access key. Get one from Project Honeypot.

Visit the Repel support forum for other issues.
