Identifying and Stopping Spam

Tips to separate real emails from spam and stop spam from entering your nonprofit inbox

By: Mike Spykerman

April 7, 2009

This article was adapted from a white paper from Red Earth Software. It discusses how spam messages can be distinguished from legitimate messages by looking at email headers and message content and how spam can be blocked effectively by taking these typical spam characteristics into account. Red Earth's Policy Patrol is available to eligible nonprofits and public libraries through TechSoup Stock.

Spam is not only offensive and annoying; it also reduces productivity, consumes bandwidth, and costs organizations a lot of money. Therefore, every organization that uses email must take measures to block spam from entering its email systems. Although it might not be possible to block out all spam, just blocking a large portion of it will greatly reduce its harmful effects.

In order to effectively filter out spam and junk mail, we need to be able to distinguish spam from legitimate messages. To do this, we need to identify typical spam characteristics and practices. Once these practices are known, suitable measures can be put into place to block these messages. Of course, spammers are continually improving their tactics, so it is important to keep up-to-date on new spam practices to ensure that spam is still being blocked effectively.

Spam characteristics appear in two parts of a message: email headers and message content.

Email Headers

Email headers show the route an email has taken in order to arrive at its destination. They also contain other information about the email, such as the sender and recipient, the message ID, date and time of transmission, subject, and several other characteristics. Most spammers try to hide their identity by forging email headers or by relaying mail to hide the real source of the message. Since they need to send mails to a large number of recipients, spammers use certain methods for mass mailing that can be classified as pure spam practices and can, therefore, be identified in the email headers. Although newsletters and legitimate mailings are also sent to a large number of recipients, these will generally not display the same characteristics since the message source does not need to be concealed.

Headers can also be used to trace the origin of the spam message. However, in this article, we are mainly focusing on how to distinguish a spam message from a legitimate message by looking at the email headers, rather than actually tracing the sender of the spam message.

Typical email header characteristics in spam messages:

Recipient's email address is not in the To: or Cc: fields:
The reason for this is that the recipient's email address is hidden in the Bcc: field or X-Receiver field (which is a field that may not be displayed regularly in your email tool), along with a substantial number of other email addresses. Spammers do this in order to conceal the fact that the mail was sent to a large number of recipients, and presumably so as not to publish their email list. Some might add recipients to the Bcc: field for sending out "legitimate" mailings, but this method is uncommon for email of a professional nature. Note, however, that if you do block emails without a local recipient in the To: or Cc: field, you will be blocking all Bcc: messages, even legitimate ones.
Empty To: field:
This is also typical for spam messages. Because spammers send out bulk emails by entering all recipients in the Bcc: field or X-Receiver header, the To: field is often empty. RFC 822 (Appendix A.3.1), the original standard for the format of email messages, requires every message to carry at least one destination field (To:, Cc:, or Bcc:), and legitimate senders almost always put an address in To:. Therefore, an empty To: field is a strong hint of "shady practices."
To: field contains an invalid email address:
Instead of being empty or containing someone else's email address, the To: field can also contain a bogus email address, for example, one without an @ sign or a nonexistent one.
Missing To: field:
Emails that have no To: field at all can quite definitely be considered spam since this can only happen if done on purpose for spamming reasons.
From: field is the same as the To: field:
This is another common practice. Instead of entering a bogus or empty To: field, the email address in the From: field is also used in the To: field. Both email addresses are most likely fake.
Missing From: field:
Again, the reasoning behind this is to disguise the actual sender of the message.
Missing or malformed message ID:
Since the message ID includes information about where the message is coming from, it is often missing or malformed (for example, no @ sign or an empty string) in spam messages. The message ID is in the form of xxx@domain.com. The first part can be anything, and the second part is the name of the machine that assigned the ID. Although message IDs are not strictly required, one can safely assume that they would only be missing or malformed if done deliberately to disguise the source of the message.
More than 10 recipients in To: and/or Cc: fields:
Many spam messages contain more than 10 recipients in the To: and/or Cc: fields. Although this can also occur for legitimate mailings, these will tend to be of a personal nature (which you might wish to block anyway) since most professional companies do not use this method for sending newsletters or mailings.
Bcc: header exists:
In normal email messages, a Bcc: header does not exist since this is stripped from the mail.
X-UIDL header exists:
Incoming messages should not have an X-UIDL header since it is only intended for the mail server, to stop it from downloading messages more than once - for instance, when "Leave messages on server" is checked. This header would normally be stripped when the message is received. Spammers add an X-UIDL header to try to get the recipient's mail server to download multiple copies of their message and therefore increase the chance that the message will be read.
Code and space sequence exists:
Many spam mails include an identification code in the subject of the message. To hide the code from the recipient, a large number of spaces are usually placed before it. This is done so that the recipient does not notice the code, or so that the mail client does not display it before the message is opened.
Illegal HTML exists:
Some spam messages include a code for identification in the text of the message. The text is entered outside the HTML tags so as to hide the code from the recipient. There is no reason to add text outside HTML tags, so the mere presence of illegal HTML can be treated as suspicious.
Comment tags to avoid detection by email filters:
Some spammers try to circumvent content filters by placing lots of HTML comment tags within the email body text. In this way, content filters will not recognize the spam words since they are separated by comment tags. The recipient, however, will not see the comment tags since these are not displayed when viewing the message in HTML. Therefore, it is important to use an email filter that removes HTML tags before it checks the text.
HTML message without plain-text body part:
HTML messages usually include a plain-text version of the email so that recipients with email clients that cannot read HTML can still view the message in plain text. However, many spammers tend to send HTML messages without this plain-text body part, not only to save on size, but also to force recipients to read the HTML version. This enables spammers to embed links and unique IDs in the HTML code. For instance, many spammers include an image link that connects to a site when the message is opened. Since each message contains a unique ID, the spammer will know exactly which recipient has viewed the mail. In this way, spammers know how many people have viewed their message and which email addresses are still "live." When spammers know that your email address is "live," this will entice them to send you even more spam, so it is important to put a stop to these kinds of spam messages by using a filter that is capable of checking this. Newsletters also tend to send messages without a plain-text body part, so it is important to use a white-list of allowed newsletters so as not to catch any false positives.
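
The header traits listed above can be checked mechanically. The following is a minimal sketch in Python using only the standard library's email package, not a description of how Policy Patrol or any other product works. The function name header_spam_signs, the sign labels, and the example addresses are assumptions made for illustration; the threshold of 10 recipients comes from the list above.

```python
from email import message_from_string
from email.utils import getaddresses

MAX_VISIBLE_RECIPIENTS = 10  # threshold named in the list above


def _addresses(msg, field):
    """Lower-cased addresses found in every occurrence of one header field."""
    return [addr.lower() for _, addr in getaddresses(msg.get_all(field, [])) if addr]


def header_spam_signs(raw_message, local_address):
    """Return labels (hypothetical names) for the header traits the message shows."""
    msg = message_from_string(raw_message)
    to_addrs = _addresses(msg, "To")
    cc_addrs = _addresses(msg, "Cc")
    from_addrs = _addresses(msg, "From")
    message_id = (msg["Message-ID"] or "").strip()
    signs = []

    if msg["To"] is None:
        signs.append("missing-to")
    elif not msg["To"].strip():
        signs.append("empty-to")
    elif any("@" not in addr for addr in to_addrs):
        signs.append("invalid-to")
    if local_address.lower() not in to_addrs + cc_addrs:
        signs.append("recipient-not-in-to-or-cc")
    if msg["From"] is None:
        signs.append("missing-from")
    elif from_addrs and from_addrs == to_addrs:
        signs.append("from-equals-to")
    if "@" not in message_id:  # covers a missing, empty, or malformed ID
        signs.append("bad-message-id")
    if len(to_addrs) + len(cc_addrs) > MAX_VISIBLE_RECIPIENTS:
        signs.append("too-many-recipients")
    if msg["Bcc"] is not None:
        signs.append("bcc-header-present")
    if msg["X-UIDL"] is not None:
        signs.append("x-uidl-present")
    return signs
```

The function only reports what it finds; deciding which signs justify deleting, flagging, or quarantining a message is a separate policy decision. As noted above, the recipient-not-in-to-or-cc sign also fires for every legitimate Bcc: message.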

Message Content

Apart from headers, spammers tend to use certain language in their emails that organizations can use to distinguish spam messages from others. Typical words are free, limited offer, click here, act now, risk free, lose weight, earn money, get rich, and (over) use of exclamation marks and capitals in the text. Spam can be blocked by checking for words in the email body and subject, but it is important that you filter words accurately; otherwise, you might be blocking legitimate mail as well.
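
As a rough illustration of accurate word filtering, the sketch below removes HTML comments and tags from the body before looking for the phrases listed above, and matches whole words only, so that "freedom" does not count as "free." The function names and the shape of the returned dictionary are assumptions for this example. It uses simple regular expressions rather than a full HTML parser, which is enough to show the idea but not robust against every malformed message.

```python
import re

# Phrases taken from the paragraph above.
SPAM_PHRASES = (
    "free", "limited offer", "click here", "act now",
    "risk free", "lose weight", "earn money", "get rich",
)

COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def visible_text(html):
    """Drop comments entirely (rejoining split words), then replace tags with spaces."""
    without_comments = COMMENT_RE.sub("", html)
    return TAG_RE.sub(" ", without_comments)


def content_spam_signs(subject, body):
    """Report matched phrases, exclamation marks, and the share of capital letters."""
    text = f"{subject}\n{visible_text(body)}"
    normalized = " ".join(text.lower().split())
    phrases = [
        phrase for phrase in SPAM_PHRASES
        if re.search(rf"\b{re.escape(phrase)}\b", normalized)
    ]
    letters = [ch for ch in text if ch.isalpha()]
    upper_ratio = sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0
    return {
        "phrases": phrases,
        "exclamations": text.count("!"),
        "upper_ratio": upper_ratio,
    }
```

How many matched phrases, exclamation marks, or capitals should count as spam is left open here; those thresholds need tuning against your own mail to avoid blocking legitimate messages.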

How to Stop Spam

Now that we know the typical spam characteristics, how can we use these to stop spam?

First, a mail filtering mechanism must be put in place to block out most of the spam and hoaxes coming into your organization. The email filtering system must be able to analyze email characteristics, classify a mail as spam, and then delete it, flag it (for instance, add the word "SPAM" to the subject line), or quarantine it. Preferably, you will be able to set up multiple filters, each with a different level of certainty that a mail is spam. The more certain the filter is, the more drastic the action can be - for instance, deletion of the message. If the filter can only indicate the possibility of a spam message, you could flag the mail or quarantine it. In order to avoid false positives, the email filtering system should be able to exclude white-listed senders.
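
The mapping from certainty to action described above might look like the following sketch. The numeric thresholds, the 0-to-1 certainty score, and the names choose_action and flag_subject are assumptions for illustration only; the source does not prescribe specific values.

```python
from enum import Enum


class Action(Enum):
    DELIVER = "deliver"
    FLAG = "flag"
    QUARANTINE = "quarantine"
    DELETE = "delete"


def choose_action(sender, certainty, whitelist,
                  delete_at=0.95, quarantine_at=0.7, flag_at=0.4):
    """Pick an action for one message; the thresholds are illustrative, not recommended values."""
    if sender.lower() in whitelist:
        return Action.DELIVER  # white-listed senders bypass the spam filters
    if certainty >= delete_at:
        return Action.DELETE
    if certainty >= quarantine_at:
        return Action.QUARANTINE
    if certainty >= flag_at:
        return Action.FLAG
    return Action.DELIVER


def flag_subject(subject):
    """Prefix the subject with "SPAM: " unless it is already flagged."""
    return subject if subject.startswith("SPAM: ") else "SPAM: " + subject
```

Checking the white-list first means a trusted sender is never deleted or quarantined, whatever score the other filters assign.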

The email filtering system should filter out spam messages in three ways (in order of "spam certainty"):

  1. Block spam at the gateway by checking domains in real-time black hole lists. Several "black hole lists" contain IP addresses and domains from known spammers. By using these lists, you can filter out a large amount of spam. Not only will you stop a large proportion of spam messages from reaching your users, you will also save bandwidth, since the message is blocked at the gateway before it is even downloaded.
  2. Filter out spam based on email header characteristics. Most of the email header characteristics mentioned previously can safely be used to classify a mail as spam. Therefore, you could decide to delete messages that contain any or some of these spam headers. Since checking email headers is a fast process, it is good to check these before checking the actual email message content.
  3. Identify junk mail content. There will still be spam messages that get through both filters mentioned. The last way to distinguish these mails is by checking for spam message content. Depending on the words you select to filter, this can usually be quite accurate.
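
For the first step, IP-based black hole lists are conventionally queried over DNS: the octets of the connecting IP address are reversed and prepended to the list's zone name, and any answer means the address is listed. The sketch below shows that convention for IPv4 only. The zone dnsbl.example.org is a placeholder, not a real list, and the function names are assumptions; check the usage terms of whichever list you choose.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up, e.g. 192.0.2.99 -> 99.2.0.192.<zone>."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError for anything but valid IPv4
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone):
    """Return True if the list answers for this address.

    A failed lookup is treated as "not listed", which also covers
    temporary DNS errors; a production filter may want to tell these apart.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False
    return True
```

Because this check needs only the connecting IP address, it can run before the message body is transferred, which is where the bandwidth saving comes from.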

You will also need to educate your users. They must know that spam should be deleted right away and that they should never send a reply to a spam mail. This will just confirm that the email address is "live" and will enable the spammers to sell the email address to other companies for further abuse. If the mail is a hoax - for instance, a message about fake viruses, pyramid schemes promising lots of fast-earned cash, or victims asking for support by forwarding their mail - users should delete the message rather than forward it. If users are educated in this way, you will be able to limit the negative impact of any spam or hoax message that has been able to pass your filters.

How to avoid blocking legitimate emails

Since no spam filter is 100% accurate, it is important that each user check their own spam messages. This can be accomplished by forwarding messages flagged as spam to the user's junk mail folder. However, many users forget to check their junk mail folders, so legitimate messages can go unnoticed. In addition, spam tends to build up in the junk mail folder, making it nearly impossible for the user to go through all the messages if the folder has not been checked for a few days.

A preferred way is to send the user daily quarantine reports with a list of the newly quarantined messages. In this way, the user can quickly skim through the list and deliver any wrongly quarantined messages as well as update their white-list. By making use of the tiered email filtering system mentioned above, the known spam would be automatically deleted and only the suspected spam messages would be sent to the user for review in the quarantine report.
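
A daily quarantine report can be as simple as one plain-text list per user. The sketch below groups quarantined messages by recipient and builds the report text. The record fields (recipient, sender, subject) and the function name are assumptions for illustration, and actually mailing the report or releasing a message is left out.

```python
from collections import defaultdict


def build_quarantine_reports(quarantined):
    """Map each recipient to a plain-text list of their newly quarantined messages.

    `quarantined` is an iterable of dicts with the hypothetical keys
    "recipient", "sender", and "subject".
    """
    by_user = defaultdict(list)
    for item in quarantined:
        by_user[item["recipient"]].append(item)

    reports = {}
    for recipient, items in by_user.items():
        lines = [f"{len(items)} message(s) quarantined for {recipient}:"]
        for number, item in enumerate(items, start=1):
            lines.append(f"{number}. From: {item['sender']} | Subject: {item['subject']}")
        reports[recipient] = "\n".join(lines)
    return reports
```

Only suspected spam would be passed to this function; messages the filter is certain about are deleted earlier and never appear in the report.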