[Ilugc] Writeup on spam control by author of dspam

From: girishvenkatachalam@xxxxxxxxx (Girish Venkatachalam)
Date: Fri Oct 13 16:46:51 2006

Guys,

        Maybe very few have heard of dspam. It is state of the art.

        I loved this writeup by the author. It is really very nice.

        This is one of the finest articles I have ever read in my life.

        Hope you enjoy it too.

        This is yanked from

        http://www.zdziarski.com/papers/justifying.html

        Enjoy!
regards,
Girish

         Justifying Statistical Filtering
         (and Open Source Technology)

        Jonathan A. Zdziarski
        jonathan@xxxxxxxxxxxxxxxxxxx

        June 2005

        Two types of people are likely to read this. The first are systems
administrators and the second are executive managers. Most systems
administrators are smart fellows. They're prone to have a solid understanding
of statistical filtering and its value, and are usually the last people you
have to evangelise about its benefits. Executive management can go either way.
You might be a smart manager, trying to put a proposal and case study together
for deploying a form of statistical filtering (Bayesian, Markovian, etc) or you
may be silently Googling around after feeling really stupid about the current
solution you've implemented (against everybody's recommendation). In either
case, this article is designed to provide a wake-up call to anyone running a
commercial solution (or even thinking about it), and will hopefully explain why
statistical solutions are a wiser decision.

        I don't cover a whole lot of technical detail here, but I do think it's
necessary to at least define a statistical solution in contrast to the
heuristic last-gen solutions. Statistical solutions represent the latest
generation of spam filter. They have some inherent AI (Artificial Intelligence)
qualities that make them particularly good at what they do. This family of
filters includes the now-popular Bayesian filters (pronounced "bay zee in") as
well as other filters using statistical analysis to filter spam (such as
Markovian classifier CRM114 and Chi-Square Bogofilter). In a nutshell,
statistical filters are very different from other filters because they actually
read email (well, sort of)...

        Statistical Filters in a Nutshell

        When a statistical filter receives an email, it first breaks the message
down into tiny little pieces (called tokens). These can consist of words, word
pairs, short phrases, or even just a letter or two. How it does this is really
up to the filter author. Once the message is broken up into tokens, the
disposition of each token is examined. Historically, if the word 'Viagra'
showed up in spam most of the time, then that token will stick out as a guilty
marker. Likewise, if the word 'eBay' has been seen primarily in legitimate
messages, it will become a good marker of innocent mail (ham). There is a
specific "probability" that determines the final value of the marker. For
example, if the word 'pizza' has shown up in spam exactly as often as in
legitimate mail, then the word 'pizza' has a 50% chance of being spam (or not
spam). The filter crunches these numbers and can determine whether or not it
believes the message is spam by calculating a statistical probability. For
example, if the message has a 92% likelihood of being spam, then a Bayesian
filter will indeed mark such a message as spam.
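
        As a rough illustration only (this is not taken from DSPAM or any other
real filter), the tokenize-count-combine idea described above might be sketched
in Python as follows. The class name ToyFilter, the word-only tokenizer, the
0.01/0.99 clamping and the naive product used to combine token values are all
assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Break a message into lowercase word tokens (one of many possible schemes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


class ToyFilter:
    """Hypothetical toy filter; names and constants are illustrative assumptions."""

    def __init__(self):
        self.spam_counts = Counter()  # token -> number of spam messages containing it
        self.ham_counts = Counter()   # token -> number of ham messages containing it
        self.n_spam = 0
        self.n_ham = 0

    def train(self, text, is_spam):
        tokens = set(tokenize(text))  # count each token once per message
        if is_spam:
            self.spam_counts.update(tokens)
            self.n_spam += 1
        else:
            self.ham_counts.update(tokens)
            self.n_ham += 1

    def token_probability(self, token):
        """Chance that a message containing this token is spam, from past mail."""
        s = self.spam_counts[token] / max(self.n_spam, 1)
        h = self.ham_counts[token] / max(self.n_ham, 1)
        if s + h == 0:
            return 0.5  # never seen before: treat as neutral (assumption)
        # Clamp so a single token can never be absolutely certain (assumption).
        return min(max(s / (s + h), 0.01), 0.99)

    def spam_probability(self, text):
        """Combine token values with a naive Bayesian product.

        A toy: very long messages could underflow, which real filters avoid
        by other means.
        """
        spam_product = 1.0
        ham_product = 1.0
        for token in set(tokenize(text)):
            p = self.token_probability(token)
            spam_product *= p
            ham_product *= 1.0 - p
        return spam_product / (spam_product + ham_product)


toy = ToyFilter()
toy.train("Buy Viagra now", is_spam=True)
toy.train("Cheap Viagra and pizza deals", is_spam=True)
toy.train("Your eBay auction has ended", is_spam=False)
toy.train("Pizza tonight after the eBay pickup?", is_spam=False)

for message in ("Viagra pizza", "eBay pizza"):
    print(f"{message!r}: {toy.spam_probability(message):.0%} likely spam")
```

In this toy history 'pizza' appears equally often in spam and ham, so it carries
a neutral 50% and the verdict is decided by the other tokens.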

        92% is a lot easier to understand for most of us than a score of 3.72.
Most people haven't really got any idea what 3.72 means, including the filter
authors. So not only do statistical filters rely primarily on mathematics to
filter spam, but they also speak the same language most of us do.

        Technical Excellence

        This class of filters is commonly hailed for its extremely high
levels of accuracy. Why are they so accurate, you ask? Well, for one, they are
analyzing each user's email individually, so they're able to adapt to whatever
email behavior the user specifically exhibits. If the user is into online
shopping, the filter will be less likely to make the mistake of marking some of
their legitimate shopping mail as spam. Another reason they're so accurate is
because they learn and adapt very quickly when they make a mistake, and can
conform themselves to users like a glove. If the filter makes a mistake, it
begins to (with temperament) adjust its internal clockwork so that it won't
make the same mistake in the future (this is what makes the filter learn).
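
        To make the learning step concrete, here is a minimal Python sketch of
one way a correction could be applied. It assumes the filter recorded the
message under its own (wrong) verdict when it was delivered, so a correction
moves the message's tokens from one set of counts to the other. The TokenStore
class and its method names are hypothetical, not any real filter's API.

```python
from collections import Counter


class TokenStore:
    """Hypothetical per-user token history; an illustrative sketch only."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()
        self.n_spam = 0
        self.n_ham = 0

    def learn(self, tokens, is_spam):
        """Record a message's tokens under the given label."""
        if is_spam:
            self.spam.update(set(tokens))
            self.n_spam += 1
        else:
            self.ham.update(set(tokens))
            self.n_ham += 1

    def correct(self, tokens, true_is_spam):
        """Undo a wrong verdict, then relearn the message under the right label.

        Assumes the message was previously learned under the opposite label.
        """
        wrong = self.ham if true_is_spam else self.spam
        for token in set(tokens):
            if wrong[token] > 0:  # never let a count drop below zero
                wrong[token] -= 1
        if true_is_spam:
            self.n_ham = max(self.n_ham - 1, 0)
        else:
            self.n_spam = max(self.n_spam - 1, 0)
        self.learn(tokens, true_is_spam)

    def spam_ratio(self, token):
        """Share of this token's history that is spam (0.5 if never seen)."""
        s = self.spam[token] / max(self.n_spam, 1)
        h = self.ham[token] / max(self.n_ham, 1)
        return 0.5 if s + h == 0 else s / (s + h)


store = TokenStore()
store.learn(["meeting", "notes"], is_spam=False)
store.learn(["cheap", "pills"], is_spam=False)   # the filter's mistake
before = store.spam_ratio("pills")
store.correct(["cheap", "pills"], true_is_spam=True)  # the user reports it
after = store.spam_ratio("pills")
print(f"'pills' spam ratio before: {before}, after: {after}")
```

A real filter would adjust more gradually over a much larger history; the point
here is only that a single report from the user changes what the filter
believes about those tokens.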

        In contrast, the "other" type of well known filters are called
heuristic filters. Instead of making their own decisions about what spam is,
they rely on a programmer to write a set of detection rules. Just like most
popular virus scanners today, their "spam definitions" come to be out of date
very quickly, as spam evolves. Because heuristic filters have no learning
mechanism, they rely on frequent updates. Another reason heuristic filters are
so terrible at what they do is because these rules are written for the entire
world to use - and the entire world gets a whole lot of different email. If
your email doesn't fit into what the programmer considers "the norm", you're
likely to have a bunch of mail erroneously marked as spam.

        SpamAssassin, for example, was (and still is for the most part) one of
these types of filters, however they've recently added a statistical
"component" to the filter. So now it's not a heuristic filter and it's not a
statistical filter - it's more like a gas/electric hybrid. If you're one of the
27 people who drive a gas/electric hybrid, you probably realize it's not
particularly powerful. This also represents how most appliances are built today
- several layers of heuristic tests and then stick a Bayesian element in at the
bottom. Hybrid filters don't seem to be quite as powerful either, as they're
more of a hodgepodge of tools thrown together than any type of technologically
meaningful solution. In fact, due to what I call heuristic programming, any
statistical components in the solution can end up acting dumbed down as a
result of being told what ham and spam is by lesser-accurate (namely heuristic)
parts of the filter. It seems rather asinine to use the less accurate portion
of the filter to train the more accurate portion, but that's how most hybrid
filters are concocted today. Statistical filters will learn whatever you tell
them to, so naturally when they're trained by something dumber than a human,
they're going to react dumber than a human. I've talked to many people today
using commercial appliances based on either SpamAssassin or some other hybrid
model, and it's quite scary to hear that they're still deleting spam out of
their inbox by the dozens.

        Statistical filtering has now been mainstream for about three years,
but despite its technical excellence, most appliance manufacturers are
outfitting their boxes with the older style filters and even though the box is
technically "new", this old technology is winding up on many networks. We'll
get into that shortly.

        Open-Source Roots

        One thing worth noting about spam filtering in general, and statistical
filtering in particular, is that open source tools really have obtained the
upper hand in this venture. Statistical filtering is a technology
developed by the open-source community and copied by the commercial industry,
which is quite the stumbling block to companies like Microsoft, who have
frequently positioned the open-source community to appear as a group of pirates
(aargh!) who carbon-copy technology. Not only have our beloved open source
solutions proven to provide some of the best results in the industry, but
they're also free.

        In spite of this, the Internet is having a huge junk sale on anti-spam
products. There are many corporations pushing anti-spam solutions, some that
are even claiming to have some creative ownership in spam filtering technology.
In fact, hundreds of millions of dollars are being spent every year to purchase
appliances that deliver a tenth of the results that the open source community
is giving away for free. Some say this is due to support, or the need for rapid
deployment, but in reality it's because the majority of commercial appliance
manufacturers find statistical filtering so highly accurate that it can do its
job without them. Regardless of whether it's Bayesian, Markovian, Chi-Square,
or whatever - manufacturers have had to face the decision of either losing
money (on nightly energizer updates and the like) or crippling their own
filtering solution to require such constant maintenance that consumers will
subscribe to all possible services for fear of their filter degenerating -
which it will.

        Much to the chagrin of the filter manufacturers, all reasonably written
open source statistical filters actually get better with time (and on their
own), like a fine wine. Imagine a world where there are no rule sets to update,
no whitelists to maintain, and only minor tweaking by a sysadmin occasionally
to blow the dust out of the fans. You've just imagined the next generation spam
filter. In fact, many of the tests out there showing statistical filtering as
superior don't even know the half of it - there's just nothing like a nice
seasoned database that's been learning for a year or more. And sadly, this well
oiled machine just doesn't fit into the monthly recurring business models of
most manufacturers. The best you can hope for in many commercial solutions is a
Bayesian "element". This really is more of marketing buzz than anything,
however, as you'll find it buried deep below several more "heuristic" layers of
analysis - all of which dumb down the true learning potential of any
statistical elements in the box.

        The sad truth of the matter is that most people have a knee-jerk
reaction to spend money in order to own a pretty box. Depending on the color
scheme of the server room, you've got the choice of aqua blue, earth-tone
green, or luscious yellow. It sure looks good bolted into that rack in the
server room, and the fact that it cost $50,000 gives CEOs bragging rights with
equally vendor-conditioned customers. Much like other well-marketed solutions,
many of the appliances out there deliver in the board room better than they
deliver in the server room. I initially thought that after the first dot-bomb,
corporate America would begin to wake up and see through all the marketing
glitz poured over what are, for the most part, substandard products. In spite
of the newfound financial sobriety in most technology companies today, many
are still falling victim to making decisions based on buzz rather than actual
technical specification. And since customers are sensitive to buzz, it's
sometimes better to actually buy a well marketed product than one that performs
well. This leads me to one thing I've learned over the past ten years of
working startups: many corporate executives aren't really interested in
technology as much as image.

        This rant does have a point, and I do want to address some of the many
reasons large corporations should be considering open source solutions -
especially the many superior open-source solutions that are available for
eliminating spam and light years ahead of commercial solutions. I'm by no means
against free enterprise - I in fact would love to see a few anti-spam startups
get out there and market some truly statistical, adaptive products that have a
chance at solving the spam problem (Death2Spam is one such product, I hear).
What I do take issue with is that a legitimate company ought to have a viable
product. How you define a viable product is open to some creative license, but
certainly part of it must mean that it's better than something you can get for
free.

        If you're a frustrated employee at a large corporation or Internet
service provider and can't get your hands around why others don't get the value
of open source, you're not alone. Your managers will probably be hitting you
with some questions you may not have even thought about. It seems so obvious to
the thinking population why a better functioning open source solution should be
preferred over an inferior commercial product, that sometimes we don't even
think about the details. The rest of this article is dedicated to explaining
why open source solutions especially make sense in the setting of spam
filtering.

        Cheap Pickup Lines

        We've been fed some cheap pickup lines, and most of us have fallen for
them at one point. But at the end of the day, free kittens and $25 kittens
share two things in common: they both meow and poop. The commercial solution
might be justified not by technical specifications, but by ROI (as I said,
corporate America isn't interested in technology).

        The rational question for the technologically challenged, and where
some of the ammunition resides in justifying open source, is in return on
investment. Does filtering have a better return on investment than managing
spam? Does a commercial product have a better ROI than an open-source solution?
Some of the common questions non-technical leadership will likely have are
provided in the next section and will be crucial for pushing some CEOs over the
hump of needing an aesthetically pleasing case. We'll also dispel some of the
marketing myths and choice pickup lines you'll usually hear from commercial
software gigolos to sell you their product.

        Marketing Pickup Line #1: You need support

        Well yes, you do need support. What you don't need is planned support.
The difference is that support involves assistance with reasonably complex
tasks while planned support involves making the product more difficult than
necessary, to facilitate a support contract.

        Generally speaking, companies that tout support as a "value-add" are
doing so because their product has been designed for difficulty in maintenance,
so you'll pay them for the ability to pick up the phone on occasion - most
likely about a problem they created themselves by designing a poor product.
These support contracts provide a good bit of residual income for large
corporations by paying annually just to be on "standby". You need support much
like you need aerosol spray - it comes in handy sometimes, but if you stop
hanging out by the bathroom you won't need it as much. Not only do many
companies provide poor support, but they provide it overseas in India - which
is where you'll likely be calling when you have a problem.

        In the open-source community, things are very different. The quality of
the product is more important than generating a revenue stream from support, as
the philosophy under which the project was originally started is likely to be
more philanthropic in nature. And since open source projects are started with
the expectation that other people will be using them (on a low level), they're
usually designed to be understandable by other knowledgeable administrators.
Any open-source product worth its time has been both well-written and
well-documented to make it fairly easy for a sysadmin to implement and use on
their network. If the admin gets stumped, the open-source community supports
the growth of two primary areas of support:

        Free Support

        The open source community likes things free. Large community support
forums have been home-grown for many open-source projects. This means
implementors will be able to receive free support from experts in the field who
are using the software hands-on - actual technical people who speak your native
language and have actually seen the product they're supporting. In contrast,
the commercial world leaves the poor systems administrator having to go through
their sales contacts, a sales engineer, and potentially two or three other
people before finding the answers they need - all while paying for their time.
All the money that was sunk into a commercial support contract will start to
seem like an awful lot as they hold the line for overrated support, that will
most likely fail to provide any real answers to their problems (judging by the
poor quality of customer service in today's technology marketplace). In the
time it takes to reach some technical specialists about products, they
could have already had an answer from a community support forum or chatroom.

        Support Contracts

        As the popularity of a project grows, so does the number of people
looking to earn a living supporting it. Open source developers have come to
realize that corporations require support, and many have accommodated this
requirement by providing paid support options. These are sometimes available
directly through the spam filter author, or by others closely related to the
project. This does several things - it promotes healthy competition among these
groups, which helps keep support costs down, and it also means that if you
stump your support group with a project, there are many other options
available, as opposed to commercial support which lends itself to the
cookie-cutter approach. Diverse support also means that there is a stronger
likelihood you will find a group specializing in your specific area of interest
(such as implementing product X on Solaris with an Oracle backend).

        A Support Monoculture

        Commercial software creates somewhat of a paid support monoculture.
All the support you're going to receive generally emanates directly from the
software manufacturer or, if they are large enough, from a value-added reseller
who has trained their staff with the same learning materials. In other words,
if someone can't solve your problem in the commercial world, nobody can. In the
open-source world, there's a very diverse set of paid support options available
from professional services consultants who specialize in open source, and are
all based on different learning experiences.

        There are bright, hard-working individuals in the open source industry
who are so hands-on that they can probably solve your problem within a fraction
of the time it would take a group of mediocre corporate support specialists.
Bugs get fixed quicker, people respond faster, and all this at rock bottom
prices. If you're thinking about the need for support, hook up with an open
source support provider as you prepare to deploy the project on your network.
If the project is any good, you shouldn't need nearly as much support as the
marketing executives of the corporate world would have you believe.

        Marketing Pickup Line #2: You need training

        If vendors haven't managed to convince you that open source is a
failure because there is "no support", the next thing they'll try to sell you
on is that there is "no training" available. Training options in the open
source world are very similar to support options. Professional consultants can
provide whatever training is needed for whatever projects it is needed for.
What's more important to consider when you're talking about such a tool as spam
filters is why you need much training in the first place. An anti-spam solution
should be simple to use - otherwise your customers won't use it, and you just
wasted all your money on a commercial product - with great training and support
- that nobody's interested in using. If you have to train "Grandma" how to use
your spam solution, you're doing two things wrong. First, you're implementing a
solution that's too complex which will not be used by many customers, and more
importantly you're kissing any chance of a return on investment goodbye by
spending all that extra money to provide technical support to the ones who call
in with questions.

        If you work for the average American corporation, you'll find it
difficult to capitalize on technical support because customers demand it free.
In this case, you're probably already looking for ways to make support less
expensive. Your call centers may even be outsourced overseas, or filled with
bottom-wage employees who have mastered the art of getting people off the phone
without actually helping them. Every dime you spend teaching your users how to
use the software is money you would have otherwise saved. If your solution
requires a lot of end-user training, it may possibly end up costing more than
managing the spam problem in the first place.

        If a software vendor is touting training as a key selling point, this
only means their software is so complex that you need training to use it. Next
time one of their vendors tries to schmooze you and raises this point, ask them
why you need training in the first place, if their product is so easy-to-use.
If it's not easy-to-use, why would you want one?

        A majority of open source anti-spam solutions have been designed to be
very simple to use, even for Grandma. If a spam makes it past the filter, the
user can forward or bounce the message in, click a link, or perform some other
trivial task to train the filter. This also provides a sense of participation,
which is something a lot of users want in today's world of privacy rights and
service control. Not only do most solutions provide an easy-to-use interface
like this, but they've been designed flexible enough for systems administrators
to implement custom installs. Proprietary systems running IMAP or web-based
email can easily be configured with a "Spam Folder" or some other type of
device to make managing spam brain-dead easy for the user.

        Marketing Pickup Line #3: You DON'T Need "Training"

        The other extreme in touting training is that you don't need any
training; that you can just plug the product in and make it work without the
user needing to do anything. Steer very clear from these products - they are
not true learning products! In many cases, static "out-of-the-box" products
push the responsibility of training to the systems administrator or charge an
annual subscription fee to keep the filters updated. Not only does this cost
more money, but it provides very poor filtering with a high risk of errors,
because all the filtering is centered around what someone else (the systems
administrator or the software manufacturer) thinks about a user's mail, rather
than what the user thinks about their own mail. If a user can't teach the
filter their specific email behavior, the filter won't provide an acceptable
level of results for the money. The ability for users to provide feedback into
the system is important not only in training the software, but it gives the
user a sense of satisfaction that they're able to do something about spam -
rather than call the abuse department to complain.

        Installing software on your network that's capable of only 95%
filtering accuracy (and provides no feedback mechanism) is going to increase
the likelihood that a customer will call in to tech support. Knowing such a
system exists on the network will inevitably make users more critical of their
inbox. Should they receive a single spam, many less savvy users think the
filter isn't working correctly, and will call in to be a "good Samaritan" and
let you know that they received a spam - they'll most likely want to forward it
to an abuse address somewhere where more network traffic will be generated, and
a human will have to respond to it. Add to this the livid customers who call to
complain about lost email or false positives - users who are waiting on an
important email, and call in because they believe the filter ate it, or find an
email erroneously marked as spam that they feel is an inconvenience, and
therefore want to make it tech support's inconvenience.

        Lack of a feedback loop does more than hurt accuracy. It costs money.
Be very wary of products touting the ability to perform without user
participation.

        Marketing Pickup Line #4: Commercial solutions are more scalable

        Some commercial applications are scalable, but more aren't. In most
cases, commercial solutions are bloated with non-statistical components that
aren't necessary to good filtering. A lot of individuals buy these tools
because they don't want to train all of their users' filters - a justifiable
need. It's important to realize, however, that there are alternatives to the
complete training of a statistical solution such as global seeding, merged
groups, and other approaches that provide almost out-of-the-box filtering with
little effort. Your mileage may vary in the scalability of open source
projects, but there are at least a few whose execution time is measured in
hundredths of a second.

        The DSPAM agent runs with a very low execution time: between 0.01s and
0.03s for classification and between 0.03s and 0.10s for training, in actual
real time and on average desktop hardware. The CRM114 discriminator is similarly
fast, taking between 0.05s and 0.10s for classification. Plenty of
open source tools outperform even the most expensive commercial products on the
market. Not to say a commercial product isn't capable of performing well, but
they are certainly not more scalable than what's freely available. Many open
source projects have been deployed on systems with several hundred thousand
users - there's no justification to suggest that a commercial application could
do any better.
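
        To put those timings in perspective, the following back-of-the-envelope
Python sketch converts a per-message classification time into messages per day.
It assumes a single process handling messages strictly one after another and
ignores delivery, disk and network overhead, so treat the results as an upper
bound for illustration rather than a benchmark. The function name is made up
for the example.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # milliseconds in one day


def messages_per_day(ms_per_message):
    """Messages one sequential process could classify per day (idealized)."""
    return MS_PER_DAY // ms_per_message


# Classification times quoted in the text, converted to milliseconds.
quoted_timings_ms = {
    "DSPAM, fastest (0.01s)": 10,
    "DSPAM, slowest (0.03s)": 30,
    "CRM114, fastest (0.05s)": 50,
    "CRM114, slowest (0.10s)": 100,
}

for label, ms in quoted_timings_ms.items():
    print(f"{label}: about {messages_per_day(ms):,} messages/day")
```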

        In fact, when corporations begin to scale to this many users, there is
usually a dramatic cost difference between commercial and open-source
implementations. Even a scalable commercial implementation will generally cost
more to implement in licensing and support contracts than an open-source
solution.

        Marketing Pickup Line #5: Commercial solutions are more accurate

        Trust no-one. This is quite the contrary for this specific area of
technology. In the setting of spam filtering, a majority of commercial
solutions today are advertising levels of accuracy from last-generation
filtering - somewhere between 95% and 99%. This means roughly between one and
five errors per hundred emails! There are a few that tout five-nines accuracy
but many of them are just flat out lying, or require your users to manage
whitelists or challenge/response mechanisms. Users of one particular filter
making this claim (rhymes with blightmail) have reported filtering rates
falling as low as the mid-80s without whitelisting. Well-written
open-source filters have achieved rates of 99.5% to 99.9% and beyond with
little effort. This means between one and five errors per thousand emails.
That's right, they're more than ten times as accurate as commercial solutions!
A few open source filters have even managed to achieve close to five-nines
accuracy, with the highest peak recorded at 99.991% using purely statistical
methods of filtering.

        The problem with the industry today is that these numbers are getting
thrown around enough to confuse unsuspecting managers who flunked math in high
school. Is there really a difference between 99% and 99.9%? A big one! (10/1000
spams vs. 1/1000). Unfortunately, people seem to forget how to do math rather
quickly when in the presence of a pretty server.
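
        For anyone who wants to check the arithmetic, a short Python sketch
follows. It uses exact fractions so that rounding doesn't blur the comparison;
the function name is invented for the example, and the accuracy figures are
simply the ones quoted in this article.

```python
from fractions import Fraction


def expected_errors(accuracy_percent, messages):
    """Expected number of misfiled messages out of `messages` at a given accuracy.

    `accuracy_percent` is passed as a string (e.g. "99.9") so it is parsed exactly.
    """
    error_rate = (100 - Fraction(accuracy_percent)) / 100
    return error_rate * messages


for accuracy in ("95", "99", "99.5", "99.9", "99.991"):
    per_thousand = expected_errors(accuracy, 1000)
    print(f"{accuracy}% accurate: {float(per_thousand):g} errors per 1,000 messages")
```

Going from 99% to 99.9% takes the error count from ten per thousand messages
down to one, which is the tenfold difference discussed above.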

        If you're measuring ROI, accuracy is crucial. Inaccuracy costs a
company money. Money translates to bandwidth, server resources and people time
to answer complaints or manage spam. And if your filter performs too poorly,
filtering itself might be so useless that you still have to delete mail in
chunks. There's a significant loss of productivity in the users who have to
delete the spam (something that's important if you're paying these people to do
something). An error prone solution will cost:

* Money for the initial purchase and installation of the equipment
* The additional bandwidth to cover tens of thousands of extra spams
* The additional server resources to cover tens of thousands of extra messages
* Several hours of total productivity to delete spam
* Loss of productivity for loss of legitimate mail
* Additional salaries paid to cover increasing tech support expenses

        Think about the total amount of money spent on resources, phone calls,
and time and you'll see that the price in paying for an error-prone solution
only capable of achieving 99% is really far more than the sticker price.
Inaccuracy costs more than accuracy. Solutions are available which cost
considerably less to implement and provide higher levels of accuracy, or rather
lower levels of inaccuracy.

        The Death of Old Technology

        As I mentioned, most commercial anti-spam solution manufacturers are
still using the old heuristic approach to filtering in order to generate
monthly revenue, and that's unfortunately giving the spam filtering space the
image of a snake-oil salesman. Spamming on behalf of anti-spam solutions
doesn't help perception either. As these commercial products age, the annual
subscription keeps their customers paying for what would otherwise become an
entirely useless product. Most companies are willing to pay an annual
subscription to maintain the status quo of the technology industry - we all pay
support and maintenance. People don't expect anything more because most
applications are static and require babysitting.

        As we move into the world of AI, we've opened up a very dangerous can
of worms to this standardized way of doing business - or a very refreshing one.
Our AI tools are capable of learning how to improve, and actually do their job
better as the software becomes older. The danger to monthly residual is that
these tools could sit on a network for five years collecting dust - and still
perform better than the latest model heuristic filters out-of-the-box within a
few weeks' time. This is something to be very concerned about if you're selling
the obsolete technology most manufacturers are selling today, but it's also
very comforting for the few who understand the vision behind this AI technology
and are forming business models around it.

        Preventing a Monoculture

                                                                    There are a
lot of different filtering appliances out there, and this is fortunate in that
it helps to prevent a mono culture. That's not to say that any of these
companies wouldn't like to be on top. The problem many companies are challenged
with, and beginning to encounter, is that because their appliances are fairly
static, spammers are becoming some of their customers, and using their machines
to test how well their messages will get around the filter. It's pretty easy to
change a spam around if you're able to run it through the target filter every
time until it finally gets accepted, and with a dumbed-down Bayesian element
this is much easier. The adaptive learning provided by true statistical filters
is the only solution to this, and makes this practice impossible. It's
extremely difficult for a spammer to construct a message that will circumvent a
large number of Bayesian filters. This is because each user's filter is based
on the user's own personalized behavior. There are plenty of dirty little
tricks spammers have tried to use in the past to circumvent filters and they
only appear to work on these older heuristic code bases - our adaptive learning
filters are seeing right through their tricks. On top of this, open source
filters have the added advantage of being successful despite being completely
exposed. With their full source code available, you can be certain that any
spammer who wants to circumvent an open source filter is looking for loopholes
in the code. When loopholes are found, they're quickly patched. Because open
source projects are commonly community-based efforts, they have the advantage
of a large, multitalented development group that is motivated by creativity
rather than salary.
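
        To make the point about personalized filters concrete, here is a
minimal sketch in Python. It is not dspam's algorithm, only a plain naive Bayes
token filter with add-one smoothing, and the names UserFilter, train,
spam_score and is_spam are invented for this illustration. It shows two things
under those assumptions: the same message can score as spam for one user and as
legitimate mail for another, and a single correction from the user changes how
the filter treats similar messages afterwards.

```python
import math
from collections import Counter


class UserFilter:
    """Hypothetical per-user filter: naive Bayes over whitespace tokens."""

    def __init__(self):
        # Number of training messages per class that contained each token.
        self.counts = {"spam": Counter(), "ham": Counter()}
        # Number of training messages seen per class.
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record one message the user has classified ('spam' or 'ham')."""
        self.messages[label] += 1
        self.counts[label].update(set(text.lower().split()))

    def spam_score(self, text):
        """Log-odds that the message is spam; above zero leans spam."""
        n_spam, n_ham = self.messages["spam"], self.messages["ham"]
        score = math.log((n_spam + 1) / (n_ham + 1))
        for token in set(text.lower().split()):
            p_spam = (self.counts["spam"][token] + 1) / (n_spam + 2)
            p_ham = (self.counts["ham"][token] + 1) / (n_ham + 2)
            score += math.log(p_spam / p_ham)
        return score

    def is_spam(self, text):
        return self.spam_score(text) > 0


# Alice works with mortgages, so the word is ordinary in her mail.
alice = UserFilter()
alice.train("mortgage refinance rates for the client meeting", "ham")
alice.train("cheap pills online", "spam")

# Bob only ever sees the word in junk mail.
bob = UserFilter()
bob.train("refinance your mortgage today", "spam")
bob.train("lunch meeting on friday", "ham")

probe = "refinance your mortgage"
verdicts = (alice.is_spam(probe), bob.is_spam(probe))  # differs per user

# A spam that slips past Bob is corrected once, and the filter adapts.
missed = "lunch meeting deals"
before = bob.is_spam(missed)
bob.train(missed, "spam")
after = bob.is_spam(missed)
```

        A spammer who tunes a message against one copy of such a filter learns
nothing reliable about the next user's copy, because the token statistics
differ from mailbox to mailbox.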

        Advanced algorithms such as Bayesian Noise Reduction make it
computationally infeasible to perform many of the more advanced attacks. Spam
is ever-changing, and that yearly check sent in to the spam filter manufacturer
is only a leash. Statistical filtering gets better the more you use it. The
biggest fear of these present-day filter manufacturers is that word will
someday catch on that there are other (free) solutions out there that get
better with time. It's easy to see why so many companies are using buzzwords
and avoiding statistical filtering: they would lose their leash.

        Maintenance

        Finally, maintenance stands the chance of hammering the final nail into
the coffin of heuristic filtering. Maintaining a statistical filter is very
different from maintaining the heuristic filters of yesterday. Heuristic
filters demand the attention of the systems administrators or a monthly
subscription for automatic updates (spam detection rules coming from complete
strangers); new rules must be installed or transmitted frequently to counter
the dynamic nature of spam.
This is ideal if you plan on having a dedicated anti-spam administrator (or a
group of them), but most companies don't want to spend an extra hundred
thousand dollars on additional employees just to support the so-called
"solution" that can't really perform very well on its own anyway. Why have one
man doing 100% of the work when you can have all of your customers doing 1/10th
of a percent of the work? Not to say that each user must train from scratch, as
many statistical tools allow for a global database to start all their users off
initially, but forwarding an occasional spam into the system is certainly not
what you want to be paying your employees to do. Distributing the
responsibility out to the end-user does two things. First, it frees up the
staff to work on other projects (nobody wants a dedicated spam guy, especially
the guy who's appointed as the dedicated spam guy), and it prevents a total
stranger from making decisions about what your filter thinks is spam. Second,
it makes each user responsible for their own filtering. Users who don't want to
do their own filtering simply don't participate. Users who diligently mark
spams and correct errors are rewarded with more accurate filtering. This will
please the many users who have censorship concerns by allowing them to censor
themselves. Users want to feel like part of the solution; they have an inner
urge to forward the spams they receive somewhere. Why not take advantage of
this and allow them to participate? For large implementations where this is not
possible, the global database concept works - set up a global database to
provide out-of-the-box filtering, and let your users customize the filter's
behavior by occasional training.
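
        The global database idea can be sketched in a few lines of Python. This
is an illustration under stated assumptions, not the way dspam or any other
particular filter implements it: the names TokenStats and spam_score are
hypothetical, and merging the shared and per-user counts by simple addition is
the simplest possible choice. A brand-new user with no training data is
filtered entirely by the shared seed, and one correction is enough to start
pulling the verdict toward that user's own mail.

```python
import math
from collections import Counter


class TokenStats:
    """Hypothetical token database; used for both the global and user data."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.messages[label] += 1
        self.counts[label].update(set(text.lower().split()))


def spam_score(text, user, shared):
    """Log-odds of spam using the user's counts added to the shared seed."""
    n_spam = user.messages["spam"] + shared.messages["spam"]
    n_ham = user.messages["ham"] + shared.messages["ham"]
    score = math.log((n_spam + 1) / (n_ham + 1))
    for token in set(text.lower().split()):
        spam_hits = user.counts["spam"][token] + shared.counts["spam"][token]
        ham_hits = user.counts["ham"][token] + shared.counts["ham"][token]
        p_spam = (spam_hits + 1) / (n_spam + 2)
        p_ham = (ham_hits + 1) / (n_ham + 2)
        score += math.log(p_spam / p_ham)
    return score


# The administrator seeds one shared database for everybody.
shared = TokenStats()
shared.train("cheap pills online", "spam")
shared.train("refinance your mortgage today", "spam")
shared.train("meeting agenda for monday", "ham")
shared.train("lunch on friday", "ham")

newcomer = TokenStats()       # no personal training yet
loan_officer = TokenStats()   # corrects one false positive
loan_officer.train("mortgage refinance paperwork for client", "ham")

probe = "refinance mortgage paperwork"
newcomer_score = spam_score(probe, newcomer, shared)      # seed only
officer_score = spam_score(probe, loan_officer, shared)   # seed plus user
```

        With a large shared database a real deployment would probably want to
weight the user's own corrections more heavily than the seed, so that personal
training is not drowned out; that refinement is left out here.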

        Final Thoughts

        The consistent fear manufacturers have is that AI makes decisions for
people, so that you don't need people in the loop. In reality, AI does make
decisions for people, but not the important ones. Why should people have to
devote their time
to determining if messages are spam? Why should support groups be necessary to
answer first-level requests for information? A lot of companies are scared of
AI, and with good reason. The companies that manufacture tools that don't
adapt or help learn how to make decisions will eventually be left by the
wayside if they don't change. There will be a time of adaptation to this new
technology, but like setting fire to a field, what gets burned away will be
replaced with something much better. AI is here to stay, and has been mastered
by the open source community.
