Wednesday, March 21, 2007

Click Fraud: New Definition and Methodology to Assess Generic Traffic Quality

1. What is click fraud?

Click fraud is usually defined as the act of purposely clicking on ads in pay-per-click programs with no interest in the target web site. Two types of fraud are commonly cited:

  • An advertiser clicking on competitor ads to deplete their ad spend budgets, with fraud frequently taking place early in the morning and through multiple distribution partners: AOL, Ask.com, MSN, Google, Yahoo, etc.
  • A malicious distribution partner trying to increase its income, using clickbots or paid human beings to generate traffic that looks like genuine clicks.

While these are two important sources of non-converting traffic, there are many other sources of poor traffic. Some of them are sometimes referred to as invalid clicks rather than click fraud, but from the advertiser or publisher viewpoint, there is no difference. In this paper, we consider all types of non-billable or partially billable traffic, whether or not it is the result of fraud, whether or not there is intent to defraud, and whether or not there is a financial incentive to generate the traffic in question. These sources of undesirable traffic include:

  • Accidental fraud: a home-made robot not designed for click fraud purposes, running loose, out of control, and clicking on every link, possibly because of a design flaw. An example is a robot run by spammers harvesting email addresses: it was never designed for click fraud, but nevertheless ended up costing advertisers money.
  • Political activists: people with no financial incentive, motivated instead by hate. This kind of clicking activity has been found directed against companies recruiting people for class action lawsuits, and results in artificial clicks and bogus conversions. It is a pernicious kind of click fraud because the victim thinks its PPC campaigns generate many leads, while in reality most of these leads (email addresses) are bogus.
  • Disgruntled individuals: this could be an employee working for a PPC advertiser or a search engine who was recently fired, or a publisher who believes it has been unjustifiably banned.
  • Unethical players in the PPC community: small search engines trying to make a competitor look bad by generating unqualified clicks, or schemes amounting to shareholder fraud.
  • Organized criminals: spammers and other internet pirates who already run bots and viruses, and have found that their tools can be programmed to generate click fraud. Terrorism funding falls in this category and is investigated by both the FBI and the SEC.
  • Hackers: many people now have access to home-made web robots (the source code in Perl or Java is available for free). While it is easy to fabricate traffic with a robot, it is more complicated to emulate legitimate traffic, as it requires spoofing thousands of ordinary IP addresses – not something any amateur can do well. Some individuals might see this as a challenge and generate high-quality emulated traffic just for the sake of it, with no financial incentive.
  • Traditional media companies losing market share to PPC advertising, which have an incentive to contribute to click fraud.

In this paper, we will be even more general by encompassing other sources of problems not usually labeled as click fraud, but sometimes referred to as invalid, non-billable, or low-quality clicks. These include:

  • Impression fraud: impressions and clicks should always be considered jointly, not separately. This can be an issue for search engines, as they need to join very large databases and match users with both impressions and clicks. In some schemes, fraudulent impressions are generated to make a competitor’s CTR look low. Advanced schemes use good proxy servers (e.g. AOL) to hide the activity. When the CTR drops low enough, the competitor’s ad is no longer displayed. This scheme is usually associated with self-clicking, a practice where an advertiser clicks on its own ads through proxy servers to improve its ranking, and thus its position in search result pages. This scheme targets both paid and organic traffic.
  • Multiple clicks: while multiple clicks are not necessarily fraudulent, they end up either (i) costing advertisers a lot of money when they are billed at the full price or (ii) costing publishers and search engines a lot of money if only the first click is charged for. Another issue is how to accurately determine that two clicks – say five minutes apart – are attached to the same user (see the sketch after this list).
  • Fictitious fraud: clicks that appear fraudulent but are never charged for. These clicks can be made up by unethical click fraud companies, or they can be the result of testing campaigns, in which case we call them click noise. A typical example is Googlebot: while Google never charges for clicks originating from its Googlebot robot, other search engines that do not have the most up-to-date list of Googlebot IP addresses might accidentally charge for these clicks. Another example, further discussed in this paper, is fictitious clicks; we explain what they are and how they can be detected.
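
To illustrate the multiple-click issue mentioned in the list above, here is a minimal sketch that groups clicks by an apparent user (keyed on IP address and user agent) and flags repeats within a timeout window. The key, the 30-minute timeout and the field names are illustrative assumptions, not a description of any search engine's actual logic.

    TIMEOUT = 30 * 60  # seconds; an arbitrary illustrative choice

    def flag_repeat_clicks(clicks):
        """Flag clicks whose apparent user already clicked the same ad within TIMEOUT seconds.
        Each click is a dict with 'ip', 'user_agent', 'ad_id' and 'timestamp' (epoch seconds)."""
        last_seen = {}
        flagged = []
        for click in sorted(clicks, key=lambda c: c['timestamp']):
            key = (click['ip'], click['user_agent'], click['ad_id'])
            prev = last_seen.get(key)
            is_repeat = prev is not None and click['timestamp'] - prev < TIMEOUT
            last_seen[key] = click['timestamp']
            flagged.append(dict(click, repeat=is_repeat))
        return flagged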

2. A Black and White Universe, or is it Grey?

Our experience has shown that web traffic isn’t black or white: there is a whole range from low-quality to great traffic. Also, non-converting traffic is not necessarily bad, and in many cases can actually be very good; lack of conversions might be due to poor ads, or poorly targeted ads. This raises two points:

  • Traffic scoring: while as much as 5% of the traffic from any source can be easily and immediately identified as totally unbillable, with no chance of ever converting, a much larger portion of the traffic has generic quality issues – issues that are not specific to a particular advertiser. A traffic scoring approach (click or impression scoring) provides a much more actionable mechanism both for search engines interested in ranking distribution partners, and for advertisers refining their ad campaigns.
  • A generic, universal scoring approach allows advertisers with limited or no ROI metrics to test new sources of traffic, knowing beforehand where the generically good traffic is, regardless of conversions. This can help advertisers substantially increase their reach and tap into new traffic sources, rather than chase very small ROI improvements through A/B testing. Advertisers that convert offline, are victims of bogus conversions, or are interested in branding will find click scores most valuable.

A scoring approach can help search engines determine the optimum price for multiple clicks (by multiple clicks we mean true user-generated multiple clicks, not a double click resulting from a technical glitch). By incorporating the score into their smart pricing algorithms, search engines can reduce the loss due to the simplified business rule “one click per ad per user per day”.
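
As a hypothetical illustration of the smart-pricing idea, the sketch below bills each click in proportion to its quality score instead of applying the all-or-nothing one-click-per-ad-per-user-per-day rule. The 0-to-1 score scale and the linear discount are assumptions for illustration, not an actual smart-pricing formula.

    def billable_amount(clicks, cpc):
        """Bill each click in proportion to its quality score (0 = certainly bad, 1 = excellent),
        instead of billing the first click in full and discarding the rest."""
        return sum(cpc * c['score'] for c in clicks)

    # Three clicks on the same ad by the same user in one day:
    clicks = [{'score': 0.9}, {'score': 0.4}, {'score': 0.1}]
    print(billable_amount(clicks, cpc=0.50))  # 0.70, versus 0.50 (first click only) or 1.50 (bill all)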

Search engines, publishers and advertisers can all win: poor-quality publishers can now be accepted into a network, but are priced correctly so that the advertiser still has a positive ROI. And good publishers experiencing a drop in quality can have their commissions lowered according to click scores, rather than being dropped outright. When their traffic improves, their commissions increase accordingly, based on scores.

In order to make sense for search engines, a scoring system needs to be as generic as possible. The scores that we have developed meet this criterion. Our click scores have been designed to match the conversion rate distribution, using very generic conversions, taking bogus conversions into account, and relying on a patent-pending methodology to match a conversion with a click through correct user identification. As everybody knows, an IP address can have multiple users attached to it, and a single user can have multiple IP addresses within a two-minute period. Cookies (particularly in server logs, less so in redirect logs) also have notorious flaws, and we do not rely on cookies when dealing with advertiser server log data.

We have designed scores based on click logs, relying, among other things, on network topology metrics. We have also designed scores based on advertiser server logs, again relying on network topology metrics (distribution partners, unique browsers per IP cluster, etc.) and even on impression-to-click ratio and other search engine metrics, as we reconcile server logs with search engine reports to get the most accurate picture. Using search engine metrics to score advertiser traffic allows us to design good scores for search engine data, and vice versa, as search engine scores are correlated with true conversions. It also makes us one of the very few third-party traffic scoring companies serving both sides equally well.

When dealing with advertiser server logs, the reconciliation process and the use of appropriate tags (e.g. Google’s gclid) whenever possible allow us to avoid counting clicks that are an artifact of browser technology. We have filed a patent application to eliminate what Google calls “fictitious clicks” and, more generally, to eliminate clicks from clickbots.

Advertiser scores are designed to be a good indicator of conversion rate. Search engine scores use a combination of weights based both on expert knowledge and on advertiser data. Scores have been smoothed and standardized using the same methodology used in credit card scoring. The best quality assessment systems will rely on both our real-time scores and our less granular scores, such as end-of-day scores.
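
As a rough illustration of the credit-scoring-style standardization mentioned above, the sketch below maps raw scores onto a fixed scale via their empirical percentiles, so that distributions remain comparable across campaigns. The 300-850 range is borrowed from consumer credit scores purely for illustration; it is not our actual scale.

    import numpy as np

    def standardize_scores(raw_scores, lo=300, hi=850):
        """Map raw click scores onto a fixed scale via their empirical percentiles,
        so that score distributions are comparable across campaigns and over time."""
        raw = np.asarray(raw_scores, dtype=float)
        ranks = raw.argsort().argsort()          # rank of each click, 0 .. n-1
        pct = (ranks + 0.5) / len(raw)           # empirical percentile
        return lo + pct * (hi - lo)

    print(standardize_scores([0.02, 0.31, 0.30, 0.95]).round())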

The use of a smooth score, based on solid metrics, substantially reduces false positives. If a single rule is triggered, or even two rules, the click might barely be penalized. Also, if a rule is triggered by too many clicks or is not correlated with true conversions, it is ignored. For instance, a rule formerly known as “double click” (with enough time between the two clicks) was found to be a good indicator of conversion, and was changed from a rule into an anti-rule in our system whenever the correlation is positive. A click with no external referral but otherwise normal will not be penalized, after score standardization.
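
A toy illustration of the rule/anti-rule mechanism described above: each rule carries a signed weight, mild rules barely move the score, and a rule that turns out to be positively correlated with conversions simply receives a positive weight. The rules, weights and field names are made up for illustration.

    # Hypothetical rule set: negative weights penalize a click, positive weights act as anti-rules.
    RULES = [
        (lambda c: c['clicks_per_ip_today'] > 50,  -3.0),  # heavy repeat clicking from one IP
        (lambda c: c['referrer'] == '',            -0.5),  # missing external referral: mild penalty only
        (lambda c: c['spaced_double_click'],       +1.0),  # former "double click" rule, now an anti-rule
    ]

    def raw_score(click, base=5.0):
        """Add the weights of all triggered rules to a base score."""
        return base + sum(weight for rule, weight in RULES if rule(click))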

3. Mathematical Model

The scoring methodology developed by Authenticlick is state-of-the-art. It is based on almost 30 years of experience in auditing, statistics and fraud detection, both in real time and on historical data. Several patents are currently pending.

It combines sophisticated cross-validation, design of experiments, linkage and unsupervised clustering to find new rules, machine learning, and the most advanced models ever used in scoring, with a parallel implementation and fast, robust algorithms that produce at once a large number of small overlapping decision trees. The clustering algorithm is a hybrid combination of unique decision-tree technology with a new type of PLS logistic stepwise regression designed to handle tens of thousands of highly redundant metrics. It provides meaningful regression coefficients computed in a very short amount of time, and efficiently handles interactions between rules.
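
We do not reproduce the proprietary algorithm here, but the flavor of "a large number of small overlapping decision trees" can be conveyed by a plain bagging-style sketch: many shallow trees, each fit on an overlapping random subset of clicks and metrics, with predictions averaged into a raw score. scikit-learn is used purely for illustration; this is not our implementation.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def small_tree_ensemble(X, y, n_trees=200, depth=3, row_frac=0.5, col_frac=0.3, seed=0):
        """Fit many shallow trees (on a 0/1 conversion target) over overlapping random subsets
        of clicks (rows) and metrics (columns); the averaged prediction is a raw click score."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        trees = []
        for _ in range(n_trees):
            rows = rng.choice(n, size=int(row_frac * n), replace=True)
            cols = rng.choice(p, size=max(1, int(col_frac * p)), replace=False)
            t = DecisionTreeRegressor(max_depth=depth).fit(X[np.ix_(rows, cols)], y[rows])
            trees.append((t, cols))
        def score(X_new):
            return np.mean([t.predict(X_new[:, cols]) for t, cols in trees], axis=0)
        return score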

Some aspects of the methodology show limited similarities with ridge regression, tree bagging and tree boosting. Below we compare the efficiency of different systems to detect click fraud on highly realistic simulated data. The criterion for comparison is the mean square error, a metric that measures the fit between scored clicks and conversions:

  • Scoring system with identical weights: 60% improvement over binary (fraud / non fraud) approach
  • First-order PLS regression: 113% improvement over binary approach
  • Full standard regression (not recommended as it provides highly unstable and non-interpretable results): 157% improvement over binary approach
  • Second-order PLS regression: 197% improvement over binary approach, easy interpretation and robust, nearly parameter-free technique

Substantial additional improvement is achieved when the decision trees component is added to the mix. Improvement rates on real data are similar.
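
For concreteness, the mean square error criterion used above can be computed as follows; the point is simply that a graded score tracks conversions more closely than a binary fraud / non-fraud flag. The toy numbers below are made up and unrelated to the figures quoted above.

    import numpy as np

    def mse(score, converted):
        """Mean square error between the click score (expected conversion rate) and observed conversions."""
        return float(np.mean((np.asarray(score, dtype=float) - np.asarray(converted)) ** 2))

    converted    = np.array([0, 0, 1, 0, 1, 0])
    binary_score = np.array([1, 1, 1, 0, 1, 0])               # fraud / non-fraud decision, read as 0 or 1
    graded_score = np.array([0.1, 0.3, 0.7, 0.2, 0.6, 0.2])

    print(mse(binary_score, converted), mse(graded_score, converted))  # the graded score has the lower MSE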

4. Bogus Conversions

The reason we elaborate a bit on bogus conversions is that their impact is worse than most people think. If not taken care of, they can make a fraud detection system seriously biased. Search engines that rely on pre-sales or non-sales conversions such as sign-up forms to assess traffic performance can be misled into thinking that some traffic is good when it actually is poor, and the other way around.

Usually, the advertiser is not willing to provide too much information to the search engine, so conversions are generally computed by having the advertiser place some JavaScript code or a clear gif on target conversion pages. The search engine is then able to track conversions on these pages. However, the search engine has no control over which “converting pages” the advertiser wants to track. Also, the search engine has no visibility into what happens between the click and the conversion, or after the conversion. If the search engine has access to pre-sale data only, the risk of bogus conversions is high. We have actually noticed a significant increase in bogus conversions from some specific traffic segments.

Another issue with bogus conversions arises when an advertiser (let’s call it an ad broker) purchases traffic upstream, then acts as a search engine and distributes the traffic downstream to other advertisers. This business model is widespread. If the traffic upstream is artificial but results in many bogus conversions – a conversion being a click or lead delivered downstream – the ad broker does not see a drop in ROI. She might actually see an increase in ROI. Only the advertisers downstream start to complain. Once the problem starts being addressed, it might be too late, and the ad broker may lose clients. Had the ad broker used a scoring system such as ours, the bogus conversions would have been detected early, even with the ROI unchanged.

This business flaw can be exploited by criminals running a network of distribution partners. Smart criminals will hit this type of “ad broker” advertiser harder: the criminals can generate bogus clicks to make money themselves, and as long as they generate a decent amount of bogus conversions, the victim is making money too and might not notice the scheme. If the conversions are tracked by the upstream search engine (where the traffic originates), the clicks might erroneously be considered very good.

5. A Few Misconceptions

It has been argued that the victims of click fraud are good publishers, not advertisers, since advertisers automatically adjust their bids. However, this does not apply to advertisers lacking good conversion metrics (e.g. if conversion takes place offline), nor to smaller advertisers who do not update bids and keywords in real time. It can actually lead advertisers to permanently eliminate whole traffic segments, and to miss out on the good ROI once the fraud problem gets fixed on the network. On some 2nd-tier networks, impression fraud can lead to an advertiser being kicked out one day, without the ability to ever come back. Both the search engine and the advertiser lose in this case, and the ones who win are the bad guys now displaying cheesy, irrelevant ads on the network. The website user loses too, as all the good ads have been replaced with irrelevant material.

Another point that we sometimes hear is that 3rd-party auditors do not have access to the right data. Not only can auditors with large volumes of traffic track network flows just as search engines do, they also have access to more comprehensive conversion data, and are better equipped to detect bogus conversions. In our case, we process search engine and advertiser data: large volumes of data in both cases. However, some auditing firms lacking statistical expertise and/or domain knowledge have had serious flaws in their counting methodology. These flaws have been highly publicized by Google, and overestimated. Due to “fictitious clicks”, 1,000 clicks are on average reported as 1,400 clicks by some auditing firms, according to a well-known source. The 400 extra “non-clicks” or “fictitious clicks” (they never existed) are said to come from users clicking on the back button of their browser. Yet it is well known that most visits are just one page long, and content displayed by back-clicking is usually served from the browser cache, so it never reaches the advertiser’s server logs. Thus this 1,400 / 1,000 ratio does not make sense. We believe the issue is of a different nature, such as counting all HTTP requests associated with one page because the click tags are attached to all requests, depending on server configuration. It is also an issue that we addressed long ago.

Auditing firms performing good quality reconciliation also have access to many metrics typically used by fraud detection systems for search engines: average ad position, bid, impression-to-click ratio, etc.

Finally, many systems to detect fraud are still essentially based on outlier detection and detecting shifts from the average. Based on our experience in the credit card fraud industry, we know that most fraudsters try very hard to look as average as possible: avoiding expensive or cheap clicks, using the right distribution of user agents, generating a small random number of clicks per infected computer per day, except possibly for clicks going through AOL or other proxies. This type of fraud needs a truly multivariate approach, looking at billions of combinations of several carefully selected variables simultaneously and searching for statistical evidence in billions of tiny click segments, to unearth the more sophisticated fraud cases impacting large volumes of clicks, possibly orchestrated by terrorists or large corrupt financial institutions rather than by distribution partners.

Saturday, July 01, 2006

Efficient Click Fraud Detection using Advanced Analytics

To some extent, the technology to combat click fraud is similar to what banks use to combat credit card fraud. The best systems are based on statistical scoring technology, as the transaction - a click in our context - is usually neither clearly bad nor clearly good.


Multiple scoring systems based e.g. on IP and click scores, scorecards and metric mix optimization are the basic ingredients. Because of the vast amount of data, and potentially millions of metrics used in a good scoring system, combinatorial optimization is required, using algorithms such as Markov Chain Monte Carlo or simulated annealing.
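
A minimal sketch of metric mix optimization by simulated annealing: the state is a subset of metrics, the objective is any loss computed on a score built from that subset (e.g. cross-validated MSE against conversions), and the temperature schedule controls how often worse subsets are still accepted. The objective is left as a callback; this is an illustration under those assumptions, not our production optimizer.

    import math, random

    def anneal_metric_mix(metrics, objective, n_iter=5000, t0=1.0, seed=0):
        """Simulated annealing over subsets of metrics.
        `objective(subset)` returns a loss to minimize (e.g. cross-validated MSE of the score)."""
        rng = random.Random(seed)
        current = set(rng.sample(metrics, k=max(1, len(metrics) // 10)))
        loss = objective(current)
        best, best_loss = set(current), loss
        for i in range(1, n_iter + 1):
            temp = t0 / math.log(i + 1)                    # slow cooling schedule
            candidate = set(current)
            candidate.symmetric_difference_update({rng.choice(metrics)})  # flip one metric in or out
            if not candidate:
                continue
            cand_loss = objective(candidate)
            if cand_loss < loss or rng.random() < math.exp((loss - cand_loss) / temp):
                current, loss = candidate, cand_loss
                if loss < best_loss:
                    best, best_loss = set(current), loss
        return best, best_loss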


While scoring advertiser data can be viewed as a regression problem, the dependent variable being the conversion metric, scoring search engine data is more challenging as conversion data is not readily available. Even when dealing with advertiser data, we have several issues to address. First, the scores need to be standardized. Two identical ad campaigns might perform very differently if the landing pages are different. The scoring system needs to address this issue.


Also, while scoring can be viewed as a regression problem, it is a very difficult one. First, the metrics involved are usually highly correlated, making the problem ill-conditioned from a mathematical viewpoint. There might be more metrics (and thus more regression coefficients) than observed clicks, making the regression approach highly unstable. Finally, the regression coefficients - also referred to as weights - must be constrained to take only a few potential values. The dependent variable being binary, we are dealing with a sophisticated ridge logistic regression problem.
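
The standard remedy for this kind of ill-conditioning is an L2 (ridge) penalty; the sketch below uses scikit-learn's penalized logistic regression for illustration. The further constraint that weights take only a few discrete values would require the combinatorial step discussed above and is not shown here.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # X: one row per click, one column per (highly correlated) metric; y: 1 if the click converted.
    # C is the inverse of the ridge penalty: a small C shrinks the correlated weights toward zero.
    def fit_ridge_logistic(X, y, C=0.1):
        model = LogisticRegression(penalty='l2', C=C, max_iter=1000)
        return model.fit(X, y)

    # Hypothetical usage:
    # model = fit_ridge_logistic(X, y)
    # scores = model.predict_proba(X_new)[:, 1]   # estimated conversion probability per click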


The best technology will actually rely on a hybrid system that can handle contrarian configurations, such as "time < 4am" is bad, "country not US" is bad, but "time < 4am and country = UK" is good. Good cross-validation is also critical to eliminate configurations and metrics with no statistical significance or poor robustness. Careful metric binning and a fast, distributed feature optimization algorithm are important as well.
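
The contrarian configuration above can be read as an interaction term that overrides the two marginal rules; a literal, toy encoding (field names and weights are made up):

    def rule_adjustment(click):
        """Toy encoding of a contrarian configuration: the joint condition overrides the marginals."""
        early   = click['hour'] < 4
        foreign = click['country'] != 'US'
        if early and click['country'] == 'UK':
            return +1.0            # jointly, this segment is good
        return (-1.0 if early else 0.0) + (-1.0 if foreign else 0.0)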


Finally, design of experiments to create test campaigns - some with a high proportion of fraud and some with no fraud - as well as the use of generic conversions and proper user identification, is critical. And let's not forget that failing to remove bogus conversions will result in a biased system with many false positives.

Friday, May 26, 2006

New Developments in Click Fraud Detection

Definition


Although there is no formal definition of click fraud, it is customary to consider fraudulent any click not resulting from a user genuinely interested in an ad found in a pay-per-click search engine network such as Google or Yahoo. This definition encompasses competitor fraud (depleting your competitor's budget), distribution partner fraud and other types of fraud committed either with or without financial incentives, as well as accidental fraud. Most but not all click fraud cases are potentially subject to prosecution, e.g. under the unfair business practice code.


New Patterns and Trends


There is increasing evidence that new patterns are emerging. While Google has improved impression fraud detection – a practice consisting of generating bogus impressions to reduce your competitors' ad relevancy and drive them out of Google – the fraud has spread to Yahoo and MSN. And more sophisticated bogus impression schemes are taking place on Google. Political activists and disgruntled employees, a new type of fraudster not motivated by money, click on expensive paid ads from companies that they hate. They know which keywords are expensive.


Traffic distribution partners willing to eliminate competing affiliates on a search engine network are rumored to have used click fraud warfare, or clickware. Other fraudsters, in an attempt to hide their activity, are generating bogus impressions, bogus clicks and also bogus conversions. To remain undetected, they keep their CTRs and conversion rates at more discreet - yet still too high - levels.


On the other side, many companies are changing their employee internet usage policies for increased security. This means that sometimes a single company or government agency uses spoofed IP addresses, or one IP address and one browser shared by 50,000 employees. This can cause fraud detection systems to fail and generate many false positives, thus inflating fraud numbers. As far as organic search is concerned, we are worried about individuals who have been banned by Google using the same techniques that got them banned to eliminate their competitors. This and other schemes have the potential to reduce search result relevancy, already low in some categories such as mortgages. However, search engines will fight back with more advanced relevancy algorithms. This is actually one of the priorities for MSN and many others.


On the positive side, we see that some search engines are taking the click fraud issue seriously. Over the long term, we believe that the concept of click fraud will be replaced by the much more meaningful concept of click quality or click profiling, a concept that we are currently implementing (see ClickProfiling.com).


True click fraud is illegal clicking worth investigating by the SEC or FBI because of potential connections with international crime, shareholder fraud or terrorism funding. It represents a small but potentially fast-growing percentage, due to the technical expertise of these groups. From a click scoring viewpoint, extremely poor clicks account for 10%, very poor clicks for 10%, poor clicks for 10%, and below-average clicks for another 20% of all clicks. Correctly identifying these click segments using an appropriate click scoring system is of critical importance to increase ROI. Sophisticated keyword selection systems should automatically buy tens of thousands of under-sold keywords and automatically set ads on Google and Yahoo, ideally three ads per keyword. Ebay and Amazon have yet to substantially improve their automated bidding tools, though.


Over the long term, advertisers will get smarter. Increased PPC with increased fraud, and thus lower or even negative ROI, cannot be sustained. We believe that the future will eventually bring better fraud detection and increased ROI – possibly with higher PPC - thanks in part to more knowledgeable advertisers and better relevancy algorithms.


Case Studies


Examples of false positives that we were able to identify include a large corporation, let's call it Acme, and the US Army. In the case of Acme, an alarm was raised because of thousands of clicks per day, day after day, from the same IP address and same browser, all seemingly coming from the same user. However, the keywords associated with the clicks – both paid and unpaid - the velocity and timing, and the proportion of paid clicks and referrals did not show unusual patterns. It was found that Acme uses one IP address and one browser for all its employees. Similarly, after investigating a bucket of clicks with highly suspicious spoofed IPs, it was found that the addresses were used by the US Army to hide their true origin. This prevents potential criminals from being indirectly informed (by checking IP addresses in their server logs) that they are being monitored by the Army. Again, the clicks were legitimate.


Conversely, we correctly identified another set of spoofed IP addresses as fraudulent with our metric mix, which incorporates proprietary keyword categorizations and multivariate statistical distributions. Email spammers accidentally clicking on paid ads with web robots in their effort to harvest email addresses made a few mistakes: they were using the same number of clicks per IP per day, at least on the IP addresses that they did not share with legitimate users. In another case, our linkage analysis revealed that thousands of IP addresses were switched off by one distribution partner caught in click fraud. When they reappeared, they were attached to a new partner, clearly showing that the fraud involved clickware or adware. The fraudster knew which computers were infected and possibly sold this information to another criminal.


Finally, we are dealing not only with counterfeit clicks, but also with fake impressions and bogus conversions. Click scoring is a complex problem: bogus conversions involve purchases with stolen credit cards or users paid to fill in forms and provide fake information. They can make poor clicks look good if undetected. However, we have developed methodology that preserves the quality of our click scoring system. Interestingly, one of our clients was using a click fraud detection system that failed to capture these bogus conversions in a fraud scheme, because their previous click monitoring system relied on JavaScript and clear gifs.


Fraud Schemes, Clickware


Different types of undetectable attacks can be carried out against internet companies that bill advertising clients using logfile statistics. These attacks usually rely on IP masking, IP masquerading and fake referrals. IP masking is accomplished by having a web robot access web pages through several hundred anonymous proxy servers.


In another scenario, trojans are uploaded on popular shareware sites. Once downloaded by a user, these trojans perform the useful tasks they are supposed to do (e.g. hard drive cleaning, virus scanning etc.) but in addition, they randomly "click" on target links, writing fake information in target logfiles using web robot technology.


Competing advertisers, affiliates or partners in a pay-per-click program might want to kill each other to gain market share, using click spam. Target links could consist of paid links associated with selected advertising clients (e.g. the perpetrator's competitors) or expensive paid keywords (e.g. "bulk Email" or "online casino") on pay-per-click search engines. Another version of this attack could rely on a virus with an embedded web robot instead of a trojan. The resulting fake information in the target logfiles cannot be distinguished from legitimate clicks from real users. The fake clicks have a 0% click-to-sale ratio, driving the advertiser's ROI into negative territory. We have computed that it is possible to generate $200 million in illegitimate charges with a click spam program running non-stop over a 12-month period on one server.
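
The $200 million figure is a back-of-envelope estimate; the sketch below shows the kind of arithmetic involved, with a click rate and cost-per-click chosen purely for illustration (they are not the assumptions behind the figure quoted above).

    # Hypothetical parameters: one server issuing a modest, sustained stream of fake clicks.
    clicks_per_second = 3          # assumed, well within one server's capability
    avg_cpc           = 2.00       # assumed average cost per click, in dollars
    seconds_per_year  = 365 * 24 * 3600

    print(clicks_per_second * avg_cpc * seconds_per_year)   # roughly $189 million per year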


More recent cases involve ad relevancy fraud. It is possible to eradicate advertisers on AdSense for popular keywords, with a combination of bogus impressions and self-clicks, without using fraudulent clicks.
Another scenario consists of a shareholder essentially using AOL IP addresses and other non-anonymous proxies to commit large-scale fraud on high-dollar keywords on a 3rd-tier search engine, in order to manipulate the stock price. Once caught, the shareholder would claim to be the victim of very sophisticated criminals who spoofed his IP address and are trying to hurt the company that he targets with click fraud. Such a bogus claim is almost impossible to defeat in court, as true IP spoofing really exists and makes the true (non-existent, in this case) "spoofer" essentially indistinguishable from the (self-proclaimed, in this case) "spoofee".


A final example would be an advertiser who was banned from Google organic search through nefarious actions committed by one of his competitors, is unable to get back into Google's unpaid search results, and then seeks revenge by retaliating against all his competitors. He would use an expert scheme involving trending, impression and click fraud distilled over many months. The fraud would increase very slowly over time, making competitors' CTRs a little bit worse each month and his own CTR a little better (by clicking on his own ads once in a while). Along the same lines, one can think of a distribution partner artificially inflating his revenues by 1% the first month, 2% the second month, and so on, with a cap set at 5%.


Our Approach: Click Scoring


While we have considerable experience with both advertiser and search engine data, this section focuses on advertiser data. One critical issue is how to attach a conversion to a click. We have developed patent-pending technology that enables us to correctly identify a unique AOL user, whether genuine, bogus or spoofed. The algorithm even recognizes when a sale from one IP address originates from a click made from a totally different IP address. It will also detect when a sale and a click from the same IP address are actually generated by unrelated users who merely share, or temporarily share, that IP address. In most cases, we are also able to explain the missing clicks: clicks listed in Google reports but not seen in server logs. These amount to 50% of billed clicks in some cases. In one severe case of missing clicks, we were able to reduce the discrepancy from 50% to 0% and maximize savings to the client.
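
A highly simplified sketch of matching search-engine-reported clicks to server-log clicks, using a click tag such as Google's gclid when present and falling back to a time-window match on IP address; the field names and the 10-minute window are illustrative assumptions, not the patent-pending method described above.

    def reconcile(report_clicks, log_clicks, window=600):
        """Match clicks billed by the search engine against clicks seen in the advertiser's server logs.
        Returns (matched pairs, missing clicks billed but never seen in the logs)."""
        by_gclid = {c['gclid']: c for c in log_clicks if c.get('gclid')}
        matched, missing = [], []
        for rc in report_clicks:
            hit = by_gclid.get(rc.get('gclid'))
            if hit is None:  # fall back: same IP within `window` seconds
                hit = next((lc for lc in log_clicks
                            if lc['ip'] == rc['ip'] and abs(lc['timestamp'] - rc['timestamp']) <= window),
                           None)
            if hit:
                matched.append((rc, hit))
            else:
                missing.append(rc)
        return matched, missing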


From a statistical viewpoint, click scoring for advertiser data can be viewed as a general scoring technology. The scoring system is designed in such a way that the score distribution matches conversion rates. Critical issues include the use of universal conversions (with detection of bogus conversions) and standardized scores, selection of an efficient metric mix, and optimized robust metric weights, generally obtained as the solution of a ridge regression problem involving combinatorial optimization (e.g. meta-feature optimization), optimum metric binning, tree forests, or contrarian scoring technology. It is also important to detect the (possibly site-dependent) optimum timeout parameter in the user identification algorithm, as we cannot rely on cookies to identify users.
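
One simple, hypothetical way to pick the (possibly site-dependent) timeout used in user identification: scan candidate timeouts and keep the one that maximizes a figure of merit, for instance how well the resulting score distribution matches observed conversion rates. The candidate grid and the two callbacks below are placeholders, not our actual procedure.

    def best_timeout(clicks, conversions, build_sessions, figure_of_merit,
                     candidates=(5*60, 15*60, 30*60, 60*60)):
        """Scan candidate session timeouts (in seconds) and return the one with the best figure of merit.
        `build_sessions(clicks, timeout)` groups clicks into users/sessions;
        `figure_of_merit(sessions, conversions)` returns a quality measure to maximize."""
        scored = ((figure_of_merit(build_sessions(clicks, t), conversions), t) for t in candidates)
        return max(scored)[1]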


Reference


Click Fraud Resistant Methods for Learning Click-Through Rates. Nicole Immorlica et al. Microsoft Research, 2006.