
Friday, June 15, 2012

60 articles from the data science eBook


Part I - Data Science Recipes
  1. New random number generator: simple, strong and fast
  2. Lifetime value of an e-mail blast: much longer than you think
  3. Two great ideas to create a much better search engine
  4. Identifying the number of clusters: finally a solution
  5. Online advertising: a solution to optimize ad relevancy
  6. Example of architecture for AaaS (Analytics as a Service)
  7. Why and how to build a data dictionary for big data sets
  8. Hidden decision trees: a modern scoring methodology
  9. Scorecards: Logistic, Ridge and Logic Regression
  10. Iterative Algorithm for Linear Regression
  11. Approximate Solutions to Linear Regression Problems
  12. Theorems for Traders
  13. Preserving metric and score consistency over time and across clients
  14. Advertising: reach and frequency mathematical formulas
  15. Real Life Example of Text Mining to Detect Fraudulent Buyers
  16. Discount optimization problem in retail analytics
  17. Sales forecasts: how to improve accuracy while simplifying models?
  18. How could Amazon increase sales by redefining relevancy?
  19. How to build simple, accurate, data-driven, model-free confidence i...
  20. Comprehensive list of Excel errors, inaccuracies and use of non-sta...
  21. 10+ Great Metrics and Strategies for Email Campaign Optimization
  22. 10+ Great Metrics and Strategies for Fraud Detection
  23. Case Study: Four different ways to solve a data science problem
  24. Case Study: Email marketing -  analytic tips to boost performance b...
  25. Optimize keyword campaigns on Google in 7 days: an 11-step procedure
  26. How do you estimate the proportion of bogus accounts on Facebook?
Part II - Data Science Discussions
  1. Statisticians Have Large Role to Play in Web Analytics (AMSTAT inte...
  2. Future of Web Analytics: Interview with Dr. Vincent Granville
  3. Connecting with the Social Analytics Experts
  4. Interesting note and questions on mathematical patents
  5. Big data versus smart data: who will win?
  6. Creativity vs. Analytics: Are These Two Skills Incompatible?
  7. Barriers to hiring analytic people
  8. Salary report for selected analytical job titles
  9. Are we detailed-oriented or do we think "big picture", or both?
  10. Why you should stay away from the stock market
  11. Gartner Executive Programs' Worldwide Survey of More Than 2,300 CIOs
  12. Analysts Explore Cloud Analytics at Gartner Business Intelligence Summit 2012
  13. One Third of Organizations Plan to Use Cloud Offerings to Augment BI Capabilities
  14. Twenty Questions about Big Data and Data Sciences
  15. Interview with Drew Rockwell, CEO of Lavastorm
  16. Can we use data science to measure distances to stars?
  17. Eighteen questions about real time analytics
  18. Can any data structure be represented by one-dimensional arrays?
  19. Data visualization: example of a great, interactive chart
  20. Data science jobs not requiring human interactions
  21. Featured Data Scientist: Vincent Granville, Analytic Entrepreneur
  22. Healthcare fraud detection still uses cave-man data mining techniques
  23. Why are spam detection algorithms so terrible?
  24. What is a Data Scientist?
  25. Twenty seven types of data scientists:  where do you fit?
Part III - Data Science Resources
  1. Vincent’s list
  2. History of 24 analytic companies over the last 30 years
  3. Fifteen great data science articles from influential news outlets
  4. List of publicly traded analytic companies
  5. Thirty unusual applications of data sciences, analytics and big data
  6. 50 unusual ways analytics are used to make our lives better
  7. Berkeley course on Data Science

Tuesday, August 19, 2008

New fraud scheme on Google (phishing / click fraud)

Fraudsters send you a fake email about your AdWords account being terminated. They ask you to renew your account by logging in to a fake Google AdWords website that looks real. That's how they steal your login and password. Once your account is hijacked, they increase your daily budget and your bids for keywords that are part of their botnet system. In the process, they might also steal your credit card details or other useful information (your address for identity theft, your keyword list to feed their botnet).

Complaint received from a client:

Vincent....don't know if you'd be interested in this...but i use
google ad words & just recently someone hacked into my profile &
changed my daily max from $10 to $6,810, and then miraculously i
received over 1,000 clicks that day at $5.50 per click....they were
trying to charge me over $7K. I reported it and about a week later
they admitted it was not legitimate. Have you heard of this


Email sent by fraudsters:

Renew Your Account Now !

Dear Member,

This is your official notification from Google Inc. that the service(s) listed below will be deactivated and deleted if not renewed immediately.

As the Primary Contact, you must renew the service(s) listed below or it will be deactivated and deleted.

Renew Now your Google AdWords services. [link deleted]

SERVICE: Google AdWords
EXPIRATION: August, 19 2008

Thank you for using Google Inc service.
We appreciate your business and the opportunity to serve you.

Google AdWords Service .

Sunday, February 24, 2008

Invitation to join Analytic Bridge

Analytic Bridge has grown from 20 to about 400 people in just one week. We invite you to revisit our network, and sign up if you are not already a member.

In the last seven days, we have added many groups, several white papers, dozens of useful links. Also, members have contributed to several forums, including

  • Explanation of Variance Inflation Factor
  • Data Validation
  • Post your best graphs in our photo section
  • How to produce nice graphs with R?
  • Data Warehousing, ETL and Business Intelligence opportunities
  • Professional Certificates (chartered statistician, SAS certified, series 6, etc.)
  • Spatial ETL Pros Needed for Leader in Geographic Business Intelligence Solutions
  • Genetic Data Mining Method for the Proper Use of the Correlation Coefficient
  • Who makes $100K or more a year?
  • Companies hiring statisticians and data miners
  • Best books for learning data mining
  • Basic Introduction to Text Mining
  • Non-Linear ARIMA using neural nets?
  • Statistics handbooks now available in the links section
  • Jobs in Switzerland
  • Interesting discussions on the Web Analytics group
  • Data mining blog
  • LinkedIn, Plaxo, Facebook and other networks
  • Building Statistical Regression Models: Straight Data are Necessary
  • Domain names for sale
  • Career paths: switching to a different industry
  • Generalized Goldbach Conjecture and Integer Coverages
  • XML job feeds available for your blog
  • Starting Salaries for Analytic Graduates
  • Useful links
  • Statistical Software - Comparative Analysis
To join Analytic Bridge, visit us at http://www.analyticbridge.com/. Members are also entitled to a 20% discount on all products available on DataShapingStore.com. Please contact us for details.


Vincent Granville, Ph.D.
Founder and Principal
http://www.datashaping.com/
http://www.analyticbridge.com/

Tuesday, July 03, 2007

Massive Click Fraud Case Unearthed in our Laboratory

Here we provide specific details about a widespread botnet still operating. As many as 50% of all advertisers may be victims, albeit with a low frequency. It is connected with a particular search distribution partner on the largest search engine network. We will call it Spiralup, although its real name is different. Their brand is associated with spyware, though they have clearly added click fraud to their areas of focus.

  1. Their traffic has been growing exponentially over the last few years, according to Alexa (see graph below). Note that Alexa can’t always discriminate between real and fake traffic. Software (AlexaBooster) is available which allows a user to artificially inflate Alexa rankings.
  2. Note two sharp dips in early 2006 and 2007 (see graph below).
  3. In 2006, the browser distribution was different, with more Firefox, possibly indicating a network of human beings paid to click.
  4. In 2007, the browser distribution shifted, favoring Internet Explorer, as they employ a botnet programmed specifically for IE but not for other browsers.
  5. They continually add new advertisers to their target list, but rarely generate more than 3 clicks per day per advertiser. Newly infected computers are assigned to advertisers recently added to their list.
  6. Advertisers accepting clicks from foreign countries, and small advertisers, are hit hardest.
  7. A portion of their traffic is real, a portion of it is bogus, generated by botnets (clicking agents attached to viruses), and a portion of it comes from human beings paid to click according to a pre-specified schedule.
  8. Because they have infected so many computers, they are able to use a very large pool of IP addresses, though the traffic skews towards international, and some specific IP blocks and foreign transparent proxies are widely used.
  9. Their traffic patterns are associated with unrealistic variances and they generate an extremely high proportion of bogus conversions.
  10. Below is a table with four sample clicks:

    • 13/May/2007:08:58:54, query=data+marts, IP=xxx.139.16.154
    • 02/May/2007:04:31:47, query=on+line+shopping+sears+canada, IP=xxx.55.121.2
    • 06/Jan/2007:02:22:23, query=malpractice, IP=xxx.115.106.226
    • 13/Feb/2007:19:33:17, query=fort+myers+mesothelioma+lawyers, IP=xxx.152.21.8

    Details:

    • Each click is from a different advertiser.
    • Each click has a Google gclid tag.
    • The time zone is from the advertiser log.
    • The first click was billed at full price (even days later, the charge did not disappear). It resulted in a bogus conversion. It also triggered an HTTP request on the target page for a blank stylesheet.
    • This means that the botnet is a parasite of Internet Explorer, and does not have its own code to connect to the Internet, but rather relies on Internet Explorer to do so.
    • All four clicks have IE 6 as a user agent, as one would expect.


Spiralup's exponential traffic growth:




Sunday, April 22, 2007

Click Fraud Attacks: Emerging Trends

Click fraud attacks have become significantly more sophisticated over the last few months. At the same time, click fraud detection systems are becoming increasingly efficient at detecting smart attacks. Here, we describe three cases that were caught by Authenticlick over the last seven days.
  • Bogus Conversions

    Over a period of several months, a single distribution partner generating well over 1% of the traffic from the leading search engine network was responsible for up to 15% of the downstream conversions. All these conversions were found to be fake. The distribution partner in question was targeting advertisers where conversions consist of filling out a web form. These advertisers are an easy target for smart fraudsters. In addition to generating bogus conversions, the culprit operated from abroad and experienced an unusually fast rate of exponential growth over the last two years.

  • Fraud through AOL and other "good proxies"


    Another fraud case was identified last week, generating a large proportion of clicks from known good proxies including AOL. This type of scheme is more difficult to detect. Authenticlick was able to unearth the fraudulent activity thanks to advanced methodology based on network topology metrics. It is interesting to note that the fraud scheme was detected, even though the data submitted by the search engine did not include any information about the user agent.

  • Fraud involving a symbiotic relationship between a distribution partner and an advertiser


    This interesting fraud case involves a very large number of IP addresses, but a very small number of advertisers. It was first identified by Authenticlick in April 2007. It is believed that either the advertiser and the fraudster have a symbiotic relationship, or the advertiser is a victim who benefits from click fraud as the fraudster improves the victim's ROI, through a particular type of fraud described here.


Additional Notes about Adware

The last fraud case discussed in this article is particularly interesting in the sense that it almost certainly involves viruses (adware or spyware) installed and remotely controlled on thousands of computers. Two types of viruses are currently active:
  • The first type actually triggers Internet Explorer and is best described in Google's paper. It is an Internet Explorer parasite. This type of virus is easier to detect as it generates too many clicks per user.

  • The second type of hitbot does not rely on Internet Explorer to trigger clicks. Instead, it has its own code to communicate using the HTTP protocol. This type of virus, more widespread than the previous, is more difficult to detect. Yet, as it relies on user agent lookup tables to generate clicks, Authenticlick has been able to identify this type of fraudulent activity, as criminals (so far) have not been able to correctly replicate the expected underlying multivariate distributions. Also note that we have developed a patented solution to catch this type of fraud.

Sunday, April 15, 2007

How Can Advertisers Benefit from Click Scoring?

Since click fraud detection is a rudimentary application of click scoring, one tends to think of click scoring as a tool to eliminate unqualified traffic. Click scoring can actually do much more, such as determine the optimum price associated with a click, identify new sources of potentially converting traffic, measure traffic quality in the absence of conversions or in the presence of bogus conversions, and assess the quality of distribution partners, to name a few applications. Also note that scoring is not limited to clicks but can also involve impressions and metrics such as clicks per impression.

From the advertiser viewpoint, one important application of click scoring is to detect new sources of traffic to improve total revenue, in a way that cannot be accomplished through A/B/C testing, traditional ROI optimization or SEO. The idea consists of tapping into carefully selected new traffic sources rather than improving existing ones.

Let us consider a framework where we have two types of scores:

  • Score I: generic score computed using a pool of advertisers, possibly dozens of advertisers from the same category.
  • Score II: customized score specific to a particular advertiser.

What can we do when we combine these two scores? Here's the solution:

  1. Scores I and II are good. This is usually one of the two traffic segments that advertisers are considering. Typically advertisers focus their efforts on SEO or A/B testing to further refine the quality and gain a little edge.
  2. Score I is good and score II is bad. This traffic is usually rejected. No effort is made to understand why the good traffic is not converting. Advertisers rejecting this traffic might miss major sources of revenue.
  3. Score I is bad and score II is good. This is the other traffic segment that advertisers are considering. Unfortunately, this situation makes advertisers happy: they are getting conversions. However, this is a red flag, indicating that the conversions might be bogus. This happens frequently when conversions consist of filling out web forms. Any attempt to improve conversions (e.g. through SEO) is counter-productive. Instead, the traffic should be seriously investigated.
  4. Scores I and II are bad. Here, most of the time, the reaction consists of dropping the traffic source entirely and permanently. Again, this is a bad approach. By reducing the traffic using a schedule based on click scores, one can significantly lower exposure to bad traffic and at the same time not miss the opportunity when the traffic quality improves.
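
The four cases above amount to a small decision table. A minimal sketch follows, assuming each score has already been reduced to a good/bad flag (the article does not say where that cut-off lies); the function name `segment_action` and the wording of the returned actions are illustrative, not part of any actual system.

```python
def segment_action(generic_good: bool, custom_good: bool) -> str:
    """Map the good/bad status of Score I (generic) and Score II (custom) to an action."""
    if generic_good and custom_good:
        # Case 1: the segment advertisers already focus on
        return "keep: refine with SEO or A/B testing"
    if generic_good and not custom_good:
        # Case 2: usually rejected, but may hide a major source of revenue
        return "investigate: good traffic that is not converting"
    if not generic_good and custom_good:
        # Case 3: conversions on generically bad traffic are a red flag
        return "red flag: check for bogus conversions"
    # Case 4: reduce exposure instead of dropping the source permanently
    return "throttle: reduce on a score-based schedule and keep monitoring"


for generic_good in (True, False):
    for custom_good in (True, False):
        print(generic_good, custom_good, "->", segment_action(generic_good, custom_good))
```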

This discussion illustrates how scoring can help advertisers substantially improve their revenue.

Case Study
We have applied this concept to optimize the traffic on a partner website, where conversions consist of filling out a web form to subscribe to a newsletter.

  • One source representing 25% of the traffic was producing negative results, even though the scores were very high. After investigating the case, we realized that the landing page was not targeted at the user segment in question. After modifying the content to better target these users, the website experienced a substantial increase in page views and visit depth, and higher revenue. Eventually we decided to increase this source to 50% of the total traffic.
  • Another source represented 2% of the paid clicks but 30% of the conversions from a major network. After investigation, all conversions originating from this source (most of them bogus) were discarded, but the source continued to be monitored. Without this discovery, the website would have been sending newsletters to thousands of people who never actually subscribed, without knowing it (until complaints arrived).
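
The second case suggests a simple screening check: compare each source's share of conversions with its share of paid clicks. The sketch below is only an illustration of that idea. The function name, the `max_ratio` cut-off of 5 and the comparison source `source_b` are assumptions, not values from the case study.

```python
def flag_suspect_sources(sources, max_ratio=5.0):
    """Return sources whose conversion share exceeds max_ratio times their click share.

    sources maps a source name to (click_share, conversion_share), both as fractions.
    max_ratio is a hypothetical cut-off chosen for illustration.
    """
    flagged = []
    for name, (click_share, conversion_share) in sources.items():
        if click_share > 0 and conversion_share / click_share > max_ratio:
            flagged.append(name)
    return flagged


sources = {
    "source_a": (0.02, 0.30),  # the case above: 2% of paid clicks, 30% of conversions
    "source_b": (0.25, 0.20),  # hypothetical source, added for comparison only
}
print(flag_suspect_sources(sources))
```

A flagged source is a candidate for investigation, not proof of fraud: as the first case shows, a mismatch between scores and results can also come from the landing page.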

Comparing Click Scores with Conversions: Goodness of Fit





Comments:
  • Overall good fit
  • Peaks could mean:

    1. Bogus conversions
    2. Residual noise
    3. Model needs improvement (e.g. incorporate anti-rules)

  • Valleys could mean:

    1. Undetected conversions
    2. Residual noise
    3. Model needs improvement

Typical Click Score Distribution





Comments:
  • Reverse bell curve
  • Scores below 425 correspond to clicks that are clearly unbillable
  • Spike at the very bottom and very top
  • 50% of the traffic has good scores
  • In this scorecard, a drop of 50 points represents a 50% drop in conversion rate: clicks with a score of 700 convert twice as frequently as clicks with a score of 650.
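
The last comment can be turned into a small formula. The sketch below assumes the halving applies uniformly across the scale, so that every 50-point step divides the conversion rate by two; the article states this only for the 700 versus 650 example, so the extrapolation to other score gaps is an assumption, as is the function name.

```python
def relative_conversion_rate(score, reference_score, halving_points=50.0):
    """Conversion rate at `score` relative to `reference_score`.

    Assumes each drop of `halving_points` halves the conversion rate.
    """
    return 2.0 ** ((score - reference_score) / halving_points)


print(relative_conversion_rate(700, 650))  # 700 versus 650, the example in the text
print(relative_conversion_rate(600, 650))  # one 50-point step below the reference
```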

Wednesday, March 21, 2007

Click Fraud: New Definition and Methodology to Assess Generic Traffic Quality

1. What is click fraud?

Click fraud is usually defined as the act of purposely clicking on ads on pay-per-click programs with no interest in the target web site. Two types of fraud are usually mentioned:

  • An advertiser clicking on competitor ads to deplete their ad spend budgets, with fraud frequently taking place early in the morning and through multiple distribution partners: AOL, Ask.com, MSN, Google, Yahoo, etc.
  • A malicious distribution partner trying to increase its income, using clickbots or paid human beings to generate traffic that looks like genuine clicks.

While these are two important sources of non-converting traffic, there are many other sources of poor traffic. Some of them are sometimes referred to as invalid clicks rather than click fraud, but from the advertiser or publisher viewpoint, there is no difference. In this paper, we are considering all types of non billable or partially billable traffic, whether it is the result of fraud or not, whether there is or there is no intent to defraud, and whether there is or there is not a financial incentive to generate the traffic in question. These sources of undesirable traffic include:

  • Accidental fraud: a home-made robot not designed for click fraud purposes, running loose, out of control, clicking on every link, possibly because of a design flaw. An example is a robot run by spammers harvesting email addresses. This robot was not designed for click fraud purposes, but nevertheless ended up costing advertisers money.
  • Political activists: people with no financial incentives, but motivated by hate. This kind of clicking activity has been found against companies recruiting people in class action lawsuits, and results in artificial clicks and bogus conversions. It is a pernicious kind of click fraud because the victim thinks its PPC campaigns generate many leads, while in reality most of these leads (email addresses) are bogus.
  • Disgruntled individuals: it could be an employee working for a PPC advertiser or a search engine, who was recently fired. Or it could be a publisher who believes they were unjustifiably banned.
  • Unethical guys in the PPC community: small search engines trying to make their competitors look bad by generating unqualified clicks, or shareholder fraud.
  • Organized criminals: spammers and other internet pirates used to running bots and viruses, who found that their devices could be programmed to generate click fraud. Terrorism funding falls into this category, and is investigated by both the FBI and the SEC.
  • Hackers: many people now have access to home-made web robots (the source code in Perl or Java is available for free). While it is easy to fabricate traffic with a robot, it is more complicated to emulate legitimate traffic as it requires spoofing thousands of ordinary IP addresses – not something any amateur can do well. Some individuals might see this as a challenge and generate high quality emulated traffic, just for the sake of it, with no financial incentives.
  • Traditional media losing market share to PPC advertising have an incentive to contribute to click fraud.

In this paper, we will be even more general by encompassing other sources of problems not generally labeled as click fraud, but sometimes referred to as invalid, non-billable, or low-quality clicks. This includes

  • Impression fraud: impressions and clicks should always be considered jointly, not separately. This can be an issue for search engines, as they need to join very large databases and match users with both impressions and clicks. In some schemes, fraudulent impressions are generated to make a competitor’s CTR look low. Advanced schemes use good proxy servers (e.g. AOL) to hide the activity. When the CTR drops low enough, the competitor ad is not displayed anymore. This scheme is usually associated with self-clicking, a practice where an advertiser clicks on its own ads through proxy servers to improve its ranking, and thus improve its position in search result pages. This scheme targets both paid and organic traffic.
  • Multiple clicks: while multiple clicks are not necessarily fraudulent, they end up either (i) costing lots of money to advertisers when they are billed at the full price or (ii) costing lots of money to publishers and search engines if only the first click is charged for. Another issue is how to accurately determine that two clicks – say five minutes apart – are attached to the same user.
  • Fictitious fraud: clicks that appear as fraudulent, but are never charged for. These clicks can be made up by unethical click fraud companies. Or they can be the result of testing campaigns, and we call them click noise. A typical example is Googlebot. While Google never charges for clicks originating from its Googlebot robot, other search engines that do not have the most updated list of Googlebot IP addresses might accidentally charge for these clicks. Another example of fictitious fraud further discussed in this paper is fictitious clicks. We explain what fictitious clicks are and how they can be detected.

2. A Black and White Universe, or is it Grey?

Our experience has shown that web traffic isn’t black or white, and that there is a whole range from low quality to great traffic. Also non converting traffic might not necessarily be bad, and in many cases can actually be very good. Lack of conversions might be due to poor ads, or poorly targeted ads. This raises two points:

  • Traffic scoring: while as much as 5% of the traffic from any source can be easily and immediately identified as totally unbillable, with no chance of ever converting, a much larger portion of the traffic has generic quality issues – issues that are not specific to a particular advertiser. A traffic scoring approach (click or impression scoring) provides a much more actionable mechanism both for search engines interested in ranking distribution partners, and for advertisers refining their ad campaigns.
  • A generic, universal scoring approach allows advertisers with limited or no ROI metrics to test new sources of traffic, knowing beforehand where the generically good traffic is, regardless of conversions. This can help advertisers substantially increase their reach and tap into new traffic sources, as opposed to obtaining very small ROI improvements from A/B testing. Advertisers who convert offline, are victims of bogus conversions, or are interested in branding will find click scores most valuable.

A scoring approach can help search engines determine the optimum price for multiple clicks (here I mean true user-generated multiple clicks, not a double click that results from a technical glitch). By incorporating the score in their smart pricing algorithm, they can reduce the loss due to the simplified business rule “one click per ad per user per day”.
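
For reference, the simplified business rule “one click per ad per user per day” can be sketched as follows. This is only an illustration: the tuple layout and the function name are assumptions, and the sketch takes a reliable user identifier for granted, which is exactly the hard part in practice.

```python
from datetime import datetime


def billable_clicks(clicks):
    """Keep only the first click per (user, ad, calendar day).

    clicks is an iterable of (user_id, ad_id, timestamp) tuples.
    """
    seen = set()
    billed = []
    # Process clicks in time order so the earliest click of each day is kept
    for user_id, ad_id, when in sorted(clicks, key=lambda click: click[2]):
        key = (user_id, ad_id, when.date())
        if key not in seen:
            seen.add(key)
            billed.append((user_id, ad_id, when))
    return billed


clicks = [
    ("user_1", "ad_9", datetime(2007, 3, 21, 9, 0)),
    ("user_1", "ad_9", datetime(2007, 3, 21, 9, 5)),  # same user, ad and day: not billed
    ("user_1", "ad_9", datetime(2007, 3, 22, 8, 0)),  # next day: billed again
]
print(len(billable_clicks(clicks)))
```

Under this rule the second click is priced at zero regardless of its quality, which is the loss that score-based pricing aims to reduce.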

Search engines, publishers and advertisers can all win, as poor quality publishers can now be accepted in a network, but are priced correctly so that the advertiser still has a positive ROI. And good publishers experiencing a drop in quality can have their commission lowered according to click scores, rather than being discontinued outright. When their traffic gets better, their commission increases accordingly, based on scores.

In order to make sense for search engines, a scoring system needs to be as generic as possible. The scores that we have developed meet this criterion. Our click scores have been designed to match the conversion rate distribution, using very generic conversions, taking into account bogus conversions, and based on patent-pending methodology to match a conversion with a click, through correct user identification. As everybody knows, an IP can have multiple users attached to it, and a single user can have multiple IP addresses within a two minute period. Cookies (particularly in server logs, less so in redirect logs) also have notorious flaws, and we do not rely on cookies when dealing with advertiser server log data.

We have designed scores based on click logs, relying, among others, on network topology metrics. We also have designed scores based on advertiser server logs, also relying on network topology metrics (distribution partners, unique browsers per IP cluster, etc.) and even on impression-to-click ratio and other search engine metrics, as we reconcile server logs with search engine reports to get the most accurate picture. Using search engine metrics to score advertiser traffic allows us to design good scores for search engine data, and the other way around, as search engine scores are correlated with true conversions. It also makes us one of the very few third party traffic scoring companies serving both sides equally well.

When dealing with advertiser server logs, the reconciliation process and the use of appropriate tags (e.g. Google’s gclid) whenever possible, allow us to not count clicks that are an artifact of browser technology. We have actually submitted a patent to eliminate what is called “fictitious clicks” by Google, and more generally, to eliminate clicks from clickbots.

Advertiser scores are designed to be a good indicator of conversion rate. Search engine scores use a combination of weights based both on expert knowledge and advertiser data. Scores have been smoothed and standardized using the same methodology used for credit card scoring. The best quality assessment systems will rely on both our real-time scores and our less granular scores, such as end-of-day scores.

The use of a smooth score, based on solid metrics, substantially reduces false positives. If a single rule is triggered, or even two rules are triggered, it might barely penalize the click. Also, if a rule is triggered by too many clicks or is not correlated with true conversions, it is ignored. For instance, a rule formerly known as “double click” (with enough time between the two clicks) has been found to be a good indicator of conversion, and was changed from a rule into an anti-rule in our system, whenever the correlation is positive. A click with no external referral but otherwise normal will not be penalized, after score standardization.
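
The handling of rules and anti-rules described above can be illustrated with a toy example. Everything in this sketch is hypothetical: the rule names, weights, trigger rates, correlations and thresholds are invented for illustration, and the smoothing and standardization steps are left out entirely. It is not the scoring system itself.

```python
def rule_contribution(weight, trigger_rate, corr, max_rate=0.5, min_abs_corr=0.05):
    """Contribution of one triggered rule to a click's raw score.

    A rule is ignored when too many clicks trigger it or when it is essentially
    uncorrelated with true conversions. A positive correlation turns it into an
    anti-rule that raises the score; a negative one makes it a penalty.
    """
    if trigger_rate > max_rate or abs(corr) < min_abs_corr:
        return 0.0
    return weight if corr > 0 else -weight


# Hypothetical rule table, for illustration only
RULES = {
    "no_referrer": {"weight": 1.0, "trigger_rate": 0.7, "corr": -0.1},       # too common: ignored
    "double_click": {"weight": 2.0, "trigger_rate": 0.1, "corr": 0.25},      # anti-rule
    "known_bad_proxy": {"weight": 4.0, "trigger_rate": 0.02, "corr": -0.5},  # penalty
}


def raw_score(triggered, rules=RULES):
    """Sum the contributions of all rules triggered by one click."""
    return sum(rule_contribution(**rules[name]) for name in triggered)


print(raw_score(["no_referrer"]))
print(raw_score(["double_click", "known_bad_proxy"]))
```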

3. Mathematical Model

The scoring methodology developed by Authenticlick is state-of-the-art. It is based on almost 30 years of experience in auditing, statistics and fraud detection, both in real-time and on historical data. Several patents are currently pending.

It combines sophisticated cross-validation, design of experiments, linkage and unsupervised clustering to find new rules, machine learning, and the most advanced models ever used in scoring, with a parallel implementation and fast, robust algorithms to produce at once a large number of small overlapping decision trees. The clustering algorithm is a hybrid combination of unique decision-tree technology with a new type of PLS logistic stepwise regression to handle tens of thousands of highly redundant metrics. It provides meaningful regression coefficients computed in a very short amount of time, and efficiently handles interaction between rules.

Some aspects of the methodology show limited similarities with ridge regression, tree bagging and tree boosting. Below we compare the efficiency of different systems to detect click fraud on highly realistic simulated data. The criterion for comparison is the mean square error, a metric that measures the fit between scored clicks and conversions:

  • Scoring system with identical weights: 60% improvement over binary (fraud / non fraud) approach
  • First-order PLS regression: 113% improvement over binary approach
  • Full standard regression (not recommended as it provides highly unstable and non-interpretable results): 157% improvement over binary approach
  • Second-order PLS regression: 197% improvement over binary approach, easy interpretation and robust, nearly parameter-free technique

Substantial additional improvement is achieved when the decision trees component is added to the mix. Improvement rates on real data are similar.

4. Bogus Conversions

We elaborate a bit on bogus conversions because their impact is worse than most people think. If not taken care of, they can seriously bias a fraud detection system. Search engines that rely on pre-sales or non-sales conversions such as sign-up forms to assess traffic performance can be misled into thinking that some traffic is good when it actually is poor, and the other way around.

Usually, the advertiser is not willing to provide too much information to the search engine, and thus conversions are generally computed as a result of the advertiser placing some JavaScript code or a clear gif on target conversion pages. The search engine is then able to track conversions on these pages. However, the search engine has no control over which “converting pages” the advertiser wants to track. Also, the search engine has no visibility into what is happening between the click and the conversion, or after the conversion. If the search engine has access to pre-sale data only, the risk of bogus conversions is high. We have actually noticed a significant increase in bogus conversions from some specific traffic segments.

Another issue with bogus conversions is when an advertiser (let’s call it an ad broker) purchases traffic upstream, and then acts as a search engine and distributes the traffic downstream to other advertisers. This business model is widespread. If the traffic upstream is artificial but results in many bogus conversions – a conversion being a click or lead delivered downstream – the ad broker does not see a drop in ROI. She might actually see an increase in ROI. Only the advertisers downstream start to complain. By the time the problem starts being addressed, it might be too late, and it can cause the ad broker to lose clients. Had the ad broker used a scoring system such as ours, the bogus conversions would have been detected early, even if the ROI was unchanged.

This business flaw can be exploited by criminals running a network of distribution partners. Smart criminals will hit this type of “ad broker” advertisers harder: the criminals can generate bogus clicks to make money themselves, and as long as they generate a decent amount of bogus conversions, the victim is making money too and might not notice the scheme. If the conversions are tracked by the upstream search engine (where the traffic originates), the clicks might erroneously be considered very good.

5. A Few Misconceptions

It has been argued that the victims of click fraud are good publishers, not advertisers, as advertisers automatically adjust their bids. However, this does not apply to advertisers lacking good conversion metrics (e.g. if conversion takes place offline), nor to smaller advertisers who do not update bids and keywords in real time. It can actually lead advertisers to permanently eliminate whole traffic segments, and miss out on the good ROI once the fraud problem gets fixed on the network. On some 2nd-tier networks, impression fraud can lead to an advertiser being kicked out one day, without the ability to ever come back. Both the search engine and the advertiser lose in this case, and the ones who win are the bad guys now displaying cheesy, irrelevant ads on the network. The website user loses too, as all good ads have been replaced with irrelevant material.

Another point that we sometimes hear is that 3rd party auditors do not have access to the right data. In fact, auditors with large volumes of traffic can not only track network flows just as search engines do, but also have access to more comprehensive conversion data, and are better equipped to detect bogus conversions. In our case, we process search engine and advertiser data: large volumes of data in both cases. However, some auditing firms lacking statistical expertise and/or domain knowledge have had serious flaws in their counting methodology. These flaws have been highly publicized by Google, and overestimated. Due to “fictitious clicks”, 1,000 clicks are on average reported as 1,400 clicks by some auditing firms, according to a well known source. The 400 extra “non-clicks” or “fictitious clicks” (they never really existed) are said to come from users clicking on the back button of their browser. It is well known that most visits are just one page long, and content displayed by back-clicking is usually served from the browser cache, so it generates no new entry in the advertiser server logs. Thus this 1,400 / 1,000 ratio does not make sense. We believe that the issue is of a different nature, such as counting all HTTP requests associated with one page, since the click tags can be attached to all requests, depending on server configuration. It is also an issue that we addressed long ago.
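
The suspected over-counting mechanism can be illustrated with a toy server log. The sketch below assumes each click carries its own tag value (such as a gclid) and that the tag ends up on every HTTP request for the landing page; the log entries and function name are invented for illustration.

```python
def count_clicks(requests):
    """Return (tagged_requests, distinct_clicks) for a list of (click_tag, path) pairs.

    Counting every tagged request over-counts clicks when the tag is attached
    to all requests for a page; counting distinct tag values does not.
    """
    tags = [tag for tag, _path in requests if tag]
    return len(tags), len(set(tags))


# Hypothetical log: one click whose tag is repeated on the page's sub-requests
requests = [
    ("tag_abc", "/landing.html"),
    ("tag_abc", "/style.css"),
    ("tag_abc", "/logo.gif"),
    ("tag_xyz", "/landing.html"),
    (None, "/about.html"),  # untagged request, not a paid click
]
print(count_clicks(requests))
```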

Auditing firms performing good quality reconciliation also have access to many metrics typically used by fraud detection systems for search engines: average ad position, bid, impression-to-click ratio, etc.

Finally, many systems to detect fraud are still essentially based on outlier detection and detecting shifts from the average. Based on our experience in the credit card fraud industry, we know that most fraudsters try very hard to look as average as possible, avoiding expensive or cheap clicks, using the right distribution of user agents, and generating a small random number of clicks per infected computer per day, except possibly for clicks going through AOL or other proxies. This type of fraud needs a truly multivariate approach, looking at billions of combinations of several carefully selected variables simultaneously, looking for statistical evidence in billions of tiny click segments, to unearth the more sophisticated fraud cases impacting large volumes of clicks, possibly orchestrated by terrorists or large corrupt financial institutions rather than distribution partners.