Multiple scoring systems (based, for example, on IP and click scores), scorecards, and metric mix optimization are the basic ingredients. Because of the vast amount of data and the potentially millions of metrics used in a good scoring system, combinatorial optimization is required, using algorithms such as Markov Chain Monte Carlo or simulated annealing.
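To make the combinatorial search concrete, here is a minimal sketch of simulated annealing applied to choosing a metric mix. It is an illustration under stated assumptions, not the method of any particular scoring system: the function names, the toy objective, the per-metric gains and the cost per metric are all hypothetical, and a real system would score a candidate mix against held-out click data.

```python
import math
import random

def simulated_annealing(score, n_metrics, n_steps=5000,
                        t_start=1.0, t_end=0.01, seed=0):
    """Search for a subset of metrics (a True/False vector) that maximizes score()."""
    rng = random.Random(seed)
    current = [rng.random() < 0.5 for _ in range(n_metrics)]
    current_score = score(current)
    best, best_score = current[:], current_score
    for step in range(n_steps):
        # Geometric cooling schedule from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / (n_steps - 1))
        # Propose a neighbor: move one metric in or out of the mix.
        candidate = current[:]
        i = rng.randrange(n_metrics)
        candidate[i] = not candidate[i]
        delta = score(candidate) - current_score
        # Always accept improvements; accept a worse mix with
        # probability exp(delta / temp), which shrinks as temp drops.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, current_score + delta
            if current_score > best_score:
                best, best_score = current[:], current_score
    return best, score(best)

# Hypothetical toy objective: each metric adds some predictive gain,
# but every metric kept in the mix carries a fixed complexity cost.
GAINS = [0.9, 0.8, 0.1, 0.7, 0.05, 0.6]
COST_PER_METRIC = 0.3

def toy_score(subset):
    return sum(g - COST_PER_METRIC for g, used in zip(GAINS, subset) if used)

best_subset, best_value = simulated_annealing(toy_score, len(GAINS))
```

On this toy objective the search keeps only the metrics whose gain exceeds the cost. With millions of metrics, exhaustive enumeration is impossible, which is why a stochastic search of this kind is used instead.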
While scoring advertiser data can be viewed as a regression problem, with the conversion metric as the dependent variable, scoring search engine data is more challenging because conversion data is not readily available. Even when dealing with advertiser data, we have several issues to address. First, the scores need to be standardized: two identical ad campaigns might perform very differently if their landing pages are different, and the scoring system needs to account for this.
Also, while scoring can be viewed as a regression problem, it is a very difficult one. First, the metrics involved are usually highly correlated, making the problem ill-conditioned from a mathematical viewpoint. Second, there might be more metrics (and thus more regression coefficients) than observed clicks, making the regression approach highly unstable. Finally, the regression coefficients, also referred to as weights, must be constrained to take only a few potential values. Because the dependent variable is binary, we are dealing with a sophisticated ridge logistic regression problem.
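The effect of the ridge penalty on correlated metrics can be illustrated with a small sketch. It fits an L2-penalized logistic regression by plain gradient descent on a tiny made-up data set in which two metrics are exact copies of each other, the extreme case of correlation. The function names, the data, the penalty strength lam and the learning rate are assumptions chosen for illustration; the constraint that weights take only a few values is not modeled here.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_ridge_logistic(X, y, lam=1.0, lr=0.1, n_iter=2000):
    """Minimize log-loss + (lam / 2) * ||w||^2 by gradient descent.

    The intercept b is not penalized.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b = 0.0
    for _ in range(n_iter):
        grad_w = [lam * wj for wj in w]  # gradient of the ridge penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(p):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Made-up data: the second metric duplicates the first; y is the
# binary conversion flag.
X = [[0.1, 0.1], [0.4, 0.4], [0.5, 0.5], [0.9, 0.9], [0.2, 0.2], [0.8, 0.8]]
y = [0, 0, 1, 1, 0, 1]

w_mild, b_mild = fit_ridge_logistic(X, y, lam=1.0)
w_strong, b_strong = fit_ridge_logistic(X, y, lam=10.0)
```

Without a penalty, any split of the weight between the two duplicated metrics fits equally well, so the coefficients are not identified. The ridge term resolves the tie by sharing the weight equally, and a larger lam shrinks both weights toward zero.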
The best technology will actually rely on a hybrid system that can handle contrarian configurations, such as "time < 4am" is bad and "country not US" is bad, but "time < 4am and country = UK" is good. Good cross-validation is also critical to eliminate configurations and metrics with no statistical significance or poor robustness. Careful metric binning and a fast, distributed feature optimization algorithm are important as well.
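The contrarian configuration quoted above can be written as a toy rule set. This is only a sketch: the function name and the point values are hypothetical, and the only purpose is to show that a rule on the combination of two metrics can reverse what each metric says on its own, something a purely additive score over single metrics cannot express.

```python
def score_click(hour, country):
    """Return a toy quality score for a click; higher is better.

    All point values are made up for illustration.
    """
    score = 0
    if hour < 4:
        score -= 1  # single-metric rule: "time < 4am" is bad
    if country != "US":
        score -= 1  # single-metric rule: "country not US" is bad
    if hour < 4 and country == "UK":
        score += 3  # combined rule: this configuration is good
    return score

early_uk = score_click(2, "UK")    # both single rules fire, combined rule wins
early_us = score_click(2, "US")    # only the time rule fires
midday_uk = score_click(12, "UK")  # only the country rule fires
midday_us = score_click(12, "US")  # no rule fires
```

Here the early-morning UK click scores higher than any other configuration, even though it triggers both negative single-metric rules.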
Finally, design of experiments to create test campaigns (some with a high proportion of fraud and some with no fraud), as well as the use of generic conversions and proper user identification, is critical. And let's not forget that failing to remove bogus conversions will result in a biased system with many false positives.