Latest SEP (Search Engine Poisoning) Research, Part 3
[This is part three of a series of blog posts providing some of the backstory for my RSA presentation on Search Engine Poisoning. There was a lot of material that simply wouldn't fit into a 45-minute presentation...]
WHO'S THE SAFEST SEARCH ENGINE?
So, late last summer, seeing that SEP was such a dominant attack vector, we talked about what sort of research focus would make for an interesting presentation at RSA. One intriguing idea was to compare several leading search engines and see how well each kept SEP links out of its results for the same search. This was based on an idea we'd explored back in 2009 but decided not to pursue at the time.
To give you the flavor of that old research, I went back to my e-mail archives and found the following table where I'd summarized the idea and included a few examples. In each case, the same query was typed into each search engine, and the number of SEP links in the Top 10 results was counted. (Again, this is from old data, so its only current value is to provide some historical perspective.)
|Safe, hobby-type query||1||5||2||6|
|Adult escort query||2||4||8||10|
|Non-English porn query||10||10||10||3 (out of 3)|
TESTING (WHICH ENGINES?):
The most basic research question was to ask if Google was still measurably superior to Bing and Yahoo at SEP filtering, and if Baidu still trailed the "big three". I also decided to expand the number of search engines involved, to make things more interesting. (Especially since Bing and Yahoo have combined their efforts, and are now, for all intents and purposes, the same engine.)
Yandex, the big Russian search engine, was added as another representative example of a non-English engine, and Google's .HK and .RU versions were added for direct comparison with Baidu and Yandex. Rounding out the list was one you may not have heard of: k9safesearch.com, which requires a bit of explanation...
The target audience for our free Web Filter, K9, was originally families who wanted to keep their kids safe from inappropriate content on the Web. One long-time concern was Image Search, where even with "safe search" turned on, it is distressingly common (from a parent's point of view) to have adult images show up for innocent search terms. So the K9 team decided to build their own search engine, with the goal of making the world's safest image search. Behind the scenes, the basic idea is to post-filter the search results against our database of site ratings, removing links to sites rated Adult, Pornography, Violence/Hate/Racism, and so forth, before displaying the results. Consequently, I was curious to see how it would handle SEP links, even though that wasn't what it was designed for.
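To make the post-filtering idea concrete, here is a minimal sketch in Python. It is not the K9 team's code: the `SITE_RATINGS` dictionary, the `.test` hostnames, and the `post_filter` function are all made up for illustration, and I'm assuming that a site with no rating is allowed through, which the real service may well handle differently.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for the site-ratings database: hostname -> categories.
# The real database and its lookup interface are not described in this post.
SITE_RATINGS = {
    "example-adult.test": {"Adult"},
    "example-hate.test": {"Violence/Hate/Racism"},
    "example-hobby.test": {"Hobbies"},
}

# Only the categories named above; the real blocked list is longer.
BLOCKED_CATEGORIES = {"Adult", "Pornography", "Violence/Hate/Racism"}


def post_filter(result_urls):
    """Return the result URLs whose host has no blocked category, in order."""
    kept = []
    for url in result_urls:
        host = (urlparse(url).hostname or "").lower()
        # Assumption: an unrated host has no categories and is kept.
        categories = SITE_RATINGS.get(host, set())
        if categories.isdisjoint(BLOCKED_CATEGORIES):
            kept.append(url)
    return kept


results = [
    "http://example-hobby.test/stencils",
    "http://example-adult.test/gallery",
    "http://unrated.test/page",
]
print(post_filter(results))
# prints ['http://example-hobby.test/stencils', 'http://unrated.test/page']
```

The point of filtering after the search, rather than at indexing time, is that the ranking stays whatever the underlying engine produced; the filter only removes entries.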
(Also of interest: Toward the end of my testing, I remembered blekko.com. Since I was already wrapping up research on the "safest search engine" question, and moving toward researching the other questions, I didn't take time to go back and re-run all the tests formally, but did do some informal testing. Conclusions below.)
In addition to an expanded list of search engines, the other major differences from the 2009 preliminary research were a much larger test set of search terms this time around, and a different source for those search terms.
In the 2009 experiments, the search terms were gathered from "link-spam" sites. These are areas where user-generated content can be posted to sites: guest books, comment pages, and the like. The SEP guys have their bots spam links into these areas, hoping to attract the attention of the search engines, so that the links will be followed and their bogus content pages will be indexed. Since the links typically include a few key words, we could simply grab those keywords and use them as the queries in the search engines. (The basic theory being, "the Bad Guys are targeting these search terms; are they being successful?")
The big problems with this approach were that there was no guarantee that a search engine would actually find and index the SEP target page, or that a user would later search for those same search terms and find that page. It turned out to be a rather labor-intensive approach, which is the main reason we never did a formal research project, just the preliminary stage.
This time around, I decided to focus instead on the WebPulse malnet logs; specifically, on entries where we knew it was an SEP attack, and could see the search terms that the user had entered prior to clicking on the SEP link. The advantage of this approach is that every such entry represents a "successful" SEP attack -- that is, a user searched for the terms that the SEP gang had targeted, saw an SEP link in the top results returned, decided that it looked legitimate, and clicked it. (It wasn't a truly successful attack, of course, since we detected and blocked it, but that's not the fault of the Bad Guys....)
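For readers wondering how search terms can be recovered from a log entry at all, one common way is to read them out of the referring search URL. The sketch below assumes each log entry carries such a referrer; I haven't described the WebPulse log format here, so treat the whole thing as hypothetical. The `QUERY_PARAMS` table and the `search_terms` function are my own illustrative names, and the parameter names reflect the usual conventions of each engine rather than anything taken from our logs.

```python
from urllib.parse import urlparse, parse_qs

# Assumed mapping: search-engine domain -> query-string parameter holding the terms.
QUERY_PARAMS = {
    "google.com": "q",
    "bing.com": "q",
    "search.yahoo.com": "p",
    "baidu.com": "wd",
    "yandex.ru": "text",
}


def search_terms(referrer):
    """Return the search terms in a search-engine referrer URL, or None."""
    parsed = urlparse(referrer)
    host = (parsed.hostname or "").lower()
    for domain, param in QUERY_PARAMS.items():
        # Match the domain itself or any subdomain of it (e.g. www.google.com).
        if host == domain or host.endswith("." + domain):
            values = parse_qs(parsed.query).get(param)
            return values[0] if values else None
    return None  # not a recognized search engine


print(search_terms("http://www.google.com/search?q=panera+bread+prices"))
# prints panera bread prices
```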
Okay, that's enough background. Let's look at some selected results, and see who the Safest Search Engine really was, as of October-November 2011, when the research was done. (Note that numbers with an asterisk or tilde are explained further below.)
|Query||Google||Bing/Yahoo||Baidu||Yandex||Google .HK||Google .RU||K9safesearch|
|"windows 7 ultimate download"||1*||0*||0*||0*||2*||0*||0*|
|"panera bread prices"||0||~1||2||~1||~1||~1||~1|
|"mario brothers pumpkin stencils"||1||0||9||5||2||0||0|
|[porn star name]||1||1*||0||3||0||1||0|
|"sheneneh jenkins for halloween"||1||1||0||0||2||1||0|
|"funny curling team names"||4||3||6||10||5||5||1|
|"air force tax advisor policy letter"||2||3||2||8||4||2||0|
|"halimbawa ng batutian"||8||9||9||5||7||7||8|
Yay! k9safesearch.com is the winner!
Well, my conscience is bothering me a bit. You see, I cheated...
The very first day of the research, I noticed that many of the SEP links showing up in all of the search engine results were links hosted on "Dynamic DNS" hosts. So I told the K9 team, "Hey guys, if you just strip all DynDNS links out of your results, you'll remove a lot of SEP attacks, and you won't really be hurting your results, since there isn't a lot of high-quality content on most of those sites."
They took my advice, rolled out a quick update, and that's the one I used in all of the testing. So, sorry about cheating, but since I publicly announced at RSA what we had done (in case anyone from Google or Microsoft was in the audience), and I'm publicly disclosing it here as a recommendation for them to consider, I think we're square.
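For the curious, the kind of DynDNS filter I suggested can be sketched in a few lines of Python. This is my own illustration, not the update the K9 team shipped: the suffix list is a tiny placeholder (a real one would be much longer), and `is_dynamic_dns` and `strip_dynamic_dns` are hypothetical names.

```python
from urllib.parse import urlparse

# Illustrative suffixes only; not the list K9 actually used.
DYNAMIC_DNS_SUFFIXES = ("dyndns.org", "example-ddns.test")


def is_dynamic_dns(url):
    """True if the URL's host is a dynamic DNS domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == suffix or host.endswith("." + suffix)
        for suffix in DYNAMIC_DNS_SUFFIXES
    )


def strip_dynamic_dns(result_urls):
    """Drop dynamic DNS links from a result list, preserving order."""
    return [url for url in result_urls if not is_dynamic_dns(url)]


sample = [
    "http://www.example.test/recipes",
    "http://cheap-stuff.dyndns.org/page.php?q=pumpkin+stencils",
    "http://notdyndns.org/about",
]
print(strip_dynamic_dns(sample))
# prints ['http://www.example.test/recipes', 'http://notdyndns.org/about']
```

Matching on a leading dot before the suffix keeps look-alike domains such as notdyndns.org from being dropped by accident.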
SUMMARY / OBSERVATIONS:
- Bing/Yahoo has caught up with Google. I consider them to be "equivalently safe". (Blekko also seemed to be a solid choice, although I hadn't remembered it in time to include it in testing from the beginning.)
- Baidu and Yandex, on the other hand, have not caught up. They consistently allow a higher number of SEP links to show up in their pages.
- Curiously, Google's own Chinese and Russian versions did slightly (but consistently) worse than the English version.
- Non-English SEP attacks consistently placed a higher number of SEP links in the Top 10 results than English SEP attacks.
(Considering all those points together, it's pretty clear that the algorithms used to filter out malicious/suspicious pages are simply more mature and effective for English language content.)
- "Shady content" searches (pirated software, porn star) did not return higher numbers of SEP links than "safe content" searches. In fact, the opposite was true.
(Note: The numbers marked with an asterisk for the "windows 7 ultimate download" search indicate that I was not counting "hacking/warez" type sites in the results as SEP-type links. You expect to get those kinds of sites in such a search. Non-zero results on this search are for actual malicious/suspicious SEP-type links. However, Bing's "1*" result in the [porn star name] search was more of a porn site than an SEP site; I decided to count it as a failure since the "safe search" flag was on, and no porn sites should have shown up.)
(Note: The tilde-flagged numbers on the "panera bread prices" search were "borderline" SEP links: probably not malicious, but more "junk" types of sites.)
THE NEXT QUESTION(S):
As mentioned, the question of "who's the safest search engine?" was the original research focus in this project. However, long before I completed the testing for the 100-200 samples I was planning to look at in detail, the observations above were becoming pretty obvious, and I was losing enthusiasm at the prospect of so much remaining "busy work". Fortunately, a number of other interesting questions were being suggested by the data, so the scope of the research expanded a bit. These other questions will be covered in the next blog posts.