WebPulse Adds Hebrew and Indonesian Modules
I'm trying to catch up on things this week after spending a week in Japan (speaking at a security conference, working with the Blue Coat team, eating a lot of great food, and working on my rusty Japanese skills), and it's time to get back to the blog...
I really meant to do a blog post when we shipped the Hebrew module earlier this year, but other work crowded it out. (I wrote about the Norwegian module when it shipped last year, and intend to write about each new or revised language module we ship, but sometimes other things -- like malware -- take precedence.) Now that we've shipped our Indonesian module, I'm two languages behind.
So, to catch up on WebPulse's current language status:
- The Hebrew module shipped back in early April; it's a full-blown module, returning ratings in 20 categories.
- The Indonesian module shipped in late June; it's a smaller module, supporting 6 categories.
- That brings us to 20 "real" languages supported for dynamic content analysis. (We are unaware of anything comparable to this breadth and depth of real-time multilingual classification in any competing Web filter.)
- Language #21 is an artificial language known as Pornovian, which we use to identify pornographic content in languages that we haven't built a full classifier for. (Many people at Blue Coat have heard about this module, but it's not generally known outside of the company -- it's never had a blog post of its own, but that may change someday...)
- Less well known, even inside Blue Coat, is that there is actually a 22nd language module, simply named "Other". Like Pornovian, this module is focused on detecting pornographic content in languages that do not yet have a dedicated analysis module. (A quick look at yesterday's traffic logs shows that it caught porn in Urdu, Lithuanian, Slovak, Romanian, and Estonian pages, among others.)
All of this real-time content analysis matters to WebPulse, of course, since it helps flesh out the picture of what's "normal" Web traffic and what isn't, and the metadata collected from each of these pages feeds into the various real-time and background modules that hunt for shady traffic and content.
So kudos to David Meyer, who leads the language module work and deserves more recognition than he gets.