Open Source
What is open source Winnow?
Open source Winnow is a high-performance content recommendation engine that efficiently trains and operates any number of unique classifiers on large sets of content.
Winnow was developed over several years and has been used heavily by the winnowTag web application throughout that time. So although 2010 marks its first release as open source, Winnow has already proven both usable and stable.
How is Winnow different from a typical web search engine?
Web search engines often use a minimal (e.g., keyword-based) model of relevance and prioritize content by popularity, so they are best at finding what other people are interested in. Winnow instead finds what an individual user is interested in, based on examples that the user provides.
How does Winnow work? How do I integrate Winnow with my application?
Winnow runs as a daemon process with an embedded HTTP server and implements multithreaded processing of an internal queue of classification jobs. Job control includes status and error reporting.
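The job-processing model described above can be pictured with a small worker pool. This is only an illustrative sketch in Python, not Winnow's C implementation: the function name `run_jobs`, the status fields, and the default of four workers are all assumptions made for the example.

```python
import queue
import threading


def run_jobs(jobs, classify, num_workers=4):
    """Run classification jobs on a pool of worker threads.

    `jobs` is a list of (job_id, payload) pairs and `classify` is any
    callable. Returns a dict mapping each job_id to a status record.
    All names here are hypothetical, chosen for illustration only.
    """
    pending = queue.Queue()
    status = {job_id: {"state": "queued", "result": None, "error": None}
              for job_id, _ in jobs}
    lock = threading.Lock()

    def worker():
        while True:
            item = pending.get()
            if item is None:  # sentinel: no more work for this worker
                pending.task_done()
                return
            job_id, payload = item
            try:
                update = {"state": "done", "result": classify(payload), "error": None}
            except Exception as exc:  # record the failure instead of killing the worker
                update = {"state": "error", "result": None, "error": str(exc)}
            with lock:
                status[job_id] = update
            pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for job in jobs:
        pending.put(job)
    for _ in threads:
        pending.put(None)  # one sentinel per worker
    pending.join()
    for thread in threads:
        thread.join()
    return status
```

Each job ends in either a "done" or an "error" state, which is the kind of status and error reporting a client would poll for.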
Your application communicates with Winnow via REST, using HMAC authentication if Winnow is running on a separate machine.
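As a rough idea of what HMAC-authenticated REST calls involve, the sketch below signs a request with a shared secret and verifies it on the receiving side. The canonical string layout, the choice of SHA-256, and the example path are assumptions for illustration; consult the Winnow documentation for the scheme Winnow actually expects.

```python
import hashlib
import hmac


def canonical_string(method, path, date, body):
    """Join the parts of the request that the signature should cover.

    This layout is a hypothetical one, not Winnow's documented format.
    """
    body_hash = hashlib.sha256(body).hexdigest()
    return "\n".join([method.upper(), path, date, body_hash])


def sign_request(secret, method, path, date, body=b""):
    """Return a hex HMAC-SHA256 signature over the canonical string."""
    message = canonical_string(method, path, date, body).encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_request(secret, method, path, date, body, signature):
    """Recompute the signature and compare it in constant time."""
    expected = sign_request(secret, method, path, date, body)
    return hmac.compare_digest(expected, signature)
```

The client would send the signature along with the request (for example in a header), and the server would reject any request whose recomputed signature does not match, which protects against tampering when the two run on separate machines.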
Winnow supports any number of classifiers. For example, in winnowTag, our web application for end users, each smart tag has its own classifier that is trained on the items to which that tag has been manually applied.
Winnow’s classification algorithm evolved from SpamBayes.
For more information, read this detailed description of Winnow’s architecture.
How fast is Winnow?
Over its several years of development, Winnow has been extensively tuned for performance, including the use of caching.
The winnowTag web application runs Winnow on a dedicated server (though a dedicated server is not required). This server has four quad-core Intel Xeon E5410 2.33 GHz processors and 12 GB of RAM. The performance figures below, however, were measured using only one of these quad-core processors.
When the training examples for a winnowTag tag are changed, a new Winnow classifier is trained on the revised set of examples. Training the new classifier takes Winnow as little as 0.2 seconds. All of winnowTag’s content is then reclassified by the new classifier at about 18,000 feed items per second.
Winnow lets you maintain classification incrementally by classifying only new content. When 200 new items of content are added to winnowTag, Winnow takes about 0.014 seconds per classifier to classify those 200 items, so 1,000 existing classifiers would need about 14 seconds (1,000 × 0.014 s) to classify them.
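The estimates in this section are straightforward multiplications, so they are easy to rework for your own workload. The helper names below are made up for this example, and the defaults are simply the figures quoted above for one quad-core processor; your own hardware and content will give different numbers.

```python
def incremental_seconds(num_classifiers, seconds_per_classifier=0.014):
    """Time to classify one batch of new items with every existing classifier.

    The default is the quoted per-classifier time for a batch of 200 items.
    """
    return num_classifiers * seconds_per_classifier


def full_reclassification_seconds(num_items, items_per_second=18000,
                                  training_seconds=0.2):
    """Time to train one new classifier and reclassify all existing content."""
    return training_seconds + num_items / items_per_second
```

For 1,000 classifiers, `incremental_seconds(1000)` comes to roughly 14 seconds, close to the figure quoted above.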
How was Winnow developed?
Winnow’s classification algorithm is based on existing practical naive Bayes text classifiers (e.g., SpamBayes), significantly evolved to improve classification accuracy and performance. Our revised version works well with very small and unbalanced training sets. It has been extensively tested using cross-validation on hand-tagged corpora, and then in active use.
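For readers unfamiliar with the starting point, here is a minimal textbook naive Bayes text classifier for a single tag. It is not Winnow's algorithm, which has been significantly evolved from this family, and it is not the SpamBayes method either; the class name `TagClassifier`, the whitespace tokenizer, and the add-one smoothing are all assumptions chosen to keep the sketch short.

```python
import math
from collections import Counter


class TagClassifier:
    """Two-class naive Bayes over word tokens: tagged vs. not tagged."""

    def __init__(self):
        self.token_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, text, tagged):
        """Add one example; `tagged` says whether the tag applies to it."""
        self.token_counts[tagged].update(text.lower().split())
        self.doc_counts[tagged] += 1

    def probability(self, text):
        """Return P(tagged | text) under the naive independence assumption."""
        vocab = set(self.token_counts[True]) | set(self.token_counts[False])
        total_docs = self.doc_counts[True] + self.doc_counts[False]
        log_scores = {}
        for label in (True, False):
            # Smoothed class prior, so an empty class never produces log(0).
            log_score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_tokens = sum(self.token_counts[label].values())
            # Add-one smoothing; the extra +1 reserves mass for unseen tokens.
            denominator = total_tokens + len(vocab) + 1
            for token in text.lower().split():
                count = self.token_counts[label][token]
                log_score += math.log((count + 1) / denominator)
            log_scores[label] = log_score
        # Normalize in log space to avoid floating-point underflow.
        top = max(log_scores.values())
        exp_true = math.exp(log_scores[True] - top)
        exp_false = math.exp(log_scores[False] - top)
        return exp_true / (exp_true + exp_false)
```

In a winnowTag-like setup, one such classifier would exist per tag: items the user tagged are trained with `tagged=True`, other items with `tagged=False`, and new items are scored with `probability`.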
We initially implemented Winnow in pure Ruby. For performance reasons we then tried Ruby with C extensions, and Erlang. In the end we found that the current implementation in C provides the best performance: it is about 35 times faster than pure Ruby, about 4 times faster than Ruby with C extensions, and about 5 times faster than Erlang. Both Ruby solutions also had serious limitations with regard to parallel processing. At the time, scalable and efficient parallel processing wasn’t possible in Ruby, since all Ruby 1.8 threads are lightweight (green) threads that run in a single OS thread.
Requirements
Winnow is implemented in C for the POSIX platform. Currently we use it with OS X and CentOS. It should work on most flavors of Linux and other POSIX systems. (Please let us know if you port it to Windows.)
Resources
Winnow documentation pages in GitHub wiki
Requirements and build instructions – INSTALL
Setup and operation – README
GitHub repository
We like questions and comments! Please contact us. And if you uncover a bug, open an issue in that repository on GitHub.