I’m in the process of reading and working along with Programming Collective Intelligence by Toby Segaran. The book (so far) presents applied statistics through the lens of Big Data (“Web 2.0” as the book, published 2007, so quaintly refers to). The first two recommendation techniques covered use Euclidean Distance and Pearson Correlation to determine the similarity between two rankings and sets of rankings as a whole, respectively.
The gist of the first approach is to compare two movie critics’ movie rankings to generate a similarity score. In Euclidean space the author plots the two critics’ ranking for two movies along X and Y axes. Then the distance between the two points is calculated with the Pythagorean Theorem (for the 2 dimensions). One thing I immediately was curious about is how the equation would scale. A search for Euclidean Distance shows the computation cost for N dimensions remains linear. I really like this method’s clarity and relative simplicity; while I’m sure there would be copious eye-glazing at a typical workplace presentation, the method is fairly easy to graph and picture the basic concept.
Slightly more complex, but also useful (and possibly more mathematically sound) is the Pearson Correlation Coefficient. The coefficient in this example is determined by a best-fit linear regression. One key variation from Euclidean Distance is in the choice of axis values. In Euclidean Distance the ranking scale for each movie (1 through 5) was selected for each axis and the critic was plotted accordingly. In the Pearson example, the author chooses the ranking scale (1 through 5) for each critic and plots the various movies according to their mutual scores. For example, if Bob (x axis) and Sally (y axis) gave Office Space a 1 and 3 respectively, Office Space would be plotted at (1.3) in the space. The resulting graph looks something like a scatter plot for random values. Pearson Correlation Coefficient draws the line closest to all the points possible. The slope of that line is the coefficient, or r value that is returned in the author’s function and tells us how similar the two variables are, with r ranging from -1 to 1. Wikipedia does a great job showing examples of plots resulting in -1, 0, 1, and in-between values. I conceptualize the coefficient possibilities by considering a slope (r value) of 1 would mean the two critics had the exact same scores for every movie. The plotted coordinates would be (1,1), (2,2), (3,3), etc.
While Euclidean Distance is straightforward and can be explained conceptually with a graph, Pearson Correlation Coefficient would be lost on most, and a higher order implementation is in the realm of math majors.
Neat stuff, and much more to come!