Similarity Analysis Explained
The formula itself is rather simple. I took the average and standard deviation of each category* across the whole sample, and each player received a z-score in every category: the given value minus the average, all divided by the standard deviation, so that an average value gets a z-score of zero and a value one standard deviation above average gets a z-score of one. That gives a normalized statistical value for each category. From there, the differences between one player’s z-scores and those of every other player in the sample are taken, category by category, and the size of those differences is added up across all the statistics to form a “similarity score.” A player, when compared with himself, would produce a similarity score of zero, as there’s no difference between the z-scores. The lower the similarity score, the closer the two players are, statistically speaking. A similarity score of six (which is the difference between Caris LeVert’s sophomore and E’Twaun Moore’s freshman seasons) would indicate that, on average, the z-scores for each of the 20 statistics differed by 0.3 standard deviations.
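For anyone who wants to see the arithmetic, here is a minimal Python sketch of the calculation described above. It is an illustration, not the actual code behind the database: the three player-seasons and their numbers are made up, only three of the 20 categories are shown, and the function names are hypothetical. It also assumes the population standard deviation and the absolute value of each z-score difference, neither of which is spelled out above.

```python
from statistics import mean, pstdev

# Hypothetical toy data: three made-up player-seasons and three of the 20 categories.
CATEGORIES = ["eFG%", "ARate", "TORate"]
SAMPLE = {
    "Player A": [50.0, 20.0, 15.0],
    "Player B": [55.0, 10.0, 20.0],
    "Player C": [45.0, 30.0, 25.0],
}


def z_score_table(sample):
    """Convert every raw value to a z-score: (value - average) / standard deviation."""
    columns = list(zip(*sample.values()))        # one tuple of values per category
    averages = [mean(col) for col in columns]
    deviations = [pstdev(col) for col in columns]  # assumption: population std dev
    return {
        name: [(v - avg) / sd for v, avg, sd in zip(values, averages, deviations)]
        for name, values in sample.items()
    }


def similarity(z_a, z_b):
    """Sum the absolute z-score differences across categories (lower = more similar)."""
    return sum(abs(a - b) for a, b in zip(z_a, z_b))


def closest_matches(name, table):
    """Rank every other player in the sample by similarity score, closest first."""
    scores = [
        (similarity(table[name], z), other)
        for other, z in table.items()
        if other != name
    ]
    return sorted(scores)


if __name__ == "__main__":
    table = z_score_table(SAMPLE)
    for score, other in closest_matches("Player A", table):
        print(f"Player A vs {other}: {score:.2f}")
```

Dividing a similarity score by the number of categories gives the average gap per statistic, which is where the 0.3 figure comes from: 6 / 20 = 0.3.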
All 20 categories are valued equally in this system. Shooting percentages in various forms make up five of those categories and rebounding rates another three, so those statistics are perhaps valued fairly, while, say, assist and turnover rates could be considered undervalued. For simplicity’s sake (and instead of giving arbitrary weights to certain values), that’s how the system works. With seven full seasons (from 2008 to 2014), there are a total of 750 players in the database, so there’s a decently sized sample from which comparisons can be drawn. Statistical systems, especially ones like this, are never really perfect, but perfection isn’t my goal. The potential uses of this project are pretty vast, and while it’s never going to find a perfect one-to-one comparison for anyone, it’s a great jumping-off point for comparative evaluations and the discussions that stem from there.
*The inputs for the system are as follows: the percentage of available minutes played (%Min), the percentage of possessions used while on the floor (%Poss), the percentage of shots taken while on the floor (%Shots), offensive rating (ORtg), effective field goal percentage (eFG%), true shooting percentage (TS%), two-point field goal percentage (2P%), three-point field goal percentage (3P%), the percentage of field goal attempts that were three-pointers (3PA%), free throw percentage (FT%), fouls drawn per 40 minutes (FD/40), free throw rate (FTRate), assist rate (ARate), turnover rate (TORate), offensive rebounding rate (OR%), defensive rebounding rate (DR%), total rebounding rate (TR%), block rate (Blk%), steal rate (Stl%), and fouls committed per 40 minutes (FC/40).
Alex Cook, the author of this post, can be found on Twitter @alexcook616. Any comparison requests are more than welcome.