The Analytics X Prize seeks to do for crime prediction what the Netflix Prize did for movie recommendation. The goal is to predict the proportion of homicides in each of the 47 zip codes in metropolitan Philadelphia. This kind of prediction would let city police departments allocate resources more intelligently and effectively.
One interesting aspect of the contest is that it neither specifies what data to use nor provides any. It is therefore up to each individual or team to seek out public sources of data for modeling and analysis. This should lead to a wide variety of approaches to the problem.
A naive approach to this problem would be to fit a linear model of the proportion of homicides per zip code to a variety of variables such as census and demographic data, number of police departments per zip code, etc. This approach discards the spatial aspect of the problem almost entirely by treating each zip code as an independent observation. As a result, it ignores the fact that high crime in one zip code is likely to be an indicator of crime in nearby zip codes, since neighboring areas tend to be similar. It also ignores the fact that criminals have conscious and unconscious preferences for where they choose to commit crimes. Of course, so-called crimes of passion like homicide don’t necessarily show as much spatial preference as robbery does. Finally, analyzing only zip codes is a fairly low-resolution approach to this kind of spatial analysis.
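To make the naive approach concrete, here is a minimal sketch of such a linear model using ordinary least squares in NumPy. Everything in it is an assumption for illustration: the three feature columns stand in for whatever census or demographic variables one might collect, the data is synthetic, and the function names `fit_linear_model` and `predict_proportions` are hypothetical.

```python
import numpy as np

def fit_linear_model(features, proportions):
    """Fit ordinary least squares with an intercept term."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, proportions, rcond=None)
    return coef

def predict_proportions(coef, features):
    """Predict per-zip-code proportions, clipped at zero and renormalized to sum to 1."""
    design = np.column_stack([np.ones(len(features)), features])
    raw = np.clip(design @ coef, 0.0, None)  # a proportion cannot be negative
    total = raw.sum()
    if total == 0.0:
        # Degenerate case: fall back to a uniform prediction.
        return np.full(len(features), 1.0 / len(features))
    return raw / total

# Synthetic stand-in data: 47 zip codes, 3 made-up covariates (not real data).
rng = np.random.default_rng(0)
features = rng.random((47, 3))
true_weights = np.array([0.5, 0.3, 0.2])
observed = features @ true_weights
observed = observed / observed.sum()

coef = fit_linear_model(features, observed)
predicted = predict_proportions(coef, features)
```

Note that nothing in this model knows which zip codes border each other; shuffling the rows would give exactly the same fit, which is the weakness described above.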
The only reservation I have about choosing another modeling approach is that the evaluation metric for the contest, the RMSE of the proportion of homicides per zip code, lends itself to a linear model. But a criticism of the evaluation method is a topic for another post.
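For reference, a small sketch of the metric, assuming the contest uses the standard definition of root mean squared error over the per-zip-code proportions. The function name `rmse` and the three-element example vectors are illustrative only, not contest data.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences of proportions."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy example with three hypothetical zip codes.
example_score = rmse([0.5, 0.3, 0.2], [0.4, 0.4, 0.2])
```

Because the error is squared, a large miss in a single zip code costs more than several small misses spread across many.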
A better approach would be to implement some kind of spatial model of criminal preferences at a finer resolution than zip codes. This approach would leverage the wealth of publicly available GIS data as well as the geolocated homicide data available from the Philadelphia Police Department’s website. My first pass at this problem used this approach. I had to aggregate my finer-grained predictions to zip code resolution in order to submit to the contest. A threat map of the first-pass prediction is shown below.
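The aggregation step can be sketched roughly as follows. This is not the code behind my submission, just an illustration under simple assumptions: each fine-grained cell carries a non-negative predicted intensity and belongs to exactly one zip code. The names `aggregate_to_zip`, `cell_intensity` and `cell_zip`, and the labels `ZIP_A` and `ZIP_B`, are all hypothetical.

```python
from collections import defaultdict

def aggregate_to_zip(cell_intensity, cell_zip):
    """Sum per-cell predicted intensities by zip code and normalize to proportions."""
    totals = defaultdict(float)
    for cell, intensity in cell_intensity.items():
        totals[cell_zip[cell]] += intensity
    grand_total = sum(totals.values())
    if grand_total <= 0:
        raise ValueError("total predicted intensity must be positive")
    return {zip_code: total / grand_total for zip_code, total in totals.items()}

# Toy example: three grid cells falling in two placeholder zip codes.
cell_intensity = {"cell_1": 2.0, "cell_2": 1.0, "cell_3": 1.0}
cell_zip = {"cell_1": "ZIP_A", "cell_2": "ZIP_A", "cell_3": "ZIP_B"}
zip_proportions = aggregate_to_zip(cell_intensity, cell_zip)
```

In practice, the mapping from cells to zip codes would come from a spatial join against zip code boundary polygons, and cells straddling a boundary would need some rule for splitting their intensity.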