Jiménez-Valverde, A., Lobo, J.M., & Hortal, J. (2009) The effect of prevalence and its interaction with sample size on the reliability of species distribution models. Community Ecology, 10, 196–205. doi:10.1556/ComEc.10.2009.2.9
Prevalence (the presence/absence ratio in the training data) is commonly thought to influence the reliability of the predictions of species distribution models. However, little is known about its precise impact. We studied its effects using a virtual species, which avoids unaccounted-for effects in the modeling process (false absences, non-explanatory predictors, etc.). We sampled the distribution of the virtual species to obtain several data subsets of varying sample size and prevalence, and then modeled these data subsets using logistic regressions. Our results show that model predictions can be highly accurate over a wide range of sample sizes and prevalence scores, provided that the predictors are truly related to the distribution of the species and the training data are reliable. The effect of sample size becomes apparent for datasets of fewer than 70 data points, and the effect of prevalence is significant only for extremely unbalanced samples (prevalence <0.01 or >0.99). There is also a strong interaction between sample size and prevalence, indicating that the most detrimental factor is the sample size of each event (absence and/or presence), and not biased prevalence, as previously thought. We suggest that, in the real world, an interaction must exist between the sample size of each event and the quality of the training data. We argue that biased prevalence can be a desirable property of the data rather than a problem to be avoided, and we point out the importance of using the best absence data possible when modeling the distribution of species of narrow geographic range.
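The sampling design summarized above can be illustrated with a small simulation. The following is a minimal sketch and not the authors' code: the two-predictor virtual species and its coefficients, the grid of sample sizes and prevalences, the gradient-ascent fit, and the use of AUC over the whole simulated landscape as the accuracy measure are all assumptions made for illustration. Prevalence is treated here as the proportion of presences in the training sample, which is consistent with the 0.01 and 0.99 thresholds quoted above.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def make_virtual_species(n_cells, rng, coefs=(0.0, 1.5, -1.0)):
    # Hypothetical virtual species: presence in each cell is drawn from a
    # logistic response to two standardized predictors (assumed coefficients).
    cells = []
    for _ in range(n_cells):
        x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        p = sigmoid(coefs[0] + coefs[1] * x1 + coefs[2] * x2)
        cells.append((x1, x2, 1 if rng.random() < p else 0))
    return cells


def subsample(cells, n, prevalence, rng):
    # Draw n cells with a fixed share of presences (presences / n).
    n_pres = round(n * prevalence)
    presences = [c for c in cells if c[2] == 1]
    absences = [c for c in cells if c[2] == 0]
    return rng.sample(presences, n_pres) + rng.sample(absences, n - n_pres)


def fit_logistic(data, steps=300, lr=0.5):
    # Logistic regression fitted by plain gradient ascent on the mean
    # log-likelihood; a fixed step count keeps it finite under separation.
    w = [0.0, 0.0, 0.0]
    n = len(data)
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for x1, x2, y in data:
            err = y - sigmoid(w[0] + w[1] * x1 + w[2] * x2)
            grad[0] += err
            grad[1] += err * x1
            grad[2] += err * x2
        w = [wi + lr * gi / n for wi, gi in zip(w, grad)]
    return w


def auc(scores, labels):
    # Probability that a random presence outscores a random absence
    # (ties count as one half).
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def run_experiment(seed=0):
    # Fit one model per (sample size, prevalence) pair and score it
    # against the full simulated distribution.
    rng = random.Random(seed)
    cells = make_virtual_species(1000, rng)
    labels = [c[2] for c in cells]
    results = {}
    for n in (20, 200):
        for prevalence in (0.05, 0.5, 0.95):
            w = fit_logistic(subsample(cells, n, prevalence, rng))
            scores = [w[0] + w[1] * x1 + w[2] * x2 for x1, x2, _ in cells]
            results[(n, prevalence)] = auc(scores, labels)
    return results


if __name__ == "__main__":
    for (n, prevalence), value in sorted(run_experiment().items()):
        n_minor = min(round(n * prevalence), n - round(n * prevalence))
        print(f"n={n:3d} prevalence={prevalence:.2f} "
              f"rarer event n={n_minor:3d} AUC={value:.3f}")
```

Printing the size of the rarer event next to each AUC makes it easy to compare configurations by the number of presences or absences actually available, which is the quantity the abstract identifies as most influential. A single seed gives only one realization; repeating the run over many seeds would be needed before reading anything into the differences.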