Chanin Nantasenamat
1 min read · Jul 26, 2020


Thanks, Dan, for reading and for the thought-provoking question. Personally, I have used an 80/20 split for most of the research papers I have published in Bioinformatics, and I see my peers doing the same. I think it is an arbitrary number inspired by the Pareto (80/20) principle. It also lines up with 5-fold CV, where each iteration trains on 80% of the data (4 of the folds) and tests on the remaining 20% (the 1 left-out fold). Another common choice is 10-fold CV.
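To make the fold arithmetic concrete, here is a minimal sketch in plain Python. The helper names (k_fold_indices, split_fractions) and the sample size of 100 are made up for illustration; in practice a library routine such as scikit-learn's KFold would do the splitting.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs for a plain, unshuffled k-fold split.

    Hypothetical helper for illustration only.
    """
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so fold sizes differ by at most 1.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def split_fractions(n_samples, n_folds):
    """Return the (train_fraction, test_fraction) pair for every fold."""
    return [
        (len(train) / n_samples, len(test) / n_samples)
        for train, test in k_fold_indices(n_samples, n_folds)
    ]


# Assumed sample size of 100, chosen so the fractions come out exact.
five_fold = split_fractions(100, 5)
ten_fold = split_fractions(100, 10)

print(five_fold[0])  # (0.8, 0.2)
print(ten_fold[0])   # (0.9, 0.1)
```

So each 5-fold iteration is an 80/20 split, and each 10-fold iteration is a 90/10 split. The difference from a single 80/20 split is that every sample ends up in the test set exactly once.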

As for a high number of folds, we would definitely use that for leave-one-out cross-validation (LOO-CV). For a dataset of size N=30, a 30-fold CV is equivalent to LOO-CV: each iteration trains on 29 samples and tests on the 1 sample left out. LOO-CV is therefore recommended when the dataset is small, so that we maximize the use of the data.
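The N=30 case can be checked with a short sketch, again in plain Python. Both helper names are hypothetical; scikit-learn's LeaveOneOut is the usual off-the-shelf equivalent.

```python
def k_fold_test_sets(n_samples, n_folds):
    """Test-index lists for an unshuffled k-fold split (hypothetical helper)."""
    base, extra = divmod(n_samples, n_folds)
    sizes = [base + (1 if fold < extra else 0) for fold in range(n_folds)]
    starts = [sum(sizes[:fold]) for fold in range(n_folds)]
    return [list(range(start, start + size)) for start, size in zip(starts, sizes)]


def leave_one_out_test_sets(n_samples):
    """Leave-one-out: every sample is the test set exactly once."""
    return [[i] for i in range(n_samples)]


n = 30  # dataset size from the example above
same = k_fold_test_sets(n, n) == leave_one_out_test_sets(n)
train_size = n - 1  # samples left for training in each iteration

print(same)        # True
print(train_size)  # 29
```

With as many folds as samples, every fold holds a single sample, so the two schemes produce the same 30 train/test iterations.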

As for other fold counts, I've even seen a research paper proposing a 2-fold Monte Carlo cross-validation, and I've also seen a double nested cross-validation. Perhaps there are other potentially robust rationales for selecting the number of folds that are waiting to be discovered.

😃

