Figure 1—figure supplement 2. Impact of dataset size, class balance, and model architecture on training performance.

(A) To test how the size of the training dataset affects model training, we drew random subsamples from the MF–GC dataset and trained miniML models using five-fold cross-validation. (B–D) Comparison of loss (B), accuracy (C), and area under the ROC curve (AUC; D) across increasing dataset sizes. Data are means of model training sessions with k-fold cross-validation; shaded areas represent SD. Note the logarithmic scale of the abscissa. (E) Comparison of model training with unbalanced training data. (F) Area under the ROC curve for models trained with different degrees of class imbalance. Unbalanced datasets impair classification performance. (G) Accuracy and F1 score for different model architectures plotted against the number of free parameters. The CNN-LSTM architecture achieved the best performance with the fewest free parameters. EarlyStopping was used for all models to prevent overfitting (difference between training and validation accuracy <0.3%). ResNet, residual neural network; MLP, multi-layer perceptron.
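The stratified five-fold cross-validation used for the subsampling experiments in panels A–D can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `stratified_kfold_indices`, the seed, and the example 80:20 label ratio are assumptions made for demonstration only.

```python
import numpy as np

def stratified_kfold_indices(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for stratified k-fold cross-validation.

    Each class is shuffled independently and dealt round-robin into k folds,
    so every fold preserves the overall class ratio of the dataset.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    folds = [[] for _ in range(k)]
    for cls in np.unique(labels):
        # Shuffle indices of this class, then distribute them across folds.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        for i, j in enumerate(idx):
            folds[i % k].append(j)
    for i in range(k):
        val = np.array(sorted(folds[i]))
        train = np.array(sorted(j for f in folds[:i] + folds[i + 1:] for j in f))
        yield train, val

# Hypothetical unbalanced label set (80 events : 20 non-events), as in panel F.
labels = np.array([1] * 80 + [0] * 20)
splits = list(stratified_kfold_indices(labels, k=5))
```

With this scheme, each of the five validation folds contains the same 80:20 class ratio as the full dataset, so imbalance effects (panels E–F) arise from the training data composition rather than from uneven splits.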

Full text @ eLife