Figure Caption

Figure 2—figure supplement 3. The heatmap shows the mean MF1 score at a matching IoU threshold of t = 0.5 for the image feature annotations of the indicated experts. Segmentation masks of the five human experts (N_expert = 1 per expert), the estimated ground truth (N_est. GT = 1), the respective expert models, the consensus models, and the consensus ensembles (N_models = 4 per model or ensemble) are compared. The diagonal values show the inter-model reliability (no data are available for the human experts, who annotated the images only once). The consensus ensembles show the highest reliability (0.94) and perform on par with human experts when compared with the est. GT (0.77). Both expert 1 and the corresponding expert 1 models show overall low similarity to the other experts and expert models, while sharing a high similarity with each other (0.73).
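For readers unfamiliar with the metric, the following is a minimal sketch of how an F1 score at an IoU matching threshold can be computed between two annotations of the same image and then averaged over images. It is an illustration under stated assumptions, not the authors' implementation: it assumes that "MF1" denotes the mean of per-image F1 scores, that each feature is matched one-to-one to at most one feature in the other annotation, and that two empty annotations count as perfect agreement. The names `rect`, `iou`, `f1_at_iou`, and `mean_f1` are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

Pixel = Tuple[int, int]
ROI = Set[Pixel]  # one annotated image feature, stored as a set of pixel coordinates


def rect(r0: int, c0: int, r1: int, c1: int) -> ROI:
    """Hypothetical helper: rectangular ROI covering rows r0..r1-1 and columns c0..c1-1."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


def iou(a: ROI, b: ROI) -> float:
    """Intersection over union of two ROIs."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def f1_at_iou(rois_a: Sequence[ROI], rois_b: Sequence[ROI], t: float = 0.5) -> float:
    """F1 score between two annotations; ROIs match when their IoU is >= t."""
    # Rank all candidate pairs by IoU and match greedily, one-to-one.
    pairs = sorted(
        ((iou(a, b), i, j) for i, a in enumerate(rois_a) for j, b in enumerate(rois_b)),
        reverse=True,
    )
    used_a, used_b = set(), set()
    tp = 0
    for score, i, j in pairs:
        if score < t:
            break  # all remaining pairs are below the threshold
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        tp += 1
    fn = len(rois_a) - tp  # ROIs in the first annotation left unmatched
    fp = len(rois_b) - tp  # ROIs in the second annotation left unmatched
    denom = 2 * tp + fp + fn
    # Assumption: two empty annotations are treated as perfect agreement.
    return 2 * tp / denom if denom else 1.0


def mean_f1(image_pairs: Sequence[Tuple[List[ROI], List[ROI]]], t: float = 0.5) -> float:
    """Mean of the per-image F1 scores over a set of annotated images."""
    if not image_pairs:
        raise ValueError("at least one image pair is required")
    scores = [f1_at_iou(a, b, t) for a, b in image_pairs]
    return sum(scores) / len(scores)


# Toy data: one ROI overlaps with IoU 0.6, the other has no counterpart.
annotation_a = [rect(0, 0, 4, 4), rect(10, 10, 14, 14)]
annotation_b = [rect(0, 1, 4, 5), rect(20, 20, 24, 24)]
score = f1_at_iou(annotation_a, annotation_b, t=0.5)
```

In this toy case one of two features is matched on each side, so `score` is 0.5. Each cell of a heatmap like the one described would then hold such a score averaged over all images for one pair of annotators or models.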

Acknowledgments
This image is the copyrighted work of the attributed author or publisher, and ZFIN has permission only to display this image to its users. Additional permissions should be obtained from the applicable author or publisher of the image. Full text @ Elife