Figure Caption

Figure 5—figure supplement 3. Expert similarity across all datasets: F1 scores.

The heatmaps show the mean MF1 score at a matching IoU threshold of t = 0.5 for the image feature annotations of the indicated experts. The estimated ground truth (est. GT) was always calculated on all available expert annotations. The expert number refers to a unique human annotator (e.g. expert 1 is the same person across the datasets in A–D). The similarity scores were calculated on n = 5 images for A, B, C, and E and on n = 9 images (test set) for D. The similarity between the same experts varies across the datasets (A–D), indicating that the heuristic bias of the annotators depends on the underlying data. However, expert 1 consistently yields the lowest similarity scores in A, C, and D. The overall performance within one group of experts remains in a similar range across the different datasets (A–D) and is comparable to that of a second group of experts on a different image dataset (E).
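
As an illustration of how a pairwise F1 score at an IoU threshold of t = 0.5 can be computed for two annotations of the same image, here is a minimal sketch. It is not the authors' implementation. It assumes that each annotated feature is given as a set of pixel coordinates, that a pair counts as matched when its IoU is at least t, and that features are matched one-to-one greedily by descending IoU. The names `iou`, `f1_at_iou`, `mean_f1`, and `rect` are hypothetical helpers introduced only for this example.

```python
from typing import FrozenSet, List, Sequence, Tuple

Pixel = Tuple[int, int]
Roi = FrozenSet[Pixel]  # one annotated image feature as a set of (row, col) pixels


def iou(a: Roi, b: Roi) -> float:
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def f1_at_iou(rois_a: List[Roi], rois_b: List[Roi], t: float = 0.5) -> float:
    """F1 score between two annotations of one image at IoU threshold t.

    Assumption: one-to-one greedy matching, highest IoU first.
    """
    pairs = sorted(
        ((iou(a, b), i, j)
         for i, a in enumerate(rois_a)
         for j, b in enumerate(rois_b)),
        reverse=True,
    )
    used_a, used_b = set(), set()
    tp = 0
    for score, i, j in pairs:
        if score < t:
            break  # all remaining pairs are below the threshold
        if i in used_a or j in used_b:
            continue  # each feature can be matched at most once
        used_a.add(i)
        used_b.add(j)
        tp += 1
    fp = len(rois_a) - tp  # features in A without a partner in B
    fn = len(rois_b) - tp  # features in B without a partner in A
    denom = 2 * tp + fp + fn
    # Assumption: two empty annotations count as perfect agreement.
    return 2 * tp / denom if denom else 1.0


def mean_f1(image_pairs: Sequence[Tuple[List[Roi], List[Roi]]], t: float = 0.5) -> float:
    """Mean F1 over several images, each given as (annotation A, annotation B)."""
    return sum(f1_at_iou(a, b, t) for a, b in image_pairs) / len(image_pairs)


def rect(r0: int, c0: int, r1: int, c1: int) -> Roi:
    """Hypothetical helper: rectangular feature covering rows r0..r1-1, cols c0..c1-1."""
    return frozenset((r, c) for r in range(r0, r1) for c in range(c0, c1))


# Toy data: the first features overlap with IoU = 12/16 = 0.75, the second ones do not overlap.
expert_1 = [rect(0, 0, 4, 4), rect(10, 10, 14, 14)]
expert_2 = [rect(0, 0, 4, 3), rect(20, 20, 24, 24)]
score = f1_at_iou(expert_1, expert_2)  # 1 match, 1 unmatched feature on each side
```

In this toy case one feature pair is matched and one feature per expert stays unmatched, so F1 = 2·1 / (2·1 + 1 + 1) = 0.5. Each heatmap cell would then correspond to such a score averaged over the images of a dataset, which `mean_f1` sketches.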

Acknowledgments
This image is the copyrighted work of the attributed author or publisher, and ZFIN has permission only to display this image to its users. Additional permissions should be obtained from the applicable author or publisher of the image. Full text @ eLife