%0 Journal article
%A Rossel, Sven
%A Martínez Arbizu, Pedro
%T Unsupervised biodiversity estimation using proteomic fingerprints from MALDI-TOF MS data†
%R 10.1002/lom3.10358
%R 10.23689/fidgeo-4684
%J Limnology and Oceanography: Methods
%V 18
%N 5
%X Species identification using matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) data relies strongly on reference libraries to differentiate species. Because comprehensive reference libraries, especially for metazoans, are rare, we explored the accuracy of unsupervised diversity estimations of communities using MALDI-TOF MS data in the absence of reference libraries to provide a method for future application in ecological research. To discover the best analysis strategy providing high congruence with true community structures, we carried out a simulation with more than 30,000 analyses using different combinations of data transformations, dimensionality reductions, and cluster algorithms. Species profile, Hellinger, and presence/absence transformations were applied to raw data, and dimensions were reduced using principal component analysis (PCA), t-distributed stochastic neighbor embedding, and uniform manifold approximation and projection. To estimate biodiversity, data were clustered using partitioning around medoids, model-based clustering, and K-means clustering. The analyses were carried out on published mass spectrometry data of harpacticoid copepods. The most successful combinations (Hellinger transformation + PCA or raw data + partitioning around medoids) returned good values even for difficult species distributions containing numerous singleton species. Nevertheless, errors occurred most frequently because of such singleton taxa. Hence, replicative sampling in wide sampling areas for analysis is emphasized to increase the minimum number of specimens per species, thus reducing putative sources of errors.
Our results demonstrate that MALDI-TOF MS data can be used to accurately estimate the biodiversity of unknown communities using unsupervised learning methods. The provided approach allows biodiversity comparisons among sampled regions for which no reference libraries are available. Hence, data on groups that demand time-consuming identification or are highly abundant, in particular, can be analyzed within a short working time, accelerating ecological studies.
%U http://resolver.sub.uni-goettingen.de/purl?gldocs-11858/9030
%~ FID GEO-LEO e-docs