scExtract: leveraging large language models for fully automated single-cell RNA-seq data annotation and prior-informed multi-dataset integration
Yuxuan Wu & Fuchou Tang
1 Biomedical Pioneering Innovation Center, School of Life Sciences, Peking University, Beijing 100871, China
2 Beijing Advanced Innovation Center for Genomics (ICG), Ministry of Education Key Laboratory of Cell Proliferation and Differentiation, Beijing, China

Abstract
Single-cell RNA sequencing has revolutionized cellular heterogeneity research, but analyzing the abundance of unannotated public datasets remains challenging. We present scExtract, a framework leveraging large language models to automate scRNA-seq data analysis from preprocessing to annotation and integration. scExtract extracts information from research articles to guide data processing, outperforming existing reference transfer methods in benchmarks. We introduce scanorama-prior and cellhint-prior, which incorporate prior annotation information for improved batch correction while preserving biological diversities. We demonstrate scExtract’s utility by integrating 14 datasets to create a comprehensive human skin atlas of 440,000 cells.

Keywords: Single-cell RNA sequencing, Large language models, Dataset integration

Background
Since the advent of single-cell RNA sequencing technology [1], facilitated by breakthroughs in experimental procedures and sequencing platforms, the volume of publicly available single-cell sequencing data has grown rapidly. Decreasing costs and widespread adoption of commercialized protocols, such as the 10X Genomics platform, have made single-cell RNA sequencing ubiquitous across biological disciplines. To mitigate resource wastage from redundant sequencing, third-party public datasets have become indispensable for research discovery and validation. Consequently, large-scale, curated single-cell “atlas” datasets have emerged for critical and complex diseases [2].

Collaborative efforts like the Human Cell Atlas (HCA) [3] and crowdsourcing platforms such as cellxgene [4] have driven the generation of extensive, multi-species, cross-tissue, and mult