Topic modeling through rank-based aggregation and LLMs: An approach for AI and human-generated scientific texts

Abstract

The increasing presence of AI-generated and AI-paraphrased content in scientific literature poses new challenges for topic modeling, particularly in maintaining semantic coherence and interpretability across diverse text sources. Traditional topic modeling methods, such as Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF), often suffer from inconsistencies and diminished coherence when applied to heterogeneous sources. Recently, large language models (LLMs) have demonstrated potential for enhanced topic extraction, yet they frequently lack the stability and interpretability required for reliable deployment. In response to these limitations, we propose a robust ensemble framework that integrates rank-based aggregation with LLM-powered topic extraction to achieve consistent, high-quality topic modeling across AI-generated, AI-paraphrased, and human-generated scientific abstracts. The framework employs a rank-based aggregation scheme to reduce inconsistencies in LLM outputs and incorporates neural topic models to enhance coherence and semantic depth. By combining the strengths of traditional models and LLMs, it consistently outperforms baseline methods in topic coherence, diversity, and stability. Experimental results on a diverse dataset of scientific abstracts show substantial improvements in coherence scores and topic interpretability, with the ensemble approach outperforming both conventional and leading neural topic models by significant margins. The framework not only addresses the challenges of cross-source topic modeling but also establishes a benchmark for robust, scalable analysis of scientific literature spanning AI and human narratives.
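
The abstract does not spell out the aggregation rule, so the sketch below is illustrative only: it fuses the per-topic word rankings produced by different models (e.g., LDA, NMF, and an LLM) with a Borda count, one standard rank-aggregation method consistent with the "rank-based aggregation" described above. The model names and word lists are hypothetical.

```python
# Minimal sketch of rank-based aggregation over topic-word lists.
# Assumption: each model emits a ranked word list for the same latent
# topic; Borda counting is assumed here purely for illustration.
from collections import defaultdict

def borda_aggregate(rankings, top_k=10):
    """Fuse several ranked word lists into one consensus ranking.

    rankings: list of word lists, each ordered from most to least
              relevant (e.g., one list per model: LDA, NMF, LLM).
    Returns the top_k words by total Borda score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        n = len(ranking)
        for position, word in enumerate(ranking):
            # A word at position p in a list of length n earns n - p points.
            scores[word] += n - position
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical topic-word rankings from three models for one topic.
lda_words = ["network", "neural", "training", "model", "gradient"]
nmf_words = ["neural", "network", "model", "layer", "training"]
llm_words = ["neural", "model", "network", "transformer", "training"]

print(borda_aggregate([lda_words, nmf_words, llm_words], top_k=5))
# -> ['neural', 'network', 'model', 'training', 'layer']
```

Because Borda counting operates on rank positions rather than raw relevance scores, it avoids calibrating heterogeneous scores across models, which is one plausible reading of how rank-based aggregation stabilizes variable LLM outputs.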
