
A Large and Diverse Arabic Corpus for Language Modeling

Abbas Raza Ali, Muhammad Ajmal Siddiqui, Rema Algunaibet, Hasan Raza Ali

Abstract:

Large Language Models (LLMs) have ushered in a major paradigm shift in Natural Language Processing (NLP), where large pre-trained language models (LMs) have become a fundamental component of most NLP tasks. These models can learn relevant and meaningful representations of a language without any supervision. They are then fine-tuned on typical NLP tasks with substantially higher accuracy than conventional shallow learning techniques. However, training these models requires a massive corpus that adequately represents a language. Because of the enormous corpora available for English, English LLMs typically perform better than their counterparts in other languages.


This effort focuses on the design and development of a large Arabic corpus. The corpus comprises over 500 GB of cleaned Arabic text, intended to improve cross-domain knowledge and the downstream generalization capability of LLMs. The corpus was used to train a large Arabic LLM. To assess the efficacy of the LLM, it was fine-tuned on a variety of typical NLP tasks. The fine-tuned tasks exhibited a significant boost in accuracy, ranging between 4.5 and 8.5%, compared to tasks fine-tuned from multilingual BERT (mBERT). To the best of our knowledge, this is currently the largest clean and diverse Arabic corpus ever assembled.
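As a rough illustration of the fine-tuning step described above, the sketch below fine-tunes an Arabic BERT-style checkpoint on a toy classification task using the Hugging Face Trainer API. The checkpoint name asafaya/bert-base-arabic and the two-example dataset are stand-ins, since the paper's own model and benchmark data are not reproduced here.

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import Dataset

# Placeholder checkpoint: the paper's Arabic LLM is not released under a
# public name here, so a publicly available Arabic BERT stands in for it.
checkpoint = "asafaya/bert-base-arabic"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Tiny illustrative dataset; in practice this would be one of the paper's
# downstream benchmark tasks (e.g. sentiment classification or NER).
train_data = Dataset.from_dict({
    "text": ["هذا المنتج رائع", "الخدمة سيئة للغاية"],
    "label": [1, 0],
})

def tokenize(batch):
    # Tokenize the Arabic text into fixed-length input IDs for the model.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=64)

train_data = train_data.map(tokenize, batched=True)

args = TrainingArguments(output_dir="arabic-finetune",
                         per_device_train_batch_size=8,
                         num_train_epochs=3,
                         learning_rate=2e-5)

Trainer(model=model, args=args, train_dataset=train_data).train()
```

The same recipe applies when comparing against mBERT: only the checkpoint name changes, which is what makes the accuracy comparison between the Arabic LLM and mBERT straightforward.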


Link to the article:

 https://www.sciencedirect.com/science/article/pii/S1877050923011419

Ali Rizwan, July 3, 2025