DigiCon AnalytX

    A Large and Diverse Arabic Corpus for Language Modeling

    Abbas Raza Ali, Muhammad Ajmal Siddiqui, Rema Algunaibet, Hasan Raza Ali
    July 3, 2025 by Ali Rizwan

    Abstract:

    Large Language Models (LLMs) have ushered in a major paradigm shift in Natural Language Processing (NLP), where large pre-trained language models (LMs) have become a fundamental component of most NLP tasks. These models can find relevant and meaningful representations of a language without any supervision. When fine-tuned on typical NLP tasks, they achieve substantially higher precision than conventional shallow learning techniques. However, training these models requires a massive corpus that adequately represents a language. Because enormous English corpora are available, English LLMs typically perform better than their counterparts in other languages.

    This effort focuses on the design and development of a large Arabic corpus. The corpus comprises over 500 GB of cleaned Arabic text, intended to improve the cross-domain knowledge and downstream generalization capability of LLMs. The corpus was used to train a large Arabic LLM. To assess the efficacy of the LLM, it was fine-tuned on a variety of typical NLP tasks. The fine-tuned models showed a significant boost in accuracy, ranging from 4.5% to 8.5%, compared with those fine-tuned from multilingual BERT (mBERT). To the best of our knowledge, this is the largest clean and diverse Arabic corpus assembled to date.
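
    The abstract does not say how the text was cleaned. Purely as an illustration of what producing "cleaned" Arabic text can involve, the sketch below removes tatweel and diacritics, collapses whitespace, drops lines that are mostly non-Arabic, and skips exact duplicates. The function names, the choice of steps, and the 0.7 threshold are all assumptions made for this example, not the authors' pipeline.

```python
import re
import unicodedata

# Illustrative choices for this sketch; none of them come from the paper.
TATWEEL = "\u0640"                                 # elongation character (kashida)
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")   # harakat and dagger alif
ARABIC_LETTER = re.compile("[\u0621-\u064A]")      # core Arabic letters
WHITESPACE = re.compile(r"\s+")


def clean_line(line: str) -> str:
    """Normalize one line: NFKC, drop tatweel and diacritics, collapse spaces."""
    line = unicodedata.normalize("NFKC", line)
    line = line.replace(TATWEEL, "")
    line = DIACRITICS.sub("", line)
    return WHITESPACE.sub(" ", line).strip()


def arabic_ratio(line: str) -> float:
    """Share of non-space characters that are Arabic letters."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if ARABIC_LETTER.match(c)) / len(chars)


def clean_corpus(lines, min_ratio=0.7):
    """Yield cleaned, mostly-Arabic lines, skipping exact duplicates."""
    seen = set()
    for raw in lines:
        line = clean_line(raw)
        if not line or arabic_ratio(line) < min_ratio:
            continue  # empty after cleaning, or mostly non-Arabic
        if line in seen:
            continue  # exact duplicate of an earlier cleaned line
        seen.add(line)
        yield line
```

    A corpus at the scale described would also need near-duplicate detection and streaming I/O; exact-match deduplication with an in-memory set is used here only to keep the example short.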


    Link to the article:

     https://www.sciencedirect.com/science/article/pii/S1877050923011419

    in Business Analytics
    About us
    At DigiCon AnalytX, we blend digital marketing with business analytics to drive smart growth. Our data-driven strategies deliver real results. We also provide professional coaching to equip individuals and teams with essential skills in marketing analytics and digital transformation.

    Whether you're a startup looking for traction or a brand aiming to scale, we turn your data into decisions and your marketing into measurable success.
    Connect with us
    • support@digiconanalytx.com
    • +92-304-666-4553
    • +92-332-440-1610
    Copyright © 2025 DigiCon AnalytX