A Transformer-Statistical Hybrid Approach for Arabic Text Summarization


Wadeea R. Naji, Suresha, Mohammed A. S. Al-Mohamadi, Fahd A. Ghanem, Ahmed R. A. Shamsan

Abstract

Arabic text summarization (ATS) is increasingly needed due to the rapid growth of textual data, especially on social media. We study the impact of preprocessing (normalization, stop-word removal, stemming) and text representation (TF-IDF, AraBERT embeddings) on ATS. We also propose a TF-IDF–weighted AraBERT embedding that fuses contextual and statistical cues. Experiments on the EASC corpus with TextRank, LexRank, and LSA show that normalization and stemming improve performance, and that the weighted embedding yields the best results (ROUGE = 0.573; BLEU = 0.348).

Background: The surge of Arabic digital content has amplified the need for effective ATS systems, yet performance depends strongly on preprocessing and representation choices that remain under-explored for Arabic and social-media-style text.
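A minimal sketch of the kind of preprocessing referred to above, assuming NLTK's Arabic stop-word list and ISRI stemmer (the paper's exact normalization and stemming choices may differ):

```python
# Illustrative Arabic preprocessing: normalization, stop-word removal, stemming.
# Assumes NLTK with the "stopwords" corpus downloaded; not the authors' exact code.
import re
from nltk.corpus import stopwords
from nltk.stem.isri import ISRIStemmer

DIACRITICS_TATWEEL = re.compile(r"[\u064B-\u0652\u0640]")  # harakat + tatweel

def normalize(text: str) -> str:
    """Light normalization: strip diacritics/tatweel and unify common letter variants."""
    text = DIACRITICS_TATWEEL.sub("", text)
    text = re.sub("[إأآ]", "ا", text)   # alef variants -> bare alef
    text = text.replace("ى", "ي")        # alef maqsura -> ya
    text = text.replace("ة", "ه")        # ta marbuta -> ha
    return text

def preprocess(text: str, stem: bool = True) -> str:
    """Normalize, drop Arabic stop words, and optionally apply ISRI root stemming."""
    stops = {normalize(w) for w in stopwords.words("arabic")}
    stemmer = ISRIStemmer()
    tokens = [t for t in normalize(text).split() if t not in stops]
    if stem:
        tokens = [stemmer.stem(t) for t in tokens]
    return " ".join(tokens)
```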


Objectives:



  1. Quantify the effect of core preprocessing steps on ATS quality.

  2. Compare TF-IDF and AraBERT representations.

  3. Propose and evaluate a TF-IDF–weighted AraBERT representation.

  4. Benchmark TextRank, LexRank, and LSA on the EASC dataset using ROUGE and BLEU.


Methods: We use the Essex Arabic Summaries Corpus (EASC). Preprocessing includes normalization, stop-word removal, and stemming. Representations are (a) TF-IDF, (b) AraBERT embeddings, and (c) a weighted word embedding that multiplies AraBERT token vectors by their TF-IDF weights and aggregates them into sentence vectors. Summaries are produced by TextRank, LexRank, and LSA. Evaluation uses ROUGE and BLEU.
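A minimal sketch of the weighted representation, assuming the Hugging Face transformers library and the aubmindlab/bert-base-arabertv2 checkpoint (the aggregation details here are an illustration, not the authors' exact code):

```python
# TF-IDF-weighted AraBERT sentence vectors: each word vector (mean of its subword
# states) is scaled by the word's TF-IDF weight, then summed and normalized.
import numpy as np
import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import AutoModel, AutoTokenizer

MODEL = "aubmindlab/bert-base-arabertv2"  # assumed AraBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)
model.eval()

def weighted_sentence_vectors(sentences):
    """Return one TF-IDF-weighted AraBERT vector per (preprocessed) sentence."""
    vectorizer = TfidfVectorizer(token_pattern=r"\S+")
    tfidf = vectorizer.fit_transform(sentences)
    vocab = vectorizer.vocabulary_

    vectors = []
    for i, sent in enumerate(sentences):
        words = sent.split()
        enc = tokenizer(words, is_split_into_words=True,
                        return_tensors="pt", truncation=True)
        with torch.no_grad():
            states = model(**enc).last_hidden_state[0]   # (num_subwords, hidden)
        word_ids = enc.word_ids(0)                       # subword index -> word index

        vec, total = torch.zeros(states.size(1)), 0.0
        for w_idx, word in enumerate(words):
            rows = [j for j, wid in enumerate(word_ids) if wid == w_idx]
            if not rows:
                continue                                 # word lost to truncation
            weight = float(tfidf[i, vocab[word]]) if word in vocab else 0.0
            vec += weight * states[rows].mean(dim=0)
            total += weight
        vectors.append((vec / total if total > 0 else states.mean(dim=0)).numpy())
    return np.vstack(vectors)
```

The resulting vectors plug directly into graph-based extractive rankers; for example, a TextRank-style ranking over cosine similarities (networkx assumed here as tooling):

```python
# Example: TextRank-style sentence ranking over cosine similarities.
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity

def rank_sentences(sentences, top_k=3):
    sims = cosine_similarity(weighted_sentence_vectors(sentences))
    scores = nx.pagerank(nx.from_numpy_array(sims))
    best = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [sentences[j] for j in sorted(best)]  # keep original order in the summary
```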


Results: Normalization and stemming consistently improve ROUGE/BLEU across models. The proposed TF-IDF–weighted AraBERT representation achieves the best overall performance, reaching ROUGE = 0.573 and BLEU = 0.348 on EASC.
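For reference, scores of this kind can be computed with a simple unigram ROUGE-1 F1 and NLTK's sentence-level BLEU (an illustrative setup; the paper's exact evaluation configuration may differ):

```python
# Illustrative scoring: whitespace-token ROUGE-1 F1 plus smoothed sentence BLEU.
# Assumes NLTK; a sketch, not necessarily the authors' evaluation settings.
from collections import Counter
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram overlap F1 over whitespace tokens (works directly on Arabic text)."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / sum(cand.values()), overlap / sum(ref.values())
    return 2 * p * r / (p + r)

def bleu(candidate: str, reference: str) -> float:
    """Sentence-level BLEU with smoothing to avoid zero n-gram counts."""
    return sentence_bleu([reference.split()], candidate.split(),
                         smoothing_function=SmoothingFunction().method1)
```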


Conclusions: Carefully chosen preprocessing and a hybrid representation that fuses contextual (AraBERT) and statistical (TF-IDF) signals substantially boost ATS quality. The simple, model-agnostic weighted embedding is effective with classic extractive methods and provides a strong baseline for future Arabic summarization research.
