Why LLMs Perform Better on Jumbled English Sentences than on Jumbled Bengali Ones

Introduction

Large Language Models (LLMs) like ChatGPT are remarkably good at reconstructing meaningful English sentences from jumbled words. However, the same capability often appears weaker for Bengali. This is due to differences in training data, tokenization, linguistic structure, and probabilistic prediction mechanisms.

1. How LLMs Predict Words

P(w₁, w₂, ..., wₙ) = Π P(wᵢ | w₁, ..., wᵢ₋₁)

The model predicts each token from the tokens before it, using probabilities learned during training. The probability of a whole sentence is the product of these conditional probabilities, so the most familiar ordering of a given set of words receives the highest score.
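A minimal sketch of this chain rule in code, using a hand-made table of conditional probabilities in place of a real model (all numbers are invented purely for illustration):

    import math

    # Toy conditional probabilities P(next word | previous word).
    cond_prob = {
        ("<s>", "genius"): 0.20,
        ("genius", "is"): 0.60,
        ("is", "one"): 0.10,
        ("one", "percent"): 0.50,
        ("percent", "inspiration"): 0.30,
    }

    def sentence_log_prob(words):
        """Add up log P(w_i | w_{i-1}) -- a bigram simplification of the formula above."""
        total = 0.0
        prev = "<s>"  # start-of-sentence marker
        for w in words:
            total += math.log(cond_prob.get((prev, w), 1e-6))  # tiny floor for unseen pairs
            prev = w
        return total

    print(sentence_log_prob(["genius", "is", "one", "percent", "inspiration"]))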

2. Example: English Reconstruction

Input:

percent nine genius and perspiration ninety percent is one inspiration

Output:

“Genius is one percent inspiration and ninety-nine percent perspiration.”

This works well because the model has seen this famous quote frequently.

Why it works:

  - The exact quotation appears very frequently in English training data, so the correct ordering is highly familiar to the model.
  - English word order is rigid, which sharply limits the number of plausible arrangements.
  - English words map cleanly onto whole tokens, so each jumbled word is recognized directly.
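A toy illustration of this "pick the most probable ordering" behavior: score every arrangement of the jumbled words with a small hand-made bigram table and keep the best one. The values are invented for the sketch; a real LLM scores orderings with a neural network over its full context, but the principle is the same.

    import itertools

    # Hand-made bigram log-probabilities; unseen pairs get a heavy penalty.
    # These numbers are illustrative, not real model weights.
    bigram_logp = {
        ("genius", "is"): -0.5,
        ("is", "one"): -1.0,
        ("one", "percent"): -0.7,
        ("percent", "inspiration"): -1.2,
    }
    UNSEEN = -10.0

    def score(order):
        """Sum log P(next | previous) over consecutive word pairs."""
        return sum(bigram_logp.get(pair, UNSEEN) for pair in zip(order, order[1:]))

    jumbled = ["percent", "genius", "one", "inspiration", "is"]
    best = max(itertools.permutations(jumbled), key=score)
    print(" ".join(best))  # -> genius is one percent inspiration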

3. Example: Bengali Reconstruction

Input:

বেলা সুরে সুর সাঁঝবেলাতে তোমার যে মেলাতে আমার যায় সুরে

Expected Output:

“আমার বেলা যে যায় সাঁঝবেলাতে তোমার সুরে সুরে সুর মেলাতে।”
(roughly: “My day drifts into twilight, tuning itself to your melodies”; a well-known Rabindranath Tagore song line that uses every word in the input)

Why it is harder:

  - The target line appears far less often in training data, so no single ordering stands out as the familiar one.
  - Bengali word order is flexible, so several rearrangements of the same words read as plausible sentences.
  - Tokenizers trained mostly on English often split Bengali words into many small sub-word or byte-level pieces, weakening the patterns the model can match (see the token-count sketch below).
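One way to see the tokenization gap directly, assuming the tiktoken package is installed (exact counts depend on the tokenizer, but Bengali text is typically split into far more pieces than English of comparable length):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # byte-pair encoding used by several OpenAI models

    english = "Genius is one percent inspiration and ninety-nine percent perspiration."
    bengali = "আমার বেলা যে যায় সাঁঝবেলাতে তোমার সুরে সুরে সুর মেলাতে।"

    for label, text in [("English", english), ("Bengali", bengali)]:
        tokens = enc.encode(text)
        print(f"{label}: {len(text)} characters -> {len(tokens)} tokens")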

4. Comparison

Aspect              English   Bengali
Training Data       High      Moderate/Low
Phrase Familiarity  High      Low
Word Order          Rigid     Flexible
Tokenization        Strong    Weaker
Accuracy            High      Lower

5. Internal Processing Steps

  1. Token identification: the jumbled input is split into tokens the model recognizes.
  2. Pattern recognition: the tokens activate phrases and word-order patterns seen during training.
  3. Probability maximization: the model favors the arrangement that scores highest under its learned distribution.
  4. Sentence generation: the output is produced one token at a time, each chosen given the tokens emitted so far (a greedy version of steps 3 and 4 is sketched below).
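A minimal, self-contained sketch of steps 3 and 4, with a toy next-token table standing in for a real neural network: at each step the scores are turned into probabilities (softmax) and the most likely token is kept.

    import math

    # Toy "model": for a given context, return scores (logits) for candidate next tokens.
    # All numbers are hypothetical stand-ins for a real network's output.
    def next_token_logits(context):
        table = {
            (): {"genius": 2.0, "percent": 0.5},
            ("genius",): {"is": 3.0, "percent": 0.2},
            ("genius", "is"): {"one": 2.5, "inspiration": 0.4},
            ("genius", "is", "one"): {"percent": 2.8},
            ("genius", "is", "one", "percent"): {"inspiration": 3.1, "<eos>": 0.3},
            ("genius", "is", "one", "percent", "inspiration"): {"<eos>": 4.0},
        }
        return table[tuple(context)]

    def softmax(scores):
        m = max(scores.values())
        exps = {tok: math.exp(s - m) for tok, s in scores.items()}
        total = sum(exps.values())
        return {tok: e / total for tok, e in exps.items()}

    def greedy_generate():
        context = []
        while True:
            probs = softmax(next_token_logits(context))
            token = max(probs, key=probs.get)  # probability maximization
            if token == "<eos>":
                break
            context.append(token)              # sentence generation, one token at a time
        return " ".join(context)

    print(greedy_generate())  # -> genius is one percent inspiration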

6. Not a Language Limitation

This is not because the model cannot understand Bengali. It is a matter of statistical confidence: the model has seen far less Bengali data, and Bengali's flexible structure leaves it with more plausible orderings to choose between.

7. Improvements

Reconstruction quality for Bengali can improve with:

  1. More Bengali text in the pretraining data.
  2. Tokenizers trained on Bengali script, so words are not fragmented into tiny pieces (see the sketch below).
  3. Fine-tuning on Bengali sentence-reordering or grammar-correction examples.
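As a sketch of the second point, a subword tokenizer can be trained directly on Bengali text with the sentencepiece library. The corpus file name and vocabulary size below are placeholders, assuming a plain-text Bengali corpus is available:

    import sentencepiece as spm

    # Train a Bengali-aware subword vocabulary (file name and vocab size are illustrative).
    spm.SentencePieceTrainer.train(
        input="bengali_corpus.txt",   # one sentence per line, UTF-8
        model_prefix="bn_tokenizer",
        vocab_size=8000,
        character_coverage=0.9995,    # keep rare Bengali characters
    )

    # Load the trained model and tokenize the example line into subword pieces.
    sp = spm.SentencePieceProcessor(model_file="bn_tokenizer.model")
    print(sp.encode("আমার বেলা যে যায় সাঁঝবেলাতে তোমার সুরে সুরে সুর মেলাতে।", out_type=str))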

Conclusion

LLMs reconstruct sentences by maximizing probability over learned patterns. English performs better because of heavier training exposure and a more rigid word order. Bengali's sparser data, flexible ordering, and fragmented tokenization introduce ambiguity, making reconstruction more challenging.