Why LLMs Perform Better on Jumbled English Sentences than on Jumbled Bengali Ones

Introduction

Large Language Models (LLMs) like ChatGPT are remarkably good at reconstructing meaningful English sentences from jumbled words. However, the same capability often appears weaker for Bengali. This is due to differences in training data, tokenization, linguistic structure, and probabilistic prediction mechanisms.

1. How LLMs Predict Words

P(w₁, w₂, ..., wₙ) = Π P(wᵢ | w₁, ..., wᵢ₋₁)

The model predicts each token from the tokens before it, using probabilities learned during training. The probability of a whole sentence is the product of these conditional probabilities, so the most familiar ordering of a given set of words receives the highest score.
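A minimal sketch of this chain rule in code, using a hand-made table of conditional probabilities in place of a real model (all numbers are invented purely for illustration):

    import math

    # Toy conditional probabilities P(next word | previous word).
    cond_prob = {
        ("<s>", "genius"): 0.20,
        ("genius", "is"): 0.60,
        ("is", "one"): 0.10,
        ("one", "percent"): 0.50,
        ("percent", "inspiration"): 0.30,
    }

    def sentence_log_prob(words):
        """Add up log P(w_i | w_{i-1}) -- a bigram simplification of the formula above."""
        total = 0.0
        prev = "<s>"  # start-of-sentence marker
        for w in words:
            total += math.log(cond_prob.get((prev, w), 1e-6))  # tiny floor for unseen pairs
            prev = w
        return total

    print(sentence_log_prob(["genius", "is", "one", "percent", "inspiration"]))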

2. Example: English Reconstruction

Input:

percent nine genius and perspiration ninety percent is one inspiration

Output:

“Genius is one percent inspiration and ninety-nine percent perspiration.”

This works well because the model has seen this famous quote frequently.

Why it works:

  - The exact quotation appears very frequently in English training data, so the correct ordering is highly familiar to the model.
  - English word order is rigid, which sharply limits the number of plausible arrangements.
  - English words map cleanly onto whole tokens, so each jumbled word is recognized directly.
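A toy illustration of this "pick the most probable ordering" behavior: score every arrangement of the jumbled words with a small hand-made bigram table and keep the best one. The values are invented for the sketch; a real LLM scores orderings with a neural network over its full context, but the principle is the same.

    import itertools

    # Hand-made bigram log-probabilities; unseen pairs get a heavy penalty.
    # These numbers are illustrative, not real model weights.
    bigram_logp = {
        ("genius", "is"): -0.5,
        ("is", "one"): -1.0,
        ("one", "percent"): -0.7,
        ("percent", "inspiration"): -1.2,
    }
    UNSEEN = -10.0

    def score(order):
        """Sum log P(next | previous) over consecutive word pairs."""
        return sum(bigram_logp.get(pair, UNSEEN) for pair in zip(order, order[1:]))

    jumbled = ["percent", "genius", "one", "inspiration", "is"]
    best = max(itertools.permutations(jumbled), key=score)
    print(" ".join(best))  # -> genius is one percent inspiration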

3. Example: Bengali Reconstruction

Input:

বেলা সুরে সুর সাঁঝবেলাতে তোমার যে মেলাতে আমার যায় সুরে

Expected Output:

“আমার বেলা যে যায় সাঁঝবেলাতে তোমার সুরে সুরে সুর মেলাতে।”
(roughly: “My day drifts into twilight, tuning itself to your melodies”; a well-known Rabindranath Tagore song line that uses every word in the input)

Why it is harder:

  - The target line appears far less often in training data, so no single ordering stands out as the familiar one.
  - Bengali word order is flexible, so several rearrangements of the same words read as plausible sentences.
  - Tokenizers trained mostly on English often split Bengali words into many small sub-word or byte-level pieces, weakening the patterns the model can match (see the token-count sketch below).
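One way to see the tokenization gap directly, assuming the tiktoken package is installed (exact counts depend on the tokenizer, but Bengali text is typically split into far more pieces than English of comparable length):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # byte-pair encoding used by several OpenAI models

    english = "Genius is one percent inspiration and ninety-nine percent perspiration."
    bengali = "আমার বেলা যে যায় সাঁঝবেলাতে তোমার সুরে সুরে সুর মেলাতে।"

    for label, text in [("English", english), ("Bengali", bengali)]:
        tokens = enc.encode(text)
        print(f"{label}: {len(text)} characters -> {len(tokens)} tokens")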

4. Comparison

Aspect              English   Bengali
Training Data       High      Moderate/Low
Phrase Familiarity  High      Low
Word Order          Rigid     Flexible
Tokenization        Strong    Weaker
Accuracy            High      Lower

5. Internal Processing Steps

  1. Token identification: the jumbled input is split into tokens the model recognizes.
  2. Pattern recognition: the tokens activate phrases and word-order patterns seen during training.
  3. Probability maximization: the model favors the arrangement that scores highest under its learned distribution.
  4. Sentence generation: the output is produced one token at a time, each chosen given the tokens emitted so far (a greedy version of steps 3 and 4 is sketched below).
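A minimal, self-contained sketch of steps 3 and 4, with a toy next-token table standing in for a real neural network: at each step the scores are turned into probabilities (softmax) and the most likely token is kept.

    import math

    # Toy "model": for a given context, return scores (logits) for candidate next tokens.
    # All numbers are hypothetical stand-ins for a real network's output.
    def next_token_logits(context):
        table = {
            (): {"genius": 2.0, "percent": 0.5},
            ("genius",): {"is": 3.0, "percent": 0.2},
            ("genius", "is"): {"one": 2.5, "inspiration": 0.4},
            ("genius", "is", "one"): {"percent": 2.8},
            ("genius", "is", "one", "percent"): {"inspiration": 3.1, "<eos>": 0.3},
            ("genius", "is", "one", "percent", "inspiration"): {"<eos>": 4.0},
        }
        return table[tuple(context)]

    def softmax(scores):
        m = max(scores.values())
        exps = {tok: math.exp(s - m) for tok, s in scores.items()}
        total = sum(exps.values())
        return {tok: e / total for tok, e in exps.items()}

    def greedy_generate():
        context = []
        while True:
            probs = softmax(next_token_logits(context))
            token = max(probs, key=probs.get)  # probability maximization
            if token == "<eos>":
                break
            context.append(token)              # sentence generation, one token at a time
        return " ".join(context)

    print(greedy_generate())  # -> genius is one percent inspiration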

6. Not a Language Limitation

This is not because the model cannot understand Bengali. It is a matter of statistical confidence: the model has seen far less Bengali data, and Bengali's flexible structure leaves it with more plausible orderings to choose between.

7. Improvements

Reconstruction quality for Bengali can improve with:

  1. More Bengali text in the pretraining data.
  2. Tokenizers trained on Bengali script, so words are not fragmented into tiny pieces (see the sketch below).
  3. Fine-tuning on Bengali sentence-reordering or grammar-correction examples.
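As a sketch of the second point, a subword tokenizer can be trained directly on Bengali text with the sentencepiece library. The corpus file name and vocabulary size below are placeholders, assuming a plain-text Bengali corpus is available:

    import sentencepiece as spm

    # Train a Bengali-aware subword vocabulary (file name and vocab size are illustrative).
    spm.SentencePieceTrainer.train(
        input="bengali_corpus.txt",   # one sentence per line, UTF-8
        model_prefix="bn_tokenizer",
        vocab_size=8000,
        character_coverage=0.9995,    # keep rare Bengali characters
    )

    # Load the trained model and tokenize the example line into subword pieces.
    sp = spm.SentencePieceProcessor(model_file="bn_tokenizer.model")
    print(sp.encode("আমার বেলা যে যায় সাঁঝবেলাতে তোমার সুরে সুরে সুর মেলাতে।", out_type=str))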

Conclusion

LLMs reconstruct sentences by maximizing probability over learned patterns. English performs better because of heavier training exposure and a more rigid word order. Bengali's sparser data, flexible ordering, and fragmented tokenization introduce ambiguity, making reconstruction more challenging.