Large Language Models (LLMs) like ChatGPT are remarkably good at reconstructing meaningful English sentences from jumbled words, yet the same ability often looks much weaker for Bengali. The gap does not come from a different prediction mechanism; it comes from differences in training data volume, tokenization quality, and linguistic structure.
The model predicts the next token based on prior tokens using learned probabilities.
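A standard way to write this down (general to any autoregressive LLM, not something this article's examples depend on): the probability of a sentence factorizes by the chain rule, and unscrambling amounts to picking the ordering that maximizes it.

$$
P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1}),
\qquad
\hat{\sigma} = \arg\max_{\sigma \in S_n} P\big(w_{\sigma(1)}, \dots, w_{\sigma(n)}\big)
$$

where $S_n$ is the set of all orderings of the $n$ jumbled words. When one ordering has been seen many times, its probability dwarfs the rest; when none has, many orderings score about the same.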
Input:
percent nine genius and perspiration ninety percent is one inspiration
Output:
Genius is one percent inspiration and ninety nine percent perspiration.
This works well because the model has encountered this famous quote, widely attributed to Thomas Edison, many times in training.
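This is easy to check empirically. The sketch below (my assumptions: GPT-2 as a stand-in LLM, loaded through the Hugging Face transformers library; any causal LM would behave similarly) compares the average per-token loss the model assigns to the correct ordering versus the jumble:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Assumption: GPT-2 stands in for "an LLM"; the article names no model.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def avg_nll(text: str) -> float:
    """Mean negative log-likelihood per token; lower means more familiar."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)  # labels=ids yields next-token loss
    return out.loss.item()

correct = "Genius is one percent inspiration and ninety nine percent perspiration."
jumbled = "percent nine genius and perspiration ninety percent is one inspiration"

print(f"correct: {avg_nll(correct):.2f}")  # should be noticeably lower
print(f"jumbled: {avg_nll(jumbled):.2f}")
```

The familiar ordering should come out with a much lower loss, which is the "statistical confidence" this piece keeps returning to.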
Input:
বেলা সুরে সুর সাঁঝবেলাতে তোমার যে মেলাতে আমার যায় সুরে
Expected Output:
আমার বেলা যে যায় সাঁঝবেলাতে তোমার সুরে সুরে সুর মেলাতে
(a line from a Rabindranath Tagore song, roughly: "my day slips away into dusk, tuning my notes to your melody")
Here the model often fails or produces a plausible but wrong ordering: the lyric appears far less frequently in training data, and Bengali's flexible word order leaves many grammatically valid arrangements to choose from.
| Aspect | English | Bengali |
|---|---|---|
| Training data volume | Abundant | Moderate to low |
| Phrase familiarity | High (famous quotes recur constantly) | Low (lyrics and poems appear rarely) |
| Word order | Relatively rigid (SVO) | Flexible (SOV, with freer ordering) |
| Tokenization | Efficient, roughly one token per common word | Fragmented, several subword or byte tokens per word |
| Reconstruction accuracy | High | Lower |
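The tokenization row can be verified directly. A small sketch (again assuming GPT-2's byte-level BPE tokenizer, which is typical of English-centric models) shows how a Bengali sentence shatters into far more tokens per word than an English one:

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

samples = {
    "English": "Genius is one percent inspiration",
    "Bengali": "আমার বেলা যে যায় সাঁঝবেলাতে",
}

for language, text in samples.items():
    n_words = len(text.split())
    # Bengali script falls outside most of the tokenizer's learned merges,
    # so each word fragments into several byte-level pieces.
    n_tokens = len(tokenizer.tokenize(text))
    print(f"{language}: {n_words} words -> {n_tokens} tokens")
```

More tokens per word means each Bengali word is predicted piece by piece, so uncertainty compounds at every step of the reconstruction.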
This is not because the model cannot understand Bengali. It is a matter of statistical confidence: how much relevant data the model has seen, and how tightly the language's structure constrains the answer.
LLMs reconstruct sentences by maximizing probability over learned patterns. English fares better thanks to heavy training exposure and a comparatively rigid word order; Bengali's sparser data and freer ordering add ambiguity that makes reconstruction harder.
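To tie the summary back to the mechanics, here is a toy rendering of "maximizing probability over learned patterns": brute-force scoring of every ordering of a short jumble. (Five words keep the search space at 5! = 120 candidates; the GPT-2 choice is again my assumption, and real decoders search far more cleverly.)

```python
import itertools

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def avg_nll(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

words = ["percent", "genius", "is", "one", "inspiration"]

# Score all 120 orderings and keep the one the model finds most probable
# (lowest average negative log-likelihood).
best = min((" ".join(p) for p in itertools.permutations(words)), key=avg_nll)
print(best)  # likely: "genius is one percent inspiration"
```

For the English quote, one permutation's probability dominates; for a rarely seen Bengali lyric with flexible grammar, several permutations score almost equally well, and that is exactly where reconstruction goes wrong.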