While building a Malayalam handwritten character database, we wanted to find out which letters of the Malayalam alphabet are the most common. Here are our findings.

The source text is the Malayalam Indian Revised Version Bible, available here. The choice of translation could affect the list of top words, but it should have little effect on the letter ranking.

The book was chosen because of a few reasons:

Process

The Unicode block of Malayalam is U+0D00–U+0D7F. All the books in the Bible were read and loaded as a single string, and whitespace was stripped. Each character was checked to see whether it fell in that Unicode range; if it did, it was added to a dictionary of characters and its count was incremented.
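The counting step described above can be sketched roughly as follows. This is a minimal illustration, not the original script: the function name `count_malayalam_chars` and the sample string are hypothetical, and a `Counter` stands in for the plain dictionary mentioned in the text.

```python
from collections import Counter

# Bounds of the Malayalam Unicode block, as stated in the text.
MALAYALAM_START = 0x0D00
MALAYALAM_END = 0x0D7F


def count_malayalam_chars(text: str) -> Counter:
    """Count every code point in `text` that falls inside the Malayalam block."""
    counts = Counter()
    for ch in text:
        if MALAYALAM_START <= ord(ch) <= MALAYALAM_END:
            counts[ch] += 1
    return counts


# Hypothetical sample: five Malayalam code points followed by Latin letters
# and digits, which fall outside the block and are ignored.
sample = "\u0d0e\u0d28\u0d4d\u0d31\u0d46 abc 123"
counts = count_malayalam_chars(sample)
total = sum(counts.values())

for ch, n in counts.most_common():
    print(f"U+{ord(ch):04X}  count={n}  share={100 * n / total:.1f}%")
```

Because the range check already skips spaces and punctuation, stripping whitespace beforehand is optional in this sketch.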

Pareto principle

Letters and words roughly follow the Pareto principle in almost any human language: about 20% of the words make up about 80% of all text content. Unsurprisingly, about 20% of all letters make up about 80% of all the words.

Words

The most common words depend on the subject matter of the text. Using the Bible as the reference text, the most common words were: ഞാൻ, അവൻ, എന്റെ, അവർ, നിന്റെ, എന്നു, എന്ന് … in that order.

+-----------------------------------+-------+
|                Word               | Count |
+-----------------------------------+-------+
|                ഞാൻ                |  6447 |
|                അവൻ                |  5191 |
|                എന്റെ               |  4198 |
|                അവർ                |  4102 |
|               നിന്റെ               |  3924 |
|                എന്നു               |  3838 |
|                എന്ന്                |  3722 |
|                ഒരു                |  3705 |
|               അവന്റെ               |  3334 |
|                 നീ                |  2976 |
|                യഹോവ               |  2522 |
|               പറഞ്ഞു.              |  2445 |
|               നിങ്ങൾ               |  2161 |
|                തന്റെ               |  2121 |
|               അവരുടെ              |  2116 |
+-----------------------------------+-------+

See the full wordlist here. (Warning: 7 MB text file ⚠️)

It is clear that the words change according to the material used.
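A word count along these lines could look like the sketch below. It assumes tokens are split on whitespace only and punctuation is left attached, which is a guess based on the entry പറഞ്ഞു. (with its trailing full stop) in the table; the function name `count_words` and the sample sentence are hypothetical.

```python
from collections import Counter


def count_words(text: str) -> Counter:
    """Count whitespace-separated tokens; punctuation stays attached to words."""
    return Counter(text.split())


# Hypothetical sample built from words that appear in the table above.
sample = "ഞാൻ അവൻ ഞാൻ പറഞ്ഞു. ഞാൻ അവൻ"
word_counts = count_words(sample)

for word, n in word_counts.most_common():
    print(word, n)
```

Stripping punctuation before counting would merge entries such as പറഞ്ഞു. and പറഞ്ഞു, so the exact ranking depends on that choice.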

Letters

The Bible text contained 3535260 Malayalam characters. Of these, ് ranked as the most used with a count of 498417, which is about 14% of all Malayalam characters in the source text.

That is roughly one out of every seven characters in this text, so it seems safe to assume that at least one out of every ten characters in Malayalam is a ്.

The reason is that all connected letters (conjuncts) are encoded with a ് between them, and Malayalam has no shortage of connected letters.
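To make this concrete, the sketch below lists the code points of എന്റെ, one of the common words above, using Python's standard `unicodedata` module. The word is written with escapes and assumes the usual encoding of the connected letter as ന + ് + റ; other encodings of the same cluster exist.

```python
import unicodedata

# "എന്റെ" spelled out as escapes so each code point is visible.
# Assumption: the conjunct is encoded as NA + VIRAMA + RRA.
word = "\u0d0e\u0d28\u0d4d\u0d31\u0d46"

for ch in word:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# U+0D4D is the character shown as ് in the tables.
VIRAMA = "\u0d4d"
virama_count = word.count(VIRAMA)
print("virama occurrences:", virama_count)
```

The word looks like three written units but is stored as five code points, one of which is the ് joining the two consonants.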

The second most used Malayalam character was ന, which was used 269217 times, or 7.6% of the total. This letter was particularly interesting because many of my guesses did not include it.

Here are the top letters:

+--------+--------+-----------------------+
| Letter | Count  |       Percentage      |
+--------+--------+-----------------------+
|    ്    | 498417 |   14.098453861950746  |
|   ന    | 269217 |   7.615196619201982   |
|   ു    | 233948 |   6.617561367480751   |
|   ി    | 224958 |   6.363266068125116   |
|   ക    | 216007 |   6.110073940813406   |
|   ത    | 160260 |    4.53318850664449   |
|   ാ    | 138791 |   3.9259064396960905  |
|   യ    | 137908 |   3.900929493163162   |
|   വ    | 127680 |   3.611615553028632   |
|   െ    | 116590 |   3.2979186820771313  |
|   ര    | 112997 |   3.196285421722872   |
|   ട    | 100737 |   2.8494933894536754  |
|   പ    | 94481  |   2.672533279023325   |
|   ം    | 89896  |   2.542839847705685   |
|   മ    | 74402  |   2.104569395178855   |
|   ല    | 67913  |   1.9210185389476302  |
|   ച    | 60989  |   1.725163071457262   |
|   അ    | 60501  |   1.7113592776768893  |
|   ോ    | 55794  |   1.5782148979141561  |
|   റ    | 54747  |   1.5485989715042174  |
+--------+--------+-----------------------+

View full list here

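The 20 characters above account for about 82% of all Malayalam characters in the text, even though they are only 20 of the 128 code points in the block, so the letter counts are consistent with the Pareto principle.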
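The cumulative share can be checked directly from the numbers in the table. The sketch below copies the 20 counts and the total of 3535260 from the text; the variable names are illustrative only.

```python
# Counts of the 20 most frequent characters, copied from the table above.
top_counts = [
    498417, 269217, 233948, 224958, 216007,
    160260, 138791, 137908, 127680, 116590,
    112997, 100737, 94481, 89896, 74402,
    67913, 60989, 60501, 55794, 54747,
]

# Total number of Malayalam characters in the source text, from the text above.
TOTAL_CHARS = 3535260

cumulative = 0
rank_at_80 = None  # first rank at which the running share reaches 80%
for rank, n in enumerate(top_counts, start=1):
    cumulative += n
    share = cumulative / TOTAL_CHARS
    print(f"top {rank:2d}: {share:.1%}")
    if rank_at_80 is None and share >= 0.80:
        rank_at_80 = rank

top20_share = cumulative / TOTAL_CHARS
```

With these counts the running share passes 80% at the 19th character.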

Character distribution in the Bible

Code

Here’s the code