Table of contents
- What are KenLM and ARPA?
- Output introduction
KenLM is a language modeling toolkit introduced in “KenLM: Faster and Smaller Language Model Queries”. It significantly reduces both query time and memory use. For more information, please refer to the link.
Definition: “Statistical language models describe probabilities of the texts, they are trained on large corpora of text data. They can be stored in various text and binary formats, but the common format supported by language modeling toolkits is a text format called ARPA format. This format fits well for interoperability between packages. It is not as efficient as most efficient binary formats though, so for production, it is better to convert ARPA to binary.”
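To make the layout concrete, here is a minimal sketch of a toy ARPA file and a small reader for it. The numbers, the `TOY_ARPA` string, and the `parse_arpa` helper are all invented for illustration. This is not KenLM's own loader, and it assumes a well-formed file with tab- or space-separated fields.

```python
import re

# Toy ARPA text. Every number here is made up for illustration only.
TOY_ARPA = """\\data\\
ngram 1=4
ngram 2=3

\\1-grams:
-1.0\t<unk>
-99\t<s>\t-0.5
-0.8\t</s>
-0.6\thello\t-0.3

\\2-grams:
-0.2\t<s> hello
-0.4\thello </s>
-0.7\thello hello

\\end\\
"""


def parse_arpa(text):
    """Return (declared_counts, ngrams).

    ngrams[n] maps a tuple of n words to (log10_prob, log10_backoff).
    A missing backoff column is read as 0.0, i.e. a weight of 1.
    """
    counts, ngrams, order = {}, {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "\\data\\":
            continue
        if line == "\\end\\":
            break
        header = re.fullmatch(r"ngram (\d+)=(\d+)", line)
        if header and order is None:
            counts[int(header.group(1))] = int(header.group(2))
            continue
        section = re.fullmatch(r"\\(\d+)-grams:", line)
        if section:
            order = int(section.group(1))
            ngrams[order] = {}
            continue
        if order is None:
            continue  # ignore anything unexpected before the first section
        fields = line.split()
        log_prob = float(fields[0])
        words = tuple(fields[1:1 + order])
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else 0.0
        ngrams[order][words] = (log_prob, backoff)
    return counts, ngrams


counts, ngrams = parse_arpa(TOY_ARPA)
print(counts)
print(ngrams[1][("hello",)])
```

Each n-gram line holds the log10 probability first, then the word or words, then an optional backoff weight. The two sections below describe the first and last of these fields.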
First number below 1-grams, “p” in the line
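This number is the probability of the n-gram, stored as a base-10 logarithm (log10). Taking the logarithm turns very small probabilities such as 0.0000xx into negative numbers of ordinary magnitude, which are easier to store and can be added instead of multiplied.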
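A short sketch of why the log10 form helps, using made-up values: converting a stored value back to a probability is a single power of ten, and summing logs avoids the floating-point underflow that a long product of small probabilities runs into.

```python
import math

# Hypothetical value as it might appear at the start of an ARPA line.
log10_prob = -4.3
prob = 10 ** log10_prob  # back to a plain probability, roughly 0.00005

# A sentence score combines many small probabilities.
probs = [1e-5] * 100

# Multiplying them directly underflows to 0.0 in double precision.
product = math.prod(probs)

# Adding their log10 values stays comfortably in range (about -500).
log_sum = sum(math.log10(p) for p in probs)

print(prob, product, log_sum)
```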
The second number below 1-grams, “w” in the line
This number is the backoff weight, which is also stored as log10. It lets the model assign a probability to a word sequence that does not appear in the model as a full n-gram. The formal explanation is quoted below. “In particular, all of our decoders use some form of n-gram grammar. Since it is (usually) impossible to generate a probability for every possible n-gram, a backoff strategy must be applied: to calculate the probability of a missing n-gram, a backoff weight is multiplied with the (n-1)-gram. There are different methods for calculating these backoff weights; see How do I build an n-gram grammar for noway or chronos? for details on how to do this with the SRILM toolkit.”
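The backoff rule in the quote can be sketched as a small lookup. Because ARPA values are log10, the "multiplication" becomes an addition. The `MODEL` table and the `log10_score` helper are hypothetical and the numbers are invented; this is a simplified illustration of the standard lookup, not KenLM's implementation.

```python
# Toy model: word tuple -> (log10 probability, log10 backoff weight).
MODEL = {
    ("<unk>",): (-2.0, 0.0),
    ("the",): (-1.0, -0.4),
    ("cat",): (-1.5, -0.3),
    ("dog",): (-1.6, -0.2),
    ("the", "cat"): (-0.5, 0.0),
}


def log10_score(model, context, word):
    """Return log10 P(word | context) using backoff.

    If the full n-gram is stored, use its probability. Otherwise add the
    backoff weight of the context (0.0 if the context is not stored) and
    retry with a context shortened by its oldest word.
    """
    full = context + (word,)
    if full in model:
        return model[full][0]
    if not context:
        # Word is not in the vocabulary at all: fall back to <unk>.
        return model[("<unk>",)][0]
    backoff = model.get(context, (0.0, 0.0))[1]
    return backoff + log10_score(model, context[1:], word)


seen = log10_score(MODEL, ("the",), "cat")     # stored bigram, used directly
backed = log10_score(MODEL, ("the",), "dog")   # backoff("the") + log10 P("dog")
unknown = log10_score(MODEL, ("the",), "bird") # backoff("the") + log10 P(<unk>)
print(seen, backed, unknown)
```

Here "the dog" is not stored as a bigram, so its score is the backoff weight of "the" plus the unigram score of "dog".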
Other important comments
There are three “special” words in a language model: <s>, </s>, and <unk>. The <s> denotes the beginning of a sentence, and the </s> denotes the end of a sentence. The special word <unk> means “unknown” and is used in the language model to represent the probability of a word not in the model. Please see the reference.
If you insert a word into the ARPA file, remember to also insert the special words in the bi-grams, tri-grams, and so on. Otherwise, the ARPA model will be broken.
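One way to catch a broken hand edit early is a small consistency check. The sketch below rests on two assumptions about what "broken" can mean: the counts in the `\data\` header no longer match the number of lines in each section, or a word appears in a higher-order n-gram without a 1-gram entry. The `check_arpa` helper and the sample strings are hypothetical, and this is not the validation KenLM itself performs.

```python
def check_arpa(text):
    """Return a list of human-readable problems found in ARPA text."""
    declared, seen, vocab, used = {}, {}, set(), set()
    order = None
    for line in (raw.strip() for raw in text.splitlines()):
        if not line or line in ("\\data\\", "\\end\\"):
            continue
        if line.startswith("ngram ") and order is None:
            n, count = line[len("ngram "):].split("=")
            declared[int(n)] = int(count)
        elif line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:-len("-grams:")])
            seen[order] = 0
        elif order is not None:
            words = line.split()[1:1 + order]
            seen[order] += 1
            (vocab if order == 1 else used).update(words)
    problems = [
        f"{n}-grams: header says {declared.get(n)}, found {seen.get(n, 0)}"
        for n in sorted(set(declared) | set(seen))
        if declared.get(n) != seen.get(n, 0)
    ]
    problems += [
        f"'{w}' is used in a higher-order n-gram but has no 1-gram entry"
        for w in sorted(used - vocab)
    ]
    return problems


# Invented sample pieces; the numbers carry no meaning.
HEADER = "\\data\\\nngram 1=3\nngram 2=1\n\n"
UNIGRAMS = "\\1-grams:\n-1.0\t<unk>\n-99\t<s>\t-0.5\n-0.8\t</s>\n"
BIGRAMS = "\n\\2-grams:\n-0.2\t<s> hello\n\n\\end\\\n"

# "hello" was added to the bigrams only.
missing_unigram = HEADER + UNIGRAMS + BIGRAMS
# "hello" was added to the unigrams, but the header count was not updated.
stale_header = HEADER + UNIGRAMS + "-0.6\thello\t-0.3\n" + BIGRAMS

print(check_arpa(missing_unigram))
print(check_arpa(stale_header))
```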
This record is a reminder for practitioners who would like to explore KenLM. It should reduce the time needed to understand the terms and symbols in an ARPA model.