Part-of-Speech Tagging Using Hidden Markov Model (HMM) & The Viterbi Algorithm
Project Goal: To be able to predict the part-of-speech tag for each word in an unseen book text.
Personal Goal: To apply my knowledge of the Hidden Markov Model and the Viterbi algorithm to predict the POS tag for new content after training on previous texts.
POS Tags
The Hidden Markov Model (HMM) is built on one key assumption: the future is independent of the past given the present, or in other words, only what happens today affects what happens tomorrow. In the HMM this means the probability of the current state given the previous state is the same at every time step; because these dynamics do not change over time, the process is called stationary. In this project, the HMM is used to find the most likely sequence of hidden states (POS tags) given a sequence of observations (words) via the Viterbi algorithm. The process is split into training, where the program learns the POS tags from a dataset, and prediction, where it makes educated guesses for never-before-seen data.
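Written out more formally (this notation is my own shorthand, not taken from the project, with q_t standing for the hidden POS tag at position t), the two assumptions are:

Markov assumption: P(q_t | q_1, ..., q_{t-1}) = P(q_t | q_{t-1})
Stationarity: P(q_t = tag_j | q_{t-1} = tag_i) = a_ij, the same value for every time step t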
TRAINING:
To begin training, the pos_tagging class makes an in-class function call that reads, learns, and records the data from the training files. The actual training consists of compiling matrices of information from the training files that are used later to make the prediction. More specifically, the find_init_prob function compiles two matrices: the transition matrix and the observation matrix. The transition matrix's entries represent the likelihood of a specific POS tag being followed by another POS tag (its dimensions are # of POS tags by # of POS tags). For example, a row of the transition matrix states that the tag AAA has probability 'prob1' of being followed by the AAA tag, 'prob2' of being followed by the BBB tag, 'prob3' of being followed by the CCC tag, and so on. The observation matrix, on the other hand, holds the likelihood of a specific tag appearing with a word in the text (unique words only), so its dimensions are # of POS tags by # of unique words. A row thus states how likely tag AAA is to appear with the word 'Once', 'upon', 'a', 'time', and every other unique word. It is called the observation probability matrix because it describes the relationship between tags and words, while the transition table earns its name because it relates the current state to the next state. Lastly, there is also an init_prob matrix which represents how likely each tag is to appear at the start of a sentence (dimensions # of POS tags by 1).
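To illustrate this counting step, here is a minimal Python sketch. It assumes the training data has already been parsed into (word, tag) pairs per sentence; the function name build_matrices and its count-and-normalize logic are my own stand-ins for what find_init_prob does internally, not the project's actual code.

def build_matrices(tagged_sentences, tags, vocab):
    """Count-and-normalize construction of the three probability tables.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    tags: list of all POS tags; vocab: list of all unique training words.
    """
    tag_idx = {t: i for i, t in enumerate(tags)}
    word_idx = {w: i for i, w in enumerate(vocab)}

    init = [0.0] * len(tags)                    # how often each tag starts a sentence
    trans = [[0.0] * len(tags) for _ in tags]   # tag -> next-tag counts
    obs = [[0.0] * len(vocab) for _ in tags]    # tag -> word counts

    for sent in tagged_sentences:
        prev = None
        for k, (word, tag) in enumerate(sent):
            t = tag_idx[tag]
            obs[t][word_idx[word]] += 1
            if k == 0:
                init[t] += 1                    # tag opened a sentence
            else:
                trans[prev][t] += 1             # previous tag followed by this tag
            prev = t

    def normalize(row):
        # Turn raw counts into probabilities (simple MLE, no smoothing).
        total = sum(row)
        return [c / total if total else 0.0 for c in row]

    init = normalize(init)
    trans = [normalize(row) for row in trans]
    obs = [normalize(row) for row in obs]
    return init, trans, obs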
PREDICTION:
There are three main functions that deal with the prediction stage:
Viterbi Algorithm: This function follows a mathematical expression that combines the probabilities from the previous time step, the transition probability, and the observation probability for the current time step to estimate how likely each possible POS tag is for the current observation. The output is saved in two matrices: 1) prob: the matrix holding the actual probabilities of all POS tags for each word of the sentence; its dimensions are # of words in the sentence by # of POS tags, so a row is a word's probability of being associated with each POS tag. 2) prev: this matrix holds, for each state, a pointer to the previous state that led to it (same dimensions as prob). For example, if the POS tag BBB is found for the word 'end', its entry would be a pointer to the prediction that produced the POS tag AAA for 'the'. These pointers are stored as the indices of the POS tags in the prob matrix. The function starts with the first state (the start of the sentence), which has no prior state, so for a word seen in training we use the init_prob table together with the observation table, while for a new word we just use the init_prob table. Subsequently, a loop runs over all the following words in the sentence, combining the transition table, the observation table, and the prob row of the previous word; for new words we simply drop the observation term. Once this is done, the matrices are normalized.
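A minimal Python sketch of this recursion is shown below; the function signature, the handling of unseen words, and the per-row normalization are assumptions based on the description above, not the project's actual implementation.

def viterbi(sentence, tags, word_idx, init, trans, obs):
    """One Viterbi pass over a sentence.

    Returns prob (len(sentence) x len(tags)) and prev (same shape), where
    prev[w][t] is the tag index at word w-1 that best led to tag t at word w.
    """
    n_tags = len(tags)
    prob = [[0.0] * n_tags for _ in sentence]
    prev = [[None] * n_tags for _ in sentence]

    def normalize(row):
        total = sum(row) or 1.0      # avoid division by zero; keeps values from underflowing
        return [x / total for x in row]

    # First word: initial probabilities, times the observation term if the word was seen in training.
    first = sentence[0]
    for t in range(n_tags):
        emit = obs[t][word_idx[first]] if first in word_idx else 1.0
        prob[0][t] = init[t] * emit
    prob[0] = normalize(prob[0])

    # Remaining words: best previous tag times the transition (and observation if the word is known).
    for w in range(1, len(sentence)):
        word = sentence[w]
        for t in range(n_tags):
            emit = obs[t][word_idx[word]] if word in word_idx else 1.0
            best_prev, best_score = 0, -1.0
            for p in range(n_tags):
                score = prob[w - 1][p] * trans[p][t]
                if score > best_score:
                    best_prev, best_score = p, score
            prob[w][t] = best_score * emit
            prev[w][t] = best_prev
        prob[w] = normalize(prob[w])

    return prob, prev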
Find Sequence Function: It performs backtracking to find the desired sequence. This function returns a list of indices giving the most likely POS tag sequence for the current sentence, each tag represented by its respective index. The function uses the Viterbi prob and prev matrices to find the sequence. First, the index of the tag with the highest probability for the last word of the sentence is read from the prob matrix. Then, a while loop propagates backward (up the matrices) from the last word to the first word. This is achieved by looking up the index that led to the current prediction in the prev matrix, i.e. selecting the prev entry at the current word's row and the current tag index's column (recall that each prev entry stores the index of the tag chosen for the previous word). That index is added to the indices list and becomes the current tag index for the next step backward.
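The backtracking step can be sketched in a few lines of Python; this is an illustrative version working off the prob and prev matrices described above, not the project's code.

def find_sequence(prob, prev):
    """Backtrack over the Viterbi tables.

    Returns the list of tag indices for the most likely sequence,
    ordered from the first word to the last word of the sentence.
    """
    last = len(prob) - 1
    # Start from the most probable tag for the last word.
    current = max(range(len(prob[last])), key=lambda t: prob[last][t])
    indices = [current]

    # Walk backwards: prev[w][current] is the tag index chosen for word w-1.
    w = last
    while w > 0:
        current = prev[w][current]
        indices.append(current)
        w -= 1

    indices.reverse()
    return indices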
Predict Function: This function calls the Viterbi algorithm and the find sequence function discussed above. It begins by opening the new, unseen file containing the text. At the start of the function I also implemented a time fail-safe that outputs a prediction with whatever has been processed so far in case the program is running out of time. The main for loop appends the words it sees to a list until it comes across a period. Once a period is seen, the program predicts the POS tags for the current sentence. Making many small predictions instead of one big prediction at the end of the text minimizes the memory needed and decreases the time spent indexing matrices. Each prediction is performed by calling the Viterbi algorithm on the sentence at hand and then passing the prob and prev matrices it outputs to the find sequence function. After adding the POS tag list returned by the find sequence function to the prediction history, the for loop repeats this process for the next sentence. Lastly, in case the given text does not end with a period, the same procedure is applied to the remaining words (the repeated code blocks in the image are for demonstration purposes).
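Putting the pieces together, here is a rough Python sketch of that sentence-by-sentence loop. The time fail-safe is omitted, and the file paths, function names, and the output separator are illustrative assumptions rather than the project's exact code.

def predict(input_path, output_path, tags, word_idx, init, trans, obs):
    """Tag one sentence at a time, splitting the word stream on periods."""
    with open(input_path) as f:
        words = [line.strip() for line in f if line.strip()]

    predictions = []
    sentence = []
    for word in words:
        sentence.append(word)
        if word == ".":
            # Run Viterbi on the completed sentence, then backtrack for the tag sequence.
            prob, prev = viterbi(sentence, tags, word_idx, init, trans, obs)
            predictions.extend(tags[i] for i in find_sequence(prob, prev))
            sentence = []

    # Handle a trailing sentence that does not end with a period.
    if sentence:
        prob, prev = viterbi(sentence, tags, word_idx, init, trans, obs)
        predictions.extend(tags[i] for i in find_sequence(prob, prev))

    with open(output_path, "w") as f:
        for word, tag in zip(words, predictions):
            f.write(f"{word} {tag}\n")   # the real output file's separator may differ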
Viterbi Algorithm
Find Sequence Function
Prediction Function
Input & Output:
Input: A text file with a book's text, written one word or punctuation token per line. (~50k words)
Output: A text file in which each word is followed by the POS tag the HMM predicts for it. (~50k tag predictions)
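For illustration only (the tag names below follow the AAA/BBB placeholder convention used earlier, and the exact separator in the real output file may differ):

Input file:      Output file:
Once             Once AAA
upon             upon BBB
a                a CCC
time             time AAA
.                . BBB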
Github: Input File
Github: Output File
Since this is a school-related project, I am unable to share all of the code publicly, but feel free to contact me if you would like the full code.