Training a computer to properly process and understand human languages is of course a very complex problem in AI. Despite this, getting one to “talk” is not too difficult with some training data and simple probabilistic methods. (If you’re familiar with IRC chatterbots, or Reddit’s /r/SubredditSimulator, then you’ve seen this idea in action.) In this article I’m going to discuss how I decided to implement such a system (source code).
The basics
At the heart of this method is an object which I will call a “predictor.” A predictor is trained by observations of the form \((x, y)\). Given an \(x\) it has previously observed, it will return a random one of the seen \(y\)’s, with probability based on the frequency with which it has observed \((x, y)\).
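In code, a predictor is little more than a map from each observed \(x\) to a frequency count of the \(y\)’s seen alongside it. Here is a minimal Python sketch (the class and method names are my own choices for illustration):

```python
import random
from collections import defaultdict, Counter

class Predictor:
    """Remembers, for each observed x, how often each y has been seen with it."""

    def __init__(self):
        self._counts = defaultdict(Counter)

    def observe(self, x, y):
        # Record one observation (x, y).
        self._counts[x][y] += 1

    def predict(self, x):
        # Return a random y seen with x, weighted by how often it was observed.
        # x must have been observed at least once before.
        ys = self._counts[x]
        return random.choices(list(ys), weights=list(ys.values()))[0]

    def __contains__(self, x):
        # True if x has been observed as an input.
        return x in self._counts
```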
A very simple sentence generator can then be made as follows. Suppose we have a dataset of many lines of text which we want to imitate. Let \(f\) be a predictor. Take a line from the dataset, split it into words and surround it with start and end symbols \(A, Z\) to obtain a list \(A, w_1, w_2, \dots, w_k, Z\). Then, train \(f\) with the observations \((A, w_1), (w_1, w_2), (w_2, w_3), \dots, (w_{k-1}, w_k), (w_k, Z)\). Repeat this for each line in the dataset.
To generate a line of text, let \(w_1 = f(A)\). If \(w_1\) is not \(Z\), let \(w_2 = f(w_1)\). Repeat this iteratively. This builds a sequence of words \(w_1, w_2, \dots, w_k\) we can stitch back together.
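Putting the two pieces together, here is a sketch of this naive generator using the `Predictor` class above (`START` and `END` stand in for \(A\) and \(Z\)):

```python
START, END = object(), object()   # dummy start/end symbols A and Z

def train_simple(f, lines):
    # Feed f every adjacent pair of words, including the start/end symbols.
    for line in lines:
        words = [START] + line.split() + [END]
        for prev, cur in zip(words, words[1:]):
            f.observe(prev, cur)

def generate_simple(f):
    # Walk the chain from START until END is produced.
    words = []
    w = f.predict(START)
    while w is not END:
        words.append(w)
        w = f.predict(w)
    return " ".join(words)
```

For example, `f = Predictor(); train_simple(f, open("corpus.txt")); print(generate_simple(f))` would print one imitation line, assuming a hypothetical corpus.txt with one training sentence per line.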
Mathematical aside: Text generators utilising the type of method in this article are often said to be based on Markov chains. This is because when we iteratively reapply \(f\) to its own predictions (as just discussed), we obtain a Markov process in the state space of words. In essence, \(f\) is a state transition function with probabilities trained with data.
Unfortunately though, this simplified method has some shortcomings.
1. The generated text will almost certainly make no sense. In human languages, the choice of a word depends on far more than just the previous word in the sentence, which is all that this method considers.
2. It’s completely random – we cannot choose to generate responses containing a specified word. Having this would allow us to “converse” with our bot.
3. It doesn’t handle noisy data well. For example, differing formatting between two otherwise-identical words causes them to be treated differently: “HI!” and “hi.” will be learned as different words, when we may wish them to be interchangeable. If we could discard some information from the words, we would have more inter-connectivity in our Markov chain and we would see more original responses. However, we would like the responses themselves to still contain this extra information.
An improved method
To solve these problems, I developed the following system. From what I’ve seen, most text generators of this kind work in a similar fashion.
Firstly we choose a “normalising” function \(N(w)\). This is intended to be applied to a word to strip it of non-grammatical information such as case, superfluous punctuation, etc. \(N(w)\) is called the “norm” of \(w\).
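As one possible choice (an illustration only, not necessarily the normaliser used in the final implementation), we could fold case and strip surrounding punctuation:

```python
import string

def norm(word):
    # Fold case and strip surrounding punctuation, so that
    # "HI!" and "hi." both normalise to "hi".
    return word.strip(string.punctuation).lower()
```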
For each line in our dataset, split it as before into \(A, w_1, w_2, \dots, w_k, Z\), where \(A, Z\) are dummy start and end words. We assume \(k \ge 2\).
We train four predictors, called \(\fwd\), \(\back\), \(\case\), and \(\seed\) (a code sketch of this training step follows the list below).
- \(\fwd\) is trained with the observations \(((N(w_1), N(w_2)), w_3), \dots, ((N(w_{k-1}), N(w_k)), Z)\)
- \(\back\) is trained with the observations \(((N(w_k), N(w_{k-1})), w_{k-2}), \dots, ((N(w_2), N(w_1)), A)\)
- \(\seed\) is trained with the observations \((N(w_1), w_2), \dots, (N(w_{k-1}), w_k)\)
- \(\case\) is trained with the observations \((N(w_1), w_1), \dots, (N(w_k), w_k)\)
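Continuing the earlier sketches (the `Predictor` class, `norm`, `START` and `END` from above), the training step might look like this:

```python
fwd, back, seed, case = Predictor(), Predictor(), Predictor(), Predictor()

def train_line(line):
    words = line.split()
    if len(words) < 2:                 # we assumed k >= 2
        return
    norms = [norm(w) for w in words]

    for i in range(len(words) - 1):
        # fwd: ((N(w_i), N(w_{i+1})), w_{i+2}), ending with END
        nxt = words[i + 2] if i + 2 < len(words) else END
        fwd.observe((norms[i], norms[i + 1]), nxt)

        # back: ((N(w_{i+1}), N(w_i)), w_{i-1}), ending with START
        prev = words[i - 1] if i > 0 else START
        back.observe((norms[i + 1], norms[i]), prev)

        # seed: (N(w_i), w_{i+1})
        seed.observe(norms[i], words[i + 1])

    # case: (N(w_i), w_i) for every word in the line
    for n, w in zip(norms, words):
        case.observe(n, w)
```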
After processing the whole dataset, we can generate a sentence like so (a code sketch follows the list):
- Choose an observed norm \(n_0\). Let \(w_1 = \seed(n_0)\) and \(w_0 = \case(n_0)\).
- Let \(w_2 = \fwd((n_0, N(w_1)))\). If \(w_2\) is not \(Z\), let \(w_3 = \fwd((N(w_1), N(w_2)))\). Repeat iteratively to give a sequence of words \(w_2, \dots, w_k\).
- Let \(w_{-1} = \back((N(w_1), n_0))\). If \(w_{-1}\) is not \(A\), let \(w_{-2} = \back((n_0, N(w_{-1})))\). Repeat iteratively to give a sequence of words \(w_{-1}, \dots, w_{-j}\).
- Form the response from the words \(w_{-j}, \dots, w_{-1}, w_0, w_1, w_2, \dots, w_k\).
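In code, again building on the sketches above (\(n_0\) must be a norm that \(\seed\) has observed):

```python
def generate_from(n0):
    # The middle of the sentence: a word with norm n0, and a word seen to follow it.
    w0 = case.predict(n0)
    w1 = seed.predict(n0)

    # Walk forwards with fwd until the end symbol appears.
    forwards = [w1]
    pair = (n0, norm(w1))
    while True:
        nxt = fwd.predict(pair)
        if nxt is END:
            break
        forwards.append(nxt)
        pair = (pair[1], norm(nxt))

    # Walk backwards with back until the start symbol appears.
    backwards = []
    pair = (norm(w1), n0)
    while True:
        prev = back.predict(pair)
        if prev is START:
            break
        backwards.append(prev)
        pair = (pair[1], norm(prev))

    return " ".join(list(reversed(backwards)) + [w0] + forwards)
```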
How it works
This is significantly more complex, so let’s explore it.
The predictor \(\fwd\) should be compared to \(f\) in the simple example we discussed earlier. We see there are two changes.
Firstly, it is now learning what word to give based on information from the two previous words, instead of only the one. This provides a partial solution to problem #1 from earlier. It is still very naive, but now when generating a response it considers a larger context for each word choice. The result is more intelligible (but still ultimately nonsensical) output.
Secondly, the inputs of the observations are based on norms, not words; they are \((N(w_i), N(w_{i+1}))\), not \((w_i, w_{i+1})\). As I mentioned, the normalising function \(N\) is meant to discard any extra non-grammatical information of a word. This change means that given an input norm pair, \(\fwd\) will have more choice for words to return, from observations from different word pairs that have the same norm. By learning only with the more essential information, the output will be more varied and interesting. This solves problem #3.
The other three predictors are purely for solving problem #2. Given a norm \(n_0\), we wish to build a sentence which is guaranteed to contain some word \(w_0\) such that \(N(w_0) = n_0\). Firstly, we require \(\case\) to turn this norm back into some word. We also require \(\seed\) to give us some word \(w_1\) which has been observed to follow \(n_0\). Now we have a pair of norms \((n_0, N(w_1))\) to kick off generation of the rest of the words using \(\fwd\) and \(\back\).
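For instance, a very simple way to “reply” to an incoming message (the keyword-picking strategy here is only an illustration, built on the sketches above):

```python
def reply(message):
    # Pick a word from the message whose norm the bot has seen as a seed input,
    # then build a sentence guaranteed to contain a word with that norm.
    known = [n for n in (norm(w) for w in message.split()) if n in seed]
    if not known:
        return None                    # nothing familiar to latch onto
    return generate_from(random.choice(known))
```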
For more details, or a working implementation, you can see the source to my own Python module called Chatter.