Of Names and Markov Chains

Coming up with even a handful of good names is a difficult matter - ask any writer. There are many sorts of endeavor that need more than just a few names, however: the next great doorstop fantasy novel; settings for pen and paper role-playing games; the same for any number of computer games. Once off into the world of software, there is a temptation to think that perhaps one should be able to generate names on the fly - or at least generate a bunch of raw material that will help with the process of creating names.

A standard first approach to generating names is to reach for some sort of Markov chain implementation. Think of a name as describing a sequence of states in a finite state machine, each state being a single letter. In a Markov chain, the transition to a new state depends only on the current state, e.g. here we could say that the next letter in a name is determined by the current letter. If you choose a set of names to work with as raw material for your Markov chain, you are in effect describing rules - the frequency with which letter X transitions to letter Y.

peter
paul
penny

For example, if we were using the names above as raw material, when the state machine has state "p" it will transition to state "a" one third of the time, and state "e" two thirds of the time. To be useful, however, you need a larger set of names; several dozen would be a good start. Given a large enough set of raw materials, you can use a Markov chain implementation to generate names that sound similar to the names you provided to define your state machine. You will also generate a large number of garbage or nonsensical or otherwise useless candidate names, alas. It is probably best to show what I mean by example here before talking about what can be done about it - so on to code next.

I have put the beginnings of some simple state machine tools for Node.js on Github in a package called state-engines. Why Node.js? Because I've used it for other side-projects that might need name-generating code in the future. There are other Markov chain or state machine packages out there for Node.js, but they are fairly specific to their intended use (e.g. generating blocks of random text), and for the purposes of this little exploration I desired an implementation that can be easily expanded beyond the concept of "a state is a single letter" and put to more general usage than just name generation. So state-engines contains an actual class hierarchy rather than a straightforward Markov chain implementation that is only implemented for strings:

StateEngine
  - MarkovChainStateEngine

State
  - UndefinedState
  - StringState

Converter
  - StringToStringStatesConverter

For simple states such as letters in a name, it seems a little like overkill to specify a Converter implementation in order to process a simple name string into an array of StringState instances, each containing a single letter. However the point of this exercise is that it is then (a) easy to alter the implementation of converter or state or engine in order to more easily experiment with different ways to build names, and (b) the basic infrastructure can be later extended for use with any form of state or state machine:

// A converter performs operations much like this: turning a name into
// a sequence of states, a path through the state machine. Here each
// state is a single letter of the name.
var stateSequence = "peter".split("").map(function(element, index, array) {
  return new StringState(element);
});

Given a Markov chain based state engine, you then generate a name by stepping through state transitions until the state machine transitions to an undefined state - meaning that the rules at that point indicate the end of a name:

// Generation of a name by a state engine using single letter StringState
// instances looks a little like the following code.
var name = [];

// Make sure that the state machine is undefined, and then transition it
// to the first letter of a name. By virtue of the names provided to it the
// state machine will have a rule defining what the first letter of a name
// should be - i.e. a transition from an undefined state to the first defined
// state.
engine.setCurrentStateToUndefined();
engine.transition();

// Now we add letters, one at a time, until we have reached an undefined
// state again, which indicates the end of the name.
while (!engine.currentState.isUndefined) {
  name.push(engine.currentState.representation);
  engine.transition();
}

// And print the result.
console.log("The newly generated name is: " + name.join(""));

To get to the example that demonstrates how this simple letter by letter approach delivers a great many garbage names and only a few useful names, let us create a Markov chain for the names of angels. Any coherent source of similar ancient names will do, but lists of angelic names are easy to find. A test set of names should have a very distinctive character in order to make it easy to see how well your name generator is doing.

var stateEngines = require("state-engines");

// Some names to use as raw material to define state transition probabilities.
var names = [
  "abbadon", "adriel", "ambriel", "amesha spenta", "arariel", "ahriman", "ariel",
  "azazel", "azrael", "abymael", "barachiel", "cassiel", "camael", "dumah",
  "eremiel", "gabriel", "gadreel", "gagiel", "hadraniel", "haniel", "harut", "hesediel",
  "israfel", "jegudiel", "jehoel", "jequn", "jerahmeel", "jophiel", "kasdeja",
  "kiraman katibin", "kushiel", "kosmiel", "leliel", "lucifer", "maalik", "malik", "marut",
  "metatron", "michael", "munkar", "muriel", "nakir", "nuriel", "ophanim", "orifiel",
  "pahaliah", "penemue", "puriel", "qaphsiel", "raguel", "raphael", "raqib", "raziel",
  "remiel", "ridwan", "sachiel", "samael", "sandalphon", "sariel", "selaphiel",
  "seraphiel", "simiel", "shamsiel", "tzaphqiel", "temeluchus", "uriel", "uzziel",
  "yehudiel", "yerachmiel", "zabaniyah", "zachariel", "zadkiel", "zephon", "zophiel"
];

// Create the state engine with a suitable converter and feed it the names.
// This converter turns strings into a sequence of States instances of a
// single letter each, and vice versa.
var converter = new stateEngines.StringToStringStatesConverter();
var engine = new stateEngines.MarkovChainStateEngine(converter, names);

// Generate new names.
for (var index = 0; index < 10; index++) {
  console.log(engine.generateEntity());
}

So what does this give us? I'm sad to say that the following is entirely representative:

hel
jophikamiehrurehan
ziel
membaman
sel
jel
michifel
mazzimusr
sabamel
amazashachamiemifemaran

You'll notice that we managed to generate the name of a Norse mythic figure from a list of Old Testament angels, in addition to a couple of demi-plausible names, and a bunch of garbage. If you run this a few dozen times you might sift out some good, useful names that sound angelic and would otherwise have required inspiration to create. This is in fact a common use for Markov chains and other conceptually similar processes of stochastic rearrangement - jog the brain into directions it wouldn't have taken on its own, and thus help the creative process move forward.

Unfortunately, this is pretty much useless as a sole source of names. You couldn't put a process like this into your game code and blindly accept the output. You could instead use it as a tool to quickly build a few hundred names that you like and then put those names into your software as a static resource - assuming that a few hundred covers the ground sufficiently to avoid repetition when picking from it randomly.

Why is a letter-by-letter Markov chain so useless? In short because it is a monkey with a typewriter and a few constraints: it will manage to accidentally do the right thing some of the time, and the frequency of useful output somewhat depends on the constraints put upon this monkey-typewriter system. Here are a few things to think on, some of which might be solved with suitably clever code, and some of which might be beyond any reasonably cost-effective software at the present time:

No Understanding of Name Length

The undefined state marks the end of a name, and so far as the Markov chain rules are concerned, it's just another thing exactly like a letter. So whether or not a name ends depends on the current letter - and you'll note that many angelic names end with "l". So as soon as "l" crops up, there is a fair chance that the name will come to an end, no matter how short it might be. There is a similar issue with the generation of names that are far too long.

It is perfectly possible to try to address this by extending the concept of a state from "letter" to "letter at position N" - though this would require a much larger set of names as raw materials in order to give a reasonably diverse output. Other, more blunt approach involves setting minimum or maximum lengths to names, though in that case you run into other issues in terms of quality of output: names will cut off in the middle of a syllable.

No Understanding of Letter Position in a Name

While we are thinking of name length, it is worth noting that some letters occur more frequently earlier or later in a name. This is a part of the characteristic shape of names from a certain source. Consider "l" again for angels: you don't tend to see that in the first few letters, for example.

Here again, we could think about expanding states to consider some measure of position, either concrete such as "letter at position N" or a more fuzzy measure of distance from the start of a name. This has issues in the current model, however, as you don't know how close you are to the end of a name when you are simply following a Markov chain of letter transitions. So you could decide on a name length in advance and run two Markov chains, one moving backwards from the end of the name and one running forwards from the start, and see what you end up with when you meet in the middle.

No Understanding of Syllables

All this business about letter positioning is really just a (possibly bad) way of worrying about syllables. Words are not really constructed of letters. Words either have specific meanings or contain the echoes of old meanings - it is those echoes that give names from various cultures or periods their characteristic shape. Look at the list of angelic names again - you'll see one or two syllables showing up in certain places in the names over and again.

Thus we might start worrying more about word order in our state machines. Perhaps we can say that a state is now two letters long except at the beginning and end of a name, and we build names by overlapping an array of states. So a name like "michifel" would break down into these strings:

["m", "mi", "ic", "ch", "hi", "if", "fe", "el", "l"]

This may help generate a greater fraction of useful-looking names as output from the state engine, but again would probably require a larger collection of names as raw material.

No Understanding of Meaning

Names, as mentioned, have meaning. The only reason to care about letter position and syllables is because they have meaning.This is really the kick in the teeth for any method of automation for name generation. There is a large difference between putting together names for a fantasy setting wherein you can wave your hands to say "oh yes, and 'asik' is a traditional suffix for the names of Southern herders" and a fantasy setting such as Middle Earth wherein you already know that cannot be true - where names have to follow more complex and established rules.

That issue of more complex and established rules also exists in settings that are leaning on exotic real world cultures: faux-Medieval-Spains, faux-Caliphates, faux-Imperial-Chinas and so forth. To evoke that background you have to get the meaning right in the names. This is exceedingly hard to arrange by thinking only about letter-by-letter rules.

It is this point of meaning that really drives my recommendation to use name generation as a source of raw material - a way to quickly build a list of names that can then be used as a resource, rather than creating names on the fly in the final code.