Phonemes and Visemes

Overview

A phoneme is the smallest unit of sound in a language that can distinguish one word from another, such as the m sound in Mary or the th sound in thing. A language has many phonemes, and it is easy to get bogged down when working with all of them. Below are some examples of phonemes and their respective facial shapes:

When you are manually lip-syncing a character to dialogue, listen for the different phonemes that make up the words the character is supposed to be saying. Animators normally associate phonemes with the shapes the mouth makes to create various sounds, like an O shape or an A shape. These mouth shapes are technically called visemes.

Disney animators realized a long time ago that using all phonemes was overkill. When creating animation, an artist is not concerned with individual sounds, only with how the mouth looks while making them. Fewer facial positions are needed to represent speech visually because several sounds can be made with the same mouth position. Each viseme therefore stands in for a group of phonemes. How do you know which phonemes to combine into one viseme? Disney animators relied on a chart of 12 archetypal mouth positions to represent speech, as you can see below:

12 Classic Disney Visemes

Visemes are the visual counterparts of phonemes. This is an important concept to understand because one viseme usually has many phonemes (or sounds) associated with it. The m in mom and the p in pop are two distinct phonemes, but most animators will use the same viseme (mouth shape) to represent them in their animation. The number of visemes an animator uses is a personal choice and can vary between 5 and 20, though 10 to 14 is typical. Two example charts are given below.

Chart 1

Chart 2

1. [p, b, m] - Closed lips.
2. [w] & [boot] - Pursed lips.
3. [r*] & [book] - Rounded open lips with corners of lips slightly puckered. If you look at Chart 1, [r] is made in the same place in the mouth as the sounds of #7 below. One of the attributes not denoted in the chart is lip rounding. If [r] is at the beginning of a word, then it fits here. Try saying "right" vs. "car."
4. [v] & [f] - Lower lip drawn up to upper teeth.
5. [thy] & [thigh] - Tongue between teeth, no gaps on sides.
6. [l] - Tip of tongue behind open teeth, gaps on sides.
7. [d, t, z, s, r*, n] - Relaxed mouth with mostly closed teeth with pinkness of tongue behind teeth (tip of tongue on ridge behind upper teeth).
8. [vision, shy, jive, chime] - Slightly open mouth with mostly closed teeth and corners of lips slightly tightened.
9. [y, g, k, hang, uh-oh] - Slightly open mouth with mostly closed teeth.
10. [beat, bit] - Wide, slightly open mouth.
11. [bait, bet, but] - Neutral mouth with slightly parted teeth and slightly dropped jaw.
12. [boat] - Very round lips, slightly dropped jaw.
13. [bat, bought] - Open mouth with very dropped jaw.

The basic purpose of visemes is to give animators a shortcut for quickly animating the face at various levels of detail. When you do not have the time or pipeline available to use phonemes, or you feel they are overkill for the detail level you want to achieve, they can be compressed or flattened into however many visemes you need to get the job done. To see how helpful this information can be when animating a face, take a word like hack. It has four letters, three phonemes, and only two visemes (13 and 9 in the listing).
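The hack example can be sketched as a simple lookup in Python. This is only an illustration under stated assumptions: the phoneme labels (such as "a_bat") are informal spellings made up for this sketch, the table covers just a few entries from the listing, the function name to_visemes is hypothetical, and the phoneme breakdown of hack is supplied by hand. Treating the h sound as having no mouth shape of its own is also an assumption, chosen to match the two-viseme count given above.

```python
# Partial phoneme-to-viseme table. Numbers follow the order of the listing.
# Labels such as "a_bat" are informal and invented for this sketch.
PHONEME_TO_VISEME = {
    "p": 1, "b": 1, "m": 1,   # closed lips
    "f": 4, "v": 4,           # lower lip drawn up to upper teeth
    "y": 9, "g": 9, "k": 9,   # slightly open mouth, mostly closed teeth
    "a_bat": 13,              # open mouth with very dropped jaw
}

def to_visemes(phonemes):
    """Map phonemes to viseme numbers, skipping unmapped sounds and
    collapsing consecutive repeats into a single mouth shape."""
    result = []
    for phoneme in phonemes:
        viseme = PHONEME_TO_VISEME.get(phoneme)
        if viseme is None:
            continue  # assumption: this sound has no distinct mouth shape
        if not result or result[-1] != viseme:
            result.append(viseme)
    return result

hack = ["h", "a_bat", "k"]   # hand-written phoneme breakdown
print(to_visemes(hack))      # three phonemes reduce to visemes 13 and 9
```

Because p, b, and m all share viseme 1, a run of those sounds collapses to a single closed-lips shape in the same way.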

Say that you don't have enough space to include 13 visemes plus whatever emotions you want expressed. By using Chart 1 and the listing of visemes, you can make logical decisions about where to cut. For example, if you only have room for 12 visemes, you can combine visemes 5 and 6, or 6 and 7. For 11 visemes, also combine visemes 7 and 9. For 10, combine visemes 2 and 3. For 9, combine viseme 8 with the new viseme 7/9. For 8, combine visemes 11 and 13.
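The reduction steps above can be written down as an ordered merge list. The following Python sketch is one possible encoding, not an established tool: the names MERGE_STEPS and reduce_visemes are hypothetical, and where the text offers a choice for the first step (5 and 6, or 6 and 7) the sketch arbitrarily picks 5 and 6. Each merged group is represented by its lowest viseme number.

```python
# Merge steps in the order given in the text. The first step is a choice
# in the text (5 and 6, or 6 and 7); this sketch picks 5 and 6.
MERGE_STEPS = [(5, 6), (7, 9), (2, 3), (8, 7), (11, 13)]

def reduce_visemes(target, total=13):
    """Return a dict mapping each original viseme number to the
    representative of its merged group, merging until only `target`
    groups remain (the steps listed here stop at 8 groups)."""
    rep = {v: v for v in range(1, total + 1)}
    count = total
    for a, b in MERGE_STEPS:
        if count <= target:
            break
        keep, drop = sorted((rep[a], rep[b]))
        for v in rep:
            if rep[v] == drop:
                rep[v] = keep  # fold the dropped group into the kept one
        count -= 1
    return rep

groups = reduce_visemes(8)
print(sorted(set(groups.values())))  # the surviving viseme numbers
```

Looking up groups[9] then tells you which remaining mouth shape to use wherever the full listing would have called for viseme 9.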

If you were really pressed for space, you could keep combining and shrink this list further. The most drastic options would be three frames (Open, Closed, and Pursed as in boot) or even a simple two-frame lip flap, open and closed. In that case you would just alternate between open and closed once in a while. This, however, results in animation that isn't very fun or realistic.
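As a rough illustration of the two-frame lip flap, the Python sketch below alternates between open and closed at a fixed interval while the character is speaking. The per-frame speaking flags, the function name lip_flap, and the hold parameter are all assumptions made for this example; "once in a while" is interpreted here as a fixed number of frames, which is only one way to read it.

```python
def lip_flap(speaking, hold=3):
    """Return "open" or "closed" for each frame: closed during silence,
    alternating every `hold` frames while the character is speaking."""
    frames = []
    run = 0  # frames elapsed in the current stretch of speech
    for talking in speaking:
        if not talking:
            frames.append("closed")
            run = 0  # restart the cycle at the next stretch of speech
        else:
            frames.append("open" if (run // hold) % 2 == 0 else "closed")
            run += 1
    return frames

# Hypothetical per-frame flags: silence, four frames of speech, silence.
print(lip_flap([False, True, True, True, True, False], hold=2))
```

Every stretch of speech starts with the mouth open, so even a very short line produces at least one visible flap.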

It helps to understand the hierarchy involved. You can animate with muscles, phonemes, and visemes, and all three can exist together. Facial muscle movements make phonemes, and phonemes collapse into visemes. Take a look at the chart below:

Viseme Problems and Moving Beyond Phonemes

For the high-end user, visemes have some obvious shortcomings. For instance, [p, b, m] differ subtly in animation and shape, mainly in how the mouth moves into the shape. Many animation studios use a muscle-based approach that goes beyond phonemes. Such a system uses muscle simulation to generate phonemes, and then the phonemes to generate visemes, in a sort of procedural stack that is user-editable at any time. Weta used this kind of 'combination sculpting' to create and animate Gollum's facial expressions.
