Very handy info about lipsync
info copied from:

TIP 1 - Start at 1x.
Give yourself the chance to see what looks right or wrong at a glance. It’s useful to know if small errors can be caught at normal speed.
There will be specific moments where something looks off, but you can’t tell why. Take note of these moments.
TIP 2 - Slow it down.
Speech happens so quickly that it is extremely difficult to pinpoint what went right or wrong without manipulating the playback speed.
After performing the initial spot check, slow things down and check the moments that felt “off.” Identify which phonemes occurred in those moments, and inspect them for signs of success and failure.
Even if nothing felt “off” at 1x, it is still prudent to carefully search for common errors or signs of weakness. (We will go over what to look out for in the upcoming sections.)
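When you note an “off” moment at 1x, it can help to translate the timestamp into a small range of frames to step through slowly. The helper here is a minimal sketch: the function name, the 24 fps default, and the quarter-second window are assumptions chosen for illustration, not values from any particular tool.

```python
def frames_to_inspect(timestamp_s, fps=24.0, window_s=0.25):
    """Return the frame indices within +/- window_s of a noted timestamp.

    fps and window_s are illustrative defaults (assumptions).
    """
    center = round(timestamp_s * fps)
    half_window = round(window_s * fps)
    first = max(center - half_window, 0)  # never go before frame 0
    last = center + half_window
    return list(range(first, last + 1))


# A moment that felt "off" two seconds into the clip.
frames = frames_to_inspect(2.0)
print(frames[0], frames[-1])
```

Stepping through a short, fixed range like this makes it less likely that an error just before or after the noted moment gets missed.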
TIP 3 - Ride the waveform.
After determining the phonemes you want to evaluate, you will often need to take extra precautions to make sure you are judging the correct frame. To ensure you are in the correct spot, use clues from the waveform. This part is quite critical. If you are a fraction of a second too early or too late, it will throw off your whole inspection and your findings will be irrelevant.
While there is no exact formula for what a waveform should look like for a particular phoneme, there are distinguishable properties between vowels, fricatives, nasals, etc.
Below you will find imagery that can help give you an idea of how these features visually translate:
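One rough way to turn these waveform clues into numbers is to compare loudness with zero-crossing rate: vowels tend to be loud with relatively few zero crossings, while voiceless fricatives such as s tend to be quieter and noisier. The sketch here only illustrates that idea. The function names and both thresholds are assumptions, the test signals are synthetic stand-ins rather than real speech, and real recordings would need tuning.

```python
import math
import random


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / max(len(samples) - 1, 1)


def rms_energy(samples):
    """Root-mean-square amplitude of the window."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def rough_sound_class(samples, zcr_threshold=0.25, silence_threshold=0.01):
    """Very rough guess at what a window contains.

    Both thresholds are illustrative assumptions, not calibrated values.
    """
    if rms_energy(samples) < silence_threshold:
        return "silence"
    if zero_crossing_rate(samples) > zcr_threshold:
        return "fricative-like"
    return "vowel-like"


# Synthetic stand-ins: a 150 Hz tone for a vowel, quiet noise for a fricative.
sample_rate = 16000
vowel = [0.5 * math.sin(2 * math.pi * 150 * n / sample_rate) for n in range(800)]
rng = random.Random(0)
fricative = [0.1 * rng.uniform(-1, 1) for _ in range(800)]
silence = [0.0] * 800

for name, window in [("vowel", vowel), ("fricative", fricative), ("silence", silence)]:
    print(name, rough_sound_class(window))
```

A classifier this crude will not identify individual phonemes, but it can confirm that the frame you are about to judge sits on the kind of sound you think it does.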
TIP 4 - Observe the relationship.
Analyzing the correct frame is important, but single frames won’t tell you the whole story. Since viseme shapes are relative and not absolute, you need to observe the relationship between the target shape and the surrounding sounds.
For example, if you are looking at the word “gloss” and evaluating the “ss,” you will want to ensure that the “ss” jaw position is more closed than (or at least not more open than) the jaw position for the “o” in “gloss,” AKA the [a] phoneme. Visualizing the transition from a more open position to a more closed position is important.
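The “gloss” comparison can be written as a relative check rather than an absolute one. This is a minimal sketch that assumes you already have some normalized jaw-opening measurement per phoneme; the function name, the tolerance, and the numbers are all made up for illustration.

```python
def not_more_open(target_opening, reference_opening, tolerance=0.02):
    """True if the target jaw opening is no more open than the reference.

    The small tolerance (an assumption) allows for measurement noise.
    """
    return target_opening <= reference_opening + tolerance


# Hypothetical normalized jaw openings (0 = closed, 1 = fully open)
# sampled at the phonemes of "gloss". The values are invented.
gloss = {"g": 0.20, "l": 0.25, "a": 0.60, "s": 0.15}

# The "ss" should be more closed than the [a], or at least not more open.
print(not_more_open(gloss["s"], gloss["a"]))
```

Note that the check never asks whether 0.15 is “closed enough” on its own; it only compares the target with its neighbor.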
The CORCTET Method
Now that we’ve gone over each viseme group, the features that differentiate them, and tips & tricks to evaluate lipsync, let’s dive into actionable strategies for evaluating lipsync!
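The following chart is a breakdown of what I have coined the CORCTET Method. The name is an acronym of the chart’s seven column headings: Closed, Open, Rounded, Corners, Teeth, Environment, Tongue. CORCTET covers the basic factors that go into the production of American English phonemes (and likely phonemes from many, if not most, other languages as well).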
CLOSED | OPEN | ROUNDED | CORNERS | TEETH | ENVIRONMENT | TONGUE |
m/b/p | v vs. b | w/r/oo | eh/ih/ee/s/z | f/v | coarticulation | th |
non-m/b/p | n vs. m | ouh | pinched | th | relativity | n/l |
| | aw | distance | s/z | accent / enunciation | a (all) |
| | | | ch/sh/dge/ʒ | yelling / whispering | flat |
| | | | top teeth show | anatomy | stuck out |
| | | | lower teeth show | emotion | raised |
| | | | vertical separation | original dialogue | |
| | | | | something in mouth | |
Expanding upon each category in the above table, we have:
CLOSED
Are the lips closed when they should be?
Cases when the lips should be closed include:
during m production
just before the sound from a p plosive
just before the sound from a b plosive
If a phoneme other than m, b, or p is present, the lips should NOT be closed.
OPEN
Are the lips open when they should be?
Just as important as ensuring the lips are closed for m, b, and p, we need to ensure the lips are open for all other non-m/b/p’s.
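The CLOSED and OPEN checks are two halves of one rule, which can be sketched as a lookup plus a comparison. The names here are hypothetical, and the “observed” labels are assumed to come from a human reviewer judging the right frame (for b and p, the frame just before the sound is released).

```python
FULL_CLOSURE_PHONEMES = {"m", "b", "p"}


def expected_lip_state(phoneme):
    """Closed for m/b/p, open for every other phoneme."""
    return "closed" if phoneme in FULL_CLOSURE_PHONEMES else "open"


def lip_state_mismatches(observations):
    """Return (index, phoneme, observed) for each observation that breaks the rule.

    observations: list of (phoneme, observed_state) pairs, where
    observed_state is "closed" or "open".
    """
    return [
        (index, phoneme, observed)
        for index, (phoneme, observed) in enumerate(observations)
        if observed != expected_lip_state(phoneme)
    ]


# Hypothetical hand-labelled observations for the word "map".
observations = [("m", "closed"), ("a", "open"), ("p", "open")]
print(lip_state_mismatches(observations))
```

In this invented example the final p is flagged because the lips never closed before its release.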
ROUNDED
Are the lips rounding for the appropriate phonemes (w, r, oo), or are they rounding for phonemes that generally employ lip stretching/pinching (eh, ih, ee)?
CORNERS
Are the corners stretching or pinching for the appropriate phonemes (eh, ih, ee, and also s/z), or are they doing so for phonemes that generally require lip rounding (w, r, oo)?
TEETH
f/v - Are the top or bottom teeth showing?
th - Are the teeth closed or open? If open, how vertically separated are they?
s/z & ch/sh/dge/ʒ - Are the teeth touching/nearly touching or greatly vertically separated?
ENVIRONMENT
coarticulation - Are the lips rounding for an m/b/p? If so, does the following sound require lip rounding? If not, this may be inefficient! Efficiency influences how natural the lipsync looks.
relativity - Compare the jaw opening across transitions, e.g. s/z → a (as in “fall”) vs. a → s/z. The s/z should read as more closed than the a, or at least not more open.
yelling/whispering/something in mouth
anatomy
accent/enunciation
emotion
original dialogue
TONGUE
th - Is the tongue showing between the teeth or is there a visible emptiness between the teeth?
n/l or a (as in “fall”) - Is the tongue raised upward, protruding, or lying flat?
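For reviews that cover many shots, the seven categories can be kept as a simple checklist structure. This is a sketch of one possible layout: the item names are taken from the chart, while the helper names and the idea of recording each item as pass/fail/unchecked are assumptions.

```python
# Category -> items to look at, in the order the acronym spells them.
CORCTET = {
    "CLOSED": ["m/b/p", "non-m/b/p"],
    "OPEN": ["v vs. b", "n vs. m"],
    "ROUNDED": ["w/r/oo", "ouh", "aw"],
    "CORNERS": ["eh/ih/ee/s/z", "pinched", "distance"],
    "TEETH": ["f/v", "th", "s/z", "ch/sh/dge/ʒ", "top teeth show",
              "lower teeth show", "vertical separation"],
    "ENVIRONMENT": ["coarticulation", "relativity", "accent / enunciation",
                    "yelling / whispering", "anatomy", "emotion",
                    "original dialogue", "something in mouth"],
    "TONGUE": ["th", "n/l", "a (all)", "flat", "stuck out", "raised"],
}


def blank_review():
    """A fresh review: every item starts as None (not yet checked)."""
    return {cat: {item: None for item in items} for cat, items in CORCTET.items()}


def unchecked_items(review):
    """List (category, item) pairs that have not been marked True or False."""
    return [
        (cat, item)
        for cat, items in review.items()
        for item, result in items.items()
        if result is None
    ]


review = blank_review()
review["CLOSED"]["m/b/p"] = True  # lips closed where they should be
print(len(unchecked_items(review)))
```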
Evaluating For Full-closure Bilabials: m/b/p
Phoneme | Graphemes | Word Examples | Voiced? |
b | b, bb | bug, rubble | yes |
m | m, mm, mb, mn, lm | man, summer, comb, column, palm | yes |
p | p, pp | pin, dippy | no |
As a recap, m/b/p is characterized by the meeting of the top and bottom lips to form a mouth closure. Formally, this is known as a full-closure bilabial.
m/b/p Rules
Regardless of various contexts like emotional state, coarticulation, speech rate, volume, etc. - there is one thing that must (see NOTE 1) be true for m/b/p representation:
The lips must be closed. In general, the lips should be fully or almost fully closed for m/b/p sounds. Within this viseme group, m is a single-state phoneme and can produce continuous sound while the lips remain closed. b and p, on the other hand, require a two-state process: closed to open. While the lips should be closed for m/b/p, for m they may remain closed until the sound ceases; for b and p, the lips close first, then the sound is released during the opening.
NOTE 1: While these “truths” and “untruths” are generally robust, there are almost always exceptions.
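The single-state vs. two-state distinction can be sketched as a check over the frames that span one phoneme. Everything here is an assumption made for illustration: the function name, the per-frame lip aperture values (0 = sealed), and the threshold that treats “almost fully closed” as closed.

```python
CLOSED_THRESHOLD = 0.05  # assumption: apertures at or below this count as closed


def mbp_closure_ok(phoneme, apertures):
    """Check the closure pattern for one m, b, or p.

    apertures: lip opening per frame (0 = sealed) across the phoneme.
    """
    closed = [a <= CLOSED_THRESHOLD for a in apertures]
    if phoneme == "m":
        # Single state: the lips just need to be closed at some point.
        return any(closed)
    if phoneme in ("b", "p"):
        # Two states: a closed frame followed by an open (release) frame.
        if True not in closed:
            return False
        first_closed = closed.index(True)
        return not all(closed[first_closed + 1:])
    raise ValueError(f"not a full-closure bilabial: {phoneme!r}")


print(mbp_closure_ok("m", [0.0, 0.02, 0.0]))  # closed throughout
print(mbp_closure_ok("b", [0.3, 0.0, 0.4]))   # closes, then releases
print(mbp_closure_ok("p", [0.3, 0.2, 0.4]))   # never closes
```

As NOTE 1 says, exceptions exist, so a failed check is a prompt to look closer rather than a verdict.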
Evaluating Lip Rounding: w/r (& oo)
Phoneme | Graphemes | Word Examples | Voiced? |
w | w, wh, u, o | wit, why, quick, choir NOTE: “Quick” is phonetically written as: kwɪk and “choir” is phonetically written as: ˈkwaɪəɹ | yes |
r | r, rr, wr, rh | run, carrot, wrench, rhyme | yes |
u: (AKA “oo”) | o, oo, ew, ue, u_e, oe, ough, ui, oeu, ou NOTE: The “_” in “u_e” indicates that there is another letter in between the u and the e, e.g. flute. | who, loon, dew, blue, flute, shoe, through, fruit, manoeuvre, group | yes |
At a basic level, w/r/oo is characterized by lip rounding with a small lip opening. The lip opening can vary somewhat in size depending on how tense the lips are; however, the opening tends to remain quite small. At the smallest level, it can be difficult to even see that the lips have an opening. The jaw must be open to some degree.
w/r/oo Rules
Regardless of various contexts like emotional state, coarticulation, speech rate, volume, etc. - there is one thing that must (see NOTE 1) be true for w/r/oo representation:
The lips MUST NOT be fully closed. Fully closed lips make w/r/oo production impossible. (Air needs to be able to escape!) However, there will be cases where the lip opening is so slight that the lips appear closed. As long as it is not 100% clear that the lips are fully sealed, ambiguous opening can be acceptable.
The lips are most natural when rounded. Though the degree of rounding is variable, relative to other sounds (e.g. eh, ih, ee), w/r/oo tends to draw the lip corners toward the midline of the face and the upper and lower lips toward each other.
The lips MUST NOT be significantly open (vertically + horizontally). The w/r/oo shape is marked by its constrictive nature. If the jaw and lip openings are too wide and/or tall, then the shape will not be readable as w/r/oo.
An important note about r:
It really matters what position the r is taking in a word.
When r occurs at the beginning of a word - e.g. “rest” or “red,” the r is significantly rounded.
When r occurs at the end of a word or syllable - e.g. “mother” or “father,” the r is much more relaxed and undistinguished.
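The two MUST NOT rules for w/r/oo can be sketched as a small check. The function name, the normalized height/width measurements, and both limits are assumptions. `lips_clearly_sealed` should be True only when it is 100% clear the lips are sealed, so an ambiguous opening passes.

```python
def w_r_oo_violations(lips_clearly_sealed, opening_height, opening_width,
                      max_height=0.3, max_width=0.5):
    """Return the w/r/oo rules a frame breaks (empty list = none).

    Heights and widths are on a hypothetical 0..1 scale; the limits
    are illustrative, not calibrated.
    """
    violations = []
    if lips_clearly_sealed:
        violations.append("lips fully closed")
    if opening_height > max_height or opening_width > max_width:
        violations.append("opening too tall and/or wide")
    return violations


print(w_r_oo_violations(False, 0.05, 0.20))  # small, constricted opening
print(w_r_oo_violations(False, 0.60, 0.70))  # reads as a wide-open vowel
```

Because a word-final r is much more relaxed than a word-initial one, looser limits would likely be appropriate for r at the end of a word or syllable.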
Evaluating Lip-Tooth Interaction for f/v
Phoneme | Graphemes | Word Examples | Voiced? |
f | f, ff, ph, gh, lf, ft | fat, cliff, phone, enough, half, often | no |
v | v, f, ph, ve | vine, of, Stephen, five | yes |
In general, f/v sounds require a small mouth opening to create a tight airflow. This is typically caused by the lower lip meeting the upper teeth. Sometimes you might not see the teeth easily due to factors such as lip size, enunciation, whether the teeth are pressing against the inner lip (inside the mouth) or the visible lip, etc. The jaw generally needs to be open to some degree in order to allow the teeth to separate and the lower lip to make contact with the upper teeth.
f/v Rules
Regardless of various contexts like emotional state, coarticulation, speech rate, volume, etc. - there are a few things that must not be true for f/v representation:
The lower teeth MUST NOT show. Due to the interaction of the lower lip with the upper teeth, lower teeth visibility is highly unlikely. If the lower teeth show, it will cause confusion in readability.
NOTE: In the extremely unlikely scenario that the lip-to-teeth meet is reversed, i.e. the upper lip touches the bottom teeth, lower teeth show is possible. Though it is possible, it is best to assume this will never happen unless there is a rare case like missing upper teeth.
There MUST NOT be a visible gap between the upper teeth and the lower lip. If a gap exists between the upper teeth and bottom lip, this means there is not enough constriction and contact to create a fricative sound.
The lips MUST NOT be fully closed. Fully closed lips make f/v production impossible. (Air needs to be able to escape!) However, there will be cases where the lip opening is so slight that the lips appear closed. As long as it is not 100% clear that the lips are fully sealed, ambiguous opening can be acceptable.
The jaw MUST NOT be significantly open. Though it is difficult to draw the line at what might be considered “significantly open” (especially due to individual differences), this point is still necessary to keep in mind. The more the jaw opens past the minimum teeth separation point (~tongue-tip distance), the more difficult it is to produce a recognizable f/v.
The jaw is typically NOT fully closed. It is technically possible to produce f/v with a closed jaw, but a closed jaw will degrade the recognizability of f/v. Outside of clenched teeth, assume the jaw is slightly open.
NOTE: Due to lip occlusion, it is unlikely that we are able to visually determine whether the jaw is slightly open or fully closed for f/v’s; so, while this point is helpful to be aware of, in practice it is not likely to be an applicable metric.
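The f/v rules lend themselves to a per-frame record plus a function that lists which rules are broken. This is a sketch under stated assumptions: the class, field names, jaw scale, and jaw limit are all hypothetical, and the “jaw typically not fully closed” point is left out because, as noted above, it usually cannot be judged visually.

```python
from dataclasses import dataclass


@dataclass
class FVObservation:
    lower_teeth_visible: bool
    gap_upper_teeth_to_lower_lip: bool
    lips_clearly_sealed: bool
    jaw_opening: float  # hypothetical scale: 0 = closed, 1 = fully open


def fv_violations(obs, max_jaw_opening=0.4):
    """Return the f/v MUST NOT rules that an observation breaks."""
    violations = []
    if obs.lower_teeth_visible:
        violations.append("lower teeth show")
    if obs.gap_upper_teeth_to_lower_lip:
        violations.append("gap between upper teeth and lower lip")
    if obs.lips_clearly_sealed:
        violations.append("lips fully closed")
    if obs.jaw_opening > max_jaw_opening:  # illustrative limit
        violations.append("jaw significantly open")
    return violations


good = FVObservation(False, False, False, 0.15)
bad = FVObservation(True, True, False, 0.7)
print(fv_violations(good))
print(fv_violations(bad))
```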
Evaluating Nearly-Closed Teeth: s/z
Phoneme | Graphemes | Word Examples | Voiced? |
s | s, ss, c, sc, ps, st, ce, se | sit, less, circle, scene, psycho, listen, pace, course | no |
z | z, zz, s, ss, x, ze, se | zed, buzz, his, scissors, xylophone, craze | yes |
The prototypical s/z shape involves slight tension in the lip corners (AKA dimpler), lip separation, and gently touching or gently separated teeth; however, the lip corner tension is variable and may or may not be present.
Due to the tongue placements for s/z, the more you separate your teeth, the more difficult it becomes to produce s/z in a recognizable manner. Because of this difficulty, teeth separation can be a reliable metric for assessing s/z believability.
s/z Rules
Regardless of various contexts like emotional state, coarticulation, speech rate, volume, etc. - there are a few things that must not be true for s/z representation:
The lips MUST NOT be fully closed. Fully closed lips make s/z production impossible. (Air needs to be able to escape!)
The jaw MUST NOT be significantly open. Though it is difficult to draw the line at what might be considered “significantly open” (especially due to individual differences), this point is still necessary to keep in mind. The more the jaw opens, the more difficult it is to produce a recognizable s/z.
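Since teeth separation is described as a reliable metric for s/z, one way to apply it is to scan the s/z frames of a shot and flag the ones that break either rule. This is only a sketch: the function name, the tuple layout, the 0..1 separation scale, and the limit are assumptions, and teeth separation is used as a stand-in for jaw opening.

```python
def flag_sz_frames(frames, max_teeth_separation=0.2):
    """Return (frame_index, reason) for each s/z frame that breaks a rule.

    frames: list of (frame_index, lips_clearly_sealed, teeth_separation),
    with teeth_separation on a hypothetical 0..1 scale.
    """
    flagged = []
    for frame_index, lips_clearly_sealed, teeth_separation in frames:
        if lips_clearly_sealed:
            flagged.append((frame_index, "lips fully closed"))
        elif teeth_separation > max_teeth_separation:  # illustrative limit
            flagged.append((frame_index, "jaw significantly open"))
    return flagged


# Hypothetical measurements for three s/z frames in a shot.
sz_frames = [(101, False, 0.05), (214, False, 0.45), (330, True, 0.0)]
print(flag_sz_frames(sz_frames))
```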
Vowels Overview
A vowel is a speech sound that, unlike a consonant, does not require closure, turbulence, or constriction. The vocal tract is relatively unrestricted during vowel production, and vowels are always voiced.
vowel: a voiced and unrestricted speech sound.
rounded: a type of vowel that requires the lips to contract into a rounded shape.
relaxed: a vowel that does not take on a specific lip formation and instead leaves the lips relaxed and indistinct.
wide: a vowel that extends the distance between lip corners and widens the lips.
monophthong: a vowel that has a single perceived sound.
diphthong: a single syllable vowel made up of two vowel sounds gliding into each other.
Due to their lack of restriction, vowels often do not have very distinct shapes. Because of this and because of their variability, we will not be grouping them into viseme categories. Instead, we will focus on selecting vowels phoneme by phoneme.
Vowels are typically categorized by features such as:
tongue height
tongue backness
lip rounding
tenseness
For our purposes, we will be focusing on our own categories:
rounded: lips take on a rounded shape; may be a tightly rounded shape like w/r/oo, a loosely rounded shape like ouh (ʊ), or anything in between.
relaxed: lips do not take on any particular shape and remain relaxed.
wide (corner-pinched): lips extend in width (increase distance from lip corner to lip corner) usually due to a lip corner puller (smile) or dimpler.
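These three categories can be captured as a small lookup. The mapping is a sketch that includes only vowel labels this guide itself associates with a lip shape (oo, ouh, and aw with rounding; eh, ih, and ee with stretching or pinching). No vowel is assigned to “relaxed” because the guide does not name one, and the class and function names are hypothetical.

```python
from enum import Enum


class LipCategory(Enum):
    ROUNDED = "rounded"
    RELAXED = "relaxed"
    WIDE = "wide"


# Only labels the guide mentions; anything else is left unmapped on purpose.
VOWEL_CATEGORY = {
    "oo": LipCategory.ROUNDED,
    "ouh": LipCategory.ROUNDED,
    "aw": LipCategory.ROUNDED,  # listed under ROUNDED in the CORCTET chart
    "eh": LipCategory.WIDE,
    "ih": LipCategory.WIDE,
    "ee": LipCategory.WIDE,
}


def lip_category(vowel):
    """Return the vowel's LipCategory, or None if the guide does not say."""
    return VOWEL_CATEGORY.get(vowel)


print(lip_category("ouh"))
print(lip_category("uh"))  # not covered here, so None
```

Returning None for unmapped vowels keeps the lookup from guessing, which fits the guide's approach of evaluating vowels phoneme by phoneme.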
Rules for ouh
The lips MUST NOT be fully closed. Fully closed lips make ouh production impossible.
The lips are most natural when semi-rounded. Though the rounding for ouh may be subtle in everyday speech, when the lips are slightly rounded, the ouh sound is most recognizable.
The jaw MUST NOT be fully closed.