Hello,

In the world of gaming, realistic human characters are essential, and compelling facial animation plays a crucial role in creating a sense of realism. The field has made significant strides with technologies like Apple's ARKit and Google's MediaPipe, which can map a real face onto 52 parameters known as blend shapes.

Here's an example (NOT MINE) of how this technology can work (this tech is mainly used by VTubers):

A human face can be represented as a vector of blend shape values, each ranging from 0 to 1. Since facial expressions are continuous, a face changing over time can be described as a path through this 52-dimensional blend shape space. Natural faces only occupy a small region of that space, because many combinations of blend shape values produce unnatural expressions.
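
To make the representation concrete, here is a minimal sketch (assuming NumPy; the blend shape indices are purely illustrative, not the real ARKit ordering) of a face as a 52-dimensional vector and an expression change as a path through that space:

```python
# Minimal sketch: a face as a 52-dim blend shape vector, a change over time as a path.
import numpy as np

NUM_BLEND_SHAPES = 52  # ARKit / MediaPipe blend shape count

# Hypothetical keyframes: a neutral face and a smiling face.
neutral = np.zeros(NUM_BLEND_SHAPES)
smile = np.zeros(NUM_BLEND_SHAPES)
smile[[44, 45]] = 0.8  # e.g. mouth-smile-left / right (illustrative indices only)

# A path is a (T, 52) array: here, a simple linear blend over 30 frames.
T = 30
t = np.linspace(0.0, 1.0, T)[:, None]   # (T, 1) interpolation factor
path = (1.0 - t) * neutral + t * smile  # (T, 52), each row is one face
path = np.clip(path, 0.0, 1.0)          # blend shape values stay in [0, 1]
```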

My objectives are twofold:

  • Lip sync from text: achievable, but it takes substantial effort to construct each phoneme's intensity curve over time, given the 1:1 correspondence between phonemes and blend shapes (see the sketch after this list).
  • Procedural animations: This goal is challenging and requires extensive research.
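
Here is a rough sketch of the lip sync idea, assuming a phoneme timeline is already available (e.g. from a TTS engine) and using a tiny hand-made phoneme-to-blend-shape map with made-up indices:

```python
# Rough sketch: turn a phoneme timeline into per-frame blend shape intensities.
import numpy as np

FPS = 30
NUM_BLEND_SHAPES = 52
# Hypothetical mapping; real ARKit names include jawOpen, mouthPucker, mouthClose, ...
PHONEME_TO_INDEX = {"MM": 27, "AA": 25, "OO": 38}  # illustrative indices only

def lipsync_path(phoneme_timeline, duration):
    """phoneme_timeline: list of (phoneme, start_sec, end_sec) tuples."""
    frames = int(duration * FPS)
    path = np.zeros((frames, NUM_BLEND_SHAPES))
    for phoneme, start, end in phoneme_timeline:
        idx = PHONEME_TO_INDEX[phoneme]
        f0, f1 = int(start * FPS), min(int(end * FPS), frames)
        if f1 - f0 < 2:
            continue
        # Triangular intensity envelope: ramp up to 1, then back down, over the phoneme.
        envelope = 1.0 - np.abs(np.linspace(-1.0, 1.0, f1 - f0))
        path[f0:f1, idx] = np.maximum(path[f0:f1, idx], envelope)
    return path

# Example: "ma" -> MM then AA over half a second.
path = lipsync_path([("MM", 0.0, 0.15), ("AA", 0.15, 0.5)], duration=0.5)
```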

My plan is to:

  • Develop an emotion path dataset
  • Use a periodic autoencoder to generate paths
  • Record and blend multiple paths procedurally (sketched below)
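
The blending step could be as simple as a clipped weighted sum of recorded paths. A sketch, assuming every path is a (frames, 52) array sampled at the same frame rate (resampling omitted):

```python
# Sketch: procedurally layer several recorded blend shape paths on top of each other.
import numpy as np

def blend_paths(paths, weights):
    """Weighted sum of (frames, 52) paths, clipped back into the valid [0, 1] range."""
    frames = min(p.shape[0] for p in paths)  # truncate to the shortest path
    mixed = sum(w * p[:frames] for p, w in zip(paths, weights))
    return np.clip(mixed, 0.0, 1.0)

# Example: layer a lip sync path on top of a recorded "happy" emotion path.
# lipsync = lipsync_path(...)              # from the earlier sketch
# happy = np.load("happy_path.npy")        # hypothetical recorded emotion path
# face_over_time = blend_paths([lipsync, happy], weights=[1.0, 0.6])
```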

I have already achieved the following:

  • Applied a blend shape array to an actual face in Godot by driving the blend shapes of the MeshInstance3D's mesh.
  • Created a Python tool that extracts blend shapes from a video by processing each frame, after an unsuccessful attempt with VMC (its framerate handling was unsatisfactory).

Here's an example of a VRoid model with blend shapes in Godot:

The Python extractor that transforms a video into blend shape values over time:
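
For anyone curious, the core loop of such an extractor can look roughly like this with the MediaPipe Face Landmarker task in video mode (a simplified sketch, not my exact tool; "face_landmarker.task" is the model file downloaded from MediaPipe, and "input.mp4" / "blend_shapes.npy" are placeholder paths):

```python
# Sketch: extract 52 blend shape scores per frame from a video with MediaPipe.
import cv2
import numpy as np
import mediapipe as mp
from mediapipe.tasks import python as mp_python
from mediapipe.tasks.python import vision

options = vision.FaceLandmarkerOptions(
    base_options=mp_python.BaseOptions(model_asset_path="face_landmarker.task"),
    running_mode=vision.RunningMode.VIDEO,
    output_face_blendshapes=True,
    num_faces=1,
)
landmarker = vision.FaceLandmarker.create_from_options(options)

cap = cv2.VideoCapture("input.mp4")
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
rows, frame_idx = [], 0
while True:
    ok, bgr = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
    mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb)
    result = landmarker.detect_for_video(mp_image, int(frame_idx / fps * 1000))
    if result.face_blendshapes:
        # 52 Category objects per face, each with a name and a score in [0, 1].
        rows.append([c.score for c in result.face_blendshapes[0]])
    frame_idx += 1
cap.release()

path = np.array(rows)  # shape (frames_with_a_face, 52): a path in blend shape space
np.save("blend_shapes.npy", path)
```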

Please feel free to share any ideas on creating high-quality data, efficiently animating the MeshInstance3D, or any other useful insights.