Animaj introduces the second phase of its AI tool “Sketch-to-Motion”: Motion-In-Betweening.
This breakthrough technology accelerates the animation process by automating in-between frames, allowing animators to focus on creative refinement rather than repetitive work—bringing high-quality, style-consistent animation to life faster than ever.
The Motivation
In our last blog post we introduced our sketch-to-pose model, which generates animation key poses from sketches.
The next challenge is filling the frames between these key poses to create a complete animation that stays true to the artistic style of the IP and the unique motion characteristics of each character.
This style includes complex motion elements that simple interpolation methods in Maya (step, linear, spline) cannot reproduce, such as:
- Breakdown poses that define key movement patterns.
- Motion spacing, with ease-in and ease-out effects to control acceleration and deceleration.
- Global rhythm and timing, ensuring the motion follows the intended pacing.
- Additional details, like overshoot and anticipation, which add liveliness to the animation.
These design choices are unique to each IP and can be learned from existing data using a deep learning model, allowing for automated in-betweening that faithfully reproduces the intended animation style of each character and IP.
The Data
Our model is trained with supervised learning, which requires paired inputs and known targets. We frame the problem as masked sequence completion: some frames of a motion sequence are known, and our AI model must fill in the others to produce the full motion.
Our dataset therefore pairs the known keyframes, which we call block keyframes, with the full target sequence, so the model learns to fill in the missing intermediate frames and generate a complete animation.
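As a minimal sketch of this framing, here is how a masked training sample could be assembled from a full target sequence, assuming NumPy arrays; the function and argument names are ours for illustration, not Animaj's pipeline code:

```python
import numpy as np

def make_training_sample(target: np.ndarray, key_frames: list[int]):
    """Build a masked model input from a full target sequence.

    target: (frames, pose_dim) array of rig controller values.
    key_frames: indices of the block keyframes an animator would set.
    Returns the masked input, the known-frame mask, and the target.
    """
    mask = np.zeros((len(target), 1), dtype=np.float32)
    mask[key_frames] = 1.0            # 1 = known block keyframe
    masked_input = target * mask      # unknown frames zeroed out
    return masked_input, mask, target
```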
Our dataset is built from the first four seasons of Pocoyo, comprising 298 episodes and over 770,000 frames.
To create the training dataset, we cut the full motion sequences into fixed-length windows; at test time, the sequence lengths can vary. The input is a sequence of 3D rig poses with missing (masked) frames, and the model predicts the corresponding complete motion sequence.
Each 3D rig pose is represented as a numerical vector encoding skeletal controller values, such as head rotation, hand position, and elbow orientation. The model, a neural network, learns to generate smooth, full-body motion by predicting the in-between values that reconstruct the complete animation from the partially observed poses.
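To illustrate this representation, the sketch below flattens a hypothetical set of rig controllers into a pose vector and cuts a full sequence into fixed-length windows; the controller names, dimensions, window length, and stride are placeholder assumptions:

```python
import numpy as np

# Hypothetical controller layout; real Pocoyo rigs expose their own set.
CONTROLLERS = [
    ("head_rotation", 3),    # Euler angles
    ("hand_L_position", 3),  # XYZ in character space
    ("elbow_L_orientation", 3),
]

def pose_to_vector(pose: dict) -> np.ndarray:
    """Flatten one frame's controller values into a fixed-order vector."""
    parts = [np.asarray(pose[name], dtype=np.float32) for name, _ in CONTROLLERS]
    return np.concatenate(parts)  # here: shape (9,)

def make_windows(sequence: np.ndarray, length: int = 60, stride: int = 30):
    """Cut a (frames, pose_dim) sequence into fixed-length training windows."""
    return [sequence[i:i + length]
            for i in range(0, len(sequence) - length + 1, stride)]
```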
The Model
The model is based on a bidirectional Long Short-Term Memory (LSTM) network, a type of recurrent neural network (RNN) designed for sequential data. Unlike standard neural networks, RNNs have memory, allowing them to learn patterns over time.
LSTMs improve on traditional RNNs by handling long-term dependencies more effectively. They use a gating mechanism to control the flow of information, deciding what to keep or discard. This helps maintain relevant motion details over longer sequences, reducing errors caused by forgetting past frames.
A bidirectional LSTM goes further by processing the sequence in both forward and backward directions. This is ideal for motion in-betweening, as missing frames depend on both past and future keyframes.
The model predicts a pose for every frame, producing transitions that preserve the natural movement patterns and style present in the data. Because the architecture is lightweight, inference is fast enough for real-time generation.
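A minimal PyTorch sketch of such a bidirectional LSTM in-betweener is shown below. The pose dimension, hidden size, and the choice to condition on a per-frame mask channel are illustrative assumptions, not the production architecture:

```python
import torch
import torch.nn as nn

class InbetweenBiLSTM(nn.Module):
    """Bidirectional LSTM mapping a masked pose sequence to a full one."""

    def __init__(self, pose_dim: int, hidden_dim: int = 512, num_layers: int = 2):
        super().__init__()
        # +1 input channel for the per-frame mask flag (1 = known keyframe)
        self.lstm = nn.LSTM(
            input_size=pose_dim + 1,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
        )
        # Concatenated forward/backward states -> one pose per frame
        self.head = nn.Linear(2 * hidden_dim, pose_dim)

    def forward(self, poses: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # poses: (batch, frames, pose_dim) with unknown frames zeroed
        # mask:  (batch, frames, 1), 1 for known block keyframes
        h, _ = self.lstm(torch.cat([poses, mask], dim=-1))
        return self.head(h)

model = InbetweenBiLSTM(pose_dim=128)
poses = torch.zeros(1, 60, 128)    # a 60-frame window
mask = torch.zeros(1, 60, 1)
mask[:, [0, 20, 40, 59]] = 1.0     # four known block keyframes
full_motion = model(poses, mask)   # (1, 60, 128)
```

Feeding the mask as an extra input channel tells the network which frames are trusted keyframes and which it must synthesize, one common way to cast in-betweening as masked sequence completion.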

The Metric
To assess the success of our model, we designed a dedicated test metric.
We developed a heuristic to identify the parts of the full sequences that contain key motions, such as taking a step while walking or raising an arm to lift a hand.
Our metric then measures the difference between the predicted key pose and the known target pose.
This lets us verify that the model captures and reproduces these key movements, which drive the overall quality of the generated motion sequences.
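As a sketch of how such a metric could work, the snippet below flags frames where overall motion speed hits a local minimum, a stand-in heuristic for key motions such as a foot plant or a raised hand reaching its apex (the production heuristic is not spelled out here), and scores the pose error on exactly those frames:

```python
import numpy as np

def key_pose_error(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean L2 pose error on frames flagged as key poses.

    pred, target: (frames, pose_dim) controller-value sequences.
    """
    # per-frame motion speed of the ground-truth sequence
    speed = np.linalg.norm(np.diff(target, axis=0), axis=1)
    # local minima of speed mark the "hits" of key poses
    keys = [t for t in range(1, len(speed) - 1)
            if speed[t] < speed[t - 1] and speed[t] < speed[t + 1]]
    if not keys:
        return 0.0
    return float(np.mean(np.linalg.norm(pred[keys] - target[keys], axis=1)))
```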
The Experiments
We evaluated and refined the model using this quantitative metric combined with a qualitative analysis.
Several aspects of the system were iterated upon:
- Model selection: Different architectures were tested, including LSTM, Transformer, and Diffusion models.
- Loss functions: Various weighting strategies were explored, adjusting loss weights based on controller type and frame importance. Additional losses on speed and acceleration were introduced to better capture motion dynamics (see the sketch after this list).
- Optimization: Multiple optimization strategies were tested, with a cyclic learning rate proving the most effective.
- Data handling: Different masking strategies were experimented with to improve generalization and performance.
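As an example of the loss design, the speed and acceleration terms can be written as finite differences over the predicted and target sequences; the weights below are illustrative placeholders, not production values:

```python
import torch
import torch.nn.functional as F

def motion_loss(pred: torch.Tensor, target: torch.Tensor,
                w_pos: float = 1.0, w_vel: float = 0.5,
                w_acc: float = 0.25) -> torch.Tensor:
    """Position loss plus finite-difference speed and acceleration terms.

    pred, target: (batch, frames, pose_dim).
    """
    pos = F.l1_loss(pred, target)
    # first difference approximates per-frame speed
    vel = F.l1_loss(pred[:, 1:] - pred[:, :-1],
                    target[:, 1:] - target[:, :-1])
    # second difference approximates acceleration
    acc = F.l1_loss(pred[:, 2:] - 2 * pred[:, 1:-1] + pred[:, :-2],
                    target[:, 2:] - 2 * target[:, 1:-1] + target[:, :-2])
    return w_pos * pos + w_vel * vel + w_acc * acc
```

For the optimization side, PyTorch provides a ready-made cyclic schedule in torch.optim.lr_scheduler.CyclicLR.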
The Results
Comparison of Maya spline interpolation and Animaj's AI-driven Motion In-Betweening: our deep learning model preserves the character's unique rhythm and spacing, produces correct breakdown poses in the walk cycle, and eliminates self-penetration issues.
Productivity gains
A 3D animator was tasked with in-betweening a 3D scene with block keyframes using two different methods.
In the first approach, the entire in-betweening was done manually, without using our model. In the second, the in-betweening model generated the intermediate frames, and the animator refined the predictions.
With the traditional workflow, the animator spent 339 minutes completing the in-betweening, whereas refining the model's output took only 113 minutes, a 67% reduction in time spent ((339 - 113) / 339 ≈ 0.67).
This is a significant saving on a task that is typically labor-intensive and filled with repetitive, non-creative work. With the model handling the bulk of the in-betweening, the animator can focus on refining details and enhancing the expressivity of the motion rather than spending hours on routine pose interpolation.

Conclusion
With our Motion In-Betweening deep learning model, we've taken a step toward making animation production more efficient while staying true to the artistic intent of each project.
Traditional in-betweening is a time-consuming process, requiring careful attention to timing, spacing, and motion style. By automating this step while preserving the unique characteristics of the animation, our model helps animators focus on creative decisions rather than repetitive work.
As AI-driven tools advance, they open new possibilities for producing complex, stylized animations more efficiently, accelerating creative experimentation and enhancing artistic expression.