January 4, 2026 · 8 min read

Kling 2.6 Review: Can One-Click Video and Audio Generation Truly End the "Silent AI Film Era"?

Kling recently released its new 2.6 model. Earlier AI‑generated videos were usually “silent films” that required post‑production dubbing; Kling 2.6 now generates video and audio (dialogue, ambient sound, sound effects) in a single pass.

The 2.6 model includes built‑in default voices, and it also allows users to upload their own audio to create custom voices.

Users can name an uploaded audio file and then reference that voice with @ in the video prompt, so the character speaks with it directly in the generated video.

Below is my in‑depth evaluation of the new version.

What’s the biggest upgrade in Kling 2.6?

  • Native Audio‑Video Synchronization: No more silent videos — the model can now generate video and audio together in one pass.
  • 1080p HD Output: Supports native 1080p resolution.
  • Pro API (Artlist): Provides API support aimed at professional filmmakers.
  • Improved Character Consistency: Better consistency of characters across different shots.

Core Technical Breakthrough: Native Audio‑Video Generation

Kling 2.6 is no longer just a visual generator. It integrates the Kling‑Foley model, enabling millisecond‑level alignment between sound effects and visuals. This allows it to generate visuals, natural voice acting, sound effects, and ambient audio all at once.

Users can specify environmental sounds (such as city street noise) and background music (like a soft piano track) directly through the prompt.

  • Audio‑Visual Unity: Solves the long‑standing issue of AI videos having mismatched lip‑sync or requiring manual dubbing afterward.
  • Semantic Understanding: The model interprets the emotions and narrative in the prompt so that the rhythm of the audio follows the character’s movements.

In the feature panel, users can upload an image, @ a voice character, and the character in the generated video will speak according to the text in the prompt.

Key New Feature: Voice Cloning and Control

This is the most practical feature for creators:

Voice Customization:

  • Users can upload 5–30 seconds of audio or video to clone a specific person’s voice.

Character Binding:

  • By using @CharacterName in the prompt, the system locks the voice to that character, ensuring consistent vocal identity across different videos.

Cross‑Language Performance:

  • Supports Chinese–English cross‑language speech while maintaining the same voice timbre.
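To make the voice workflow above more concrete, here is a minimal Python sketch of how a prompt with @ voice tags might be assembled. Only the @CharacterName tag and the 5–30 second reference length come from this review; the function names, the voice name "Anna", and the 'says: "..."' wording are my own assumptions, not Kling's documented prompt format.

```python
def reference_clip_ok(duration_seconds):
    """Check a voice reference against the 5-30 second range quoted above."""
    return 5 <= duration_seconds <= 30


def build_prompt(scene, dialogue):
    """Compose one prompt string that tags each spoken line with an @voice name.

    scene: plain-text description of the shot.
    dialogue: list of (voice_name, line) pairs in speaking order.
    """
    parts = [scene.strip()]
    for voice_name, line in dialogue:
        # "@Name" is the binding described in the review; the surrounding
        # 'says: "..."' phrasing is an illustrative guess (hypothetical).
        parts.append(f'@{voice_name} says: "{line.strip()}"')
    return " ".join(parts)


prompt = build_prompt(
    "A news anchor at a desk, soft studio lighting.",
    [("Anna", "Good evening, and welcome back.")],
)
print(prompt)
```

Keeping the voice name in one place like this makes it easier to reuse the same @ tag across several prompts, which is the point of character binding.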

Full Coverage of Application Scenarios

The model supports multiple types of audio, greatly expanding creative possibilities:

Character Dialogue:

  • Supports single‑person monologues and multi‑character conversations, with naturally synchronized lip movements.

Musical Performance:

  • Can generate videos of singing, rap, and even instrument playing.

Environmental Sound & ASMR:

  • Accurately simulates detailed sound effects such as glass breaking, crackling fire, and whispering.

Commercial Creativity:

  • Suitable for product explainers, e‑commerce livestreams, short film production, and other professional use cases.

Motion Control

Another major selling point of Kling 2.6 is Motion Control, which is designed to replicate the actions, expressions, and gestures from a reference video while keeping lip‑sync aligned.

Full‑Body Motion Cloning:

  • Supports capturing detailed body movements such as dancing or martial arts.

Expression Mapping:

  • Produces highly realistic facial expressions that accurately reflect the emotions in the reference video.

High Stability:

  • Compared to previous versions, 2.6 significantly reduces visual jitter and artifacts, with more natural background blending.

However, the model is highly dependent on the quality of the reference video. If the reference motion is unclear, the generated output may still show drifting or instability.

Additionally, Motion Control is expensive: a 5‑second video can cost around $3, or roughly $0.60 per second.

Video Quality

Unlike older versions that generated videos frame‑by‑frame, Kling 2.6 introduces “structured reasoning.”

It can track character identity, clothing, and props across the entire timeline, ensuring that the background environment and physical logic remain consistent throughout a 5–10 second shot.

Kling 2.6 handles fabric folds, hair dynamics, and complex body gait with far more realistic physics, significantly reducing the common AI issue of “drifting.”

The model also uses a unified multimodal memory system, ensuring that during complex camera movements (such as 360‑degree rotations or push‑pull shots), the character’s facial features and clothing remain highly consistent.

Lighting Stability:

The upgraded lighting logic ensures that shadow changes follow physical rules, reducing the “flickering” often seen in AI‑generated videos.

In Image‑to‑Video mode, Kling 2.6 analyzes human motion with exceptional accuracy, especially in the naturalness of facial expressions — reaching industry‑leading performance.

Pricing

Kling 2.6 supports both text‑to‑audio‑video and image‑to‑audio‑video generation.

You simply include the dialogue content directly in the prompt.

Pricing Structure:

  • Native audio enabled: 10 credits per second (Pro mode)
  • Voice Control enabled: +2 credits per second (free and unlimited for subscribers)
  • Audio disabled: 3–5 credits per second
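As a rough illustration of the rates above, the following Python sketch estimates the credit cost of one clip. The per-second rates are the ones listed in this review; the function name and the rule that Voice Control only applies when native audio is on are my assumptions.

```python
def estimate_credits(seconds, native_audio=True, voice_control=False, subscriber=False):
    """Return a (low, high) credit estimate for a single clip."""
    if native_audio:
        # Native audio: flat 10 credits per second (Pro mode).
        low = high = 10 * seconds
        # Voice Control adds 2 credits per second unless the user subscribes.
        if voice_control and not subscriber:
            low += 2 * seconds
            high += 2 * seconds
    else:
        # Audio disabled: quoted as a 3-5 credits per second range.
        # Assumption: Voice Control has no effect without native audio.
        low, high = 3 * seconds, 5 * seconds
    return low, high


print(estimate_credits(10))                      # (100, 100)
print(estimate_credits(10, voice_control=True))  # (120, 120)
print(estimate_credits(5, native_audio=False))   # (15, 25)
```

By these numbers, a 10‑second clip with native audio costs at least twice as many credits as the same clip generated silently.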

Kling 2.6 — Notable Issues

“AI‑Generated Feel” Still Very Noticeable

Unnatural motion and facial stiffness: Sometimes the video still looks unnatural — for example, only the mouth moves while the facial muscles remain stiff, body movements appear awkward (stilted movement), or the face becomes distorted (face morphing).

Some characters have eyes spaced too far apart, creating an uncanny appearance.

During fast movements, knees, wrists, and arms may “wobble” or deform unnaturally.

Unnatural torso deformation: The waist and hips often show unrealistic distortion during motion.

Audio Issues

Lack of spatial realism

Sometimes the audio sounds “pasted on” rather than part of the environment. For example, football crowd noise should include stadium reverb, but instead sounds like it was recorded in a small room.

Ignored audio instructions

The model occasionally disregards user commands about environmental audio.

Lip‑sync inconsistencies

Lip movements sometimes feel slightly off, reducing realism.

Although “native lip‑sync” is advertised as a core feature, in practice the lip movements in 5‑second clips often stop too early, forcing users to generate 10‑second clips instead.

Language Differences

Chinese speech often sounds more natural than English. When users input Japanese or other languages, the system auto‑translates to English, reducing naturalness and quality. This creates real barriers for Japanese, French, Spanish, and other creators.

No Frame‑by‑Frame Keyframe Control

No precise timing control: Users cannot specify key actions like “the character turns their head at the 3‑second mark.”

Prompting feels more like giving high‑level instructions to a director rather than performing precise animation control, limiting professional‑grade production.

No Support for Uploading Background Music or Songs

Users cannot upload branded music or specific tracks. Everything must be described via prompts, making brand consistency difficult and preventing integration of specific soundtracks (e.g., corporate music, licensed scores).

Still Far from True AI Filmmaking

There are still many “awkward acting” moments, such as unnatural pauses.

Some users believe it may take around two more years before AI can produce fully acceptable films, with the key missing piece being a real‑time “editing studio” interface.

Server Overload & Queue Delays

Weeks after release, users still report videos stuck at 99%, waiting 4–8 hours or even 2–3 days. Image‑to‑Video is especially slow. This suggests Kuaishou’s server infrastructure cannot keep up with demand.

Controversy Over “Being First”

Although the company claims Kling 2.6 is the “first” native 1080p + audio model, Sora 2 and Veo 3.1 already offer similar capabilities. Not all “first” claims are accurate.

Expensive Pricing

High cost for long videos: Generating 10 minutes of HD video with native voice is extremely expensive, because credits are charged per second of output.

A 10‑second clip often takes 5–10 minutes to generate, which is a major bottleneck for fast‑paced ad production. The Premier plan costs about $92/month and still has credit limits. By comparison, Runway Gen‑4 offers unlimited generation for $95.
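For a sense of scale, here is a back-of-the-envelope Python sketch for a 10‑minute video. It uses the 10 credits per second native-audio rate and the 5–10 minute generation time per 10‑second clip quoted in this review. It assumes the video is stitched together from 10‑second clips generated one after another with no retries, which is my simplification and not something Kling specifies.

```python
CREDITS_PER_SECOND = 10     # native-audio rate quoted in this review
CLIP_SECONDS = 10           # longest single shot mentioned in this review
MINUTES_PER_CLIP = (5, 10)  # reported generation time per 10-second clip


def long_video_estimate(target_seconds):
    """Return (clips, credits, (min_wait_minutes, max_wait_minutes))."""
    clips = -(-target_seconds // CLIP_SECONDS)  # ceiling division
    credits = clips * CLIP_SECONDS * CREDITS_PER_SECOND
    wait_minutes = (clips * MINUTES_PER_CLIP[0], clips * MINUTES_PER_CLIP[1])
    return clips, credits, wait_minutes


clips, credits, wait = long_video_estimate(10 * 60)
print(clips, credits, wait)  # 60 6000 (300, 600)
```

Under these assumptions, 10 minutes of footage means 60 clips, 6,000 credits, and 5 to 10 hours of sequential generation time before any failed takes are counted.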

Auto‑Renewal Problems

Some users report that even after clearly disabling auto‑renewal, the system re‑enabled it and charged them without authorization. The company refused refunds, calling it a “default policy.”

One user (New_Technology6614) reported that after canceling auto‑renewal last year, Kling AI still charged him without permission.

The renewal price was $70 higher than the original subscription, with no notice of the price increase.

After the website redesign, many users found their cancellation status reset to “enabled.”

Even when the charge itself was disputed, customer service responded only with a template message saying “no refunds.”

The official Discord’s “billing support” channel refuses to handle billing issues and directs users to an unresponsive email. When users discuss these issues publicly in the channel, their messages are deleted and they are muted (timeout) or banned.

Some long‑term loyal users (grandfathered users) originally had credits valid for two years, but after upgrading or downgrading their plan, the credits became monthly‑expiring, with no clear warning from the company.

Even European users protected by the EU 14‑day refund policy were denied refunds.

Protection Tips:

  • Lock your credit card: strongly recommended to lock the card immediately after use.
  • Virtual credit cards: use Privacy or other virtual cards with spending limits.
  • Chargeback: many users believe filing a “fraudulent transaction chargeback” with the bank is the only way to recover the money.

Kling 2.6 vs. Veo 3.1

Both models currently support native audio‑video synchronized generation, meaning they can produce video along with matching speech and background sound at the same time.

Audio

Veo 3.1:

Delivers cleaner, more natural, and fuller audio performance. Its frequency spectrum is more complete, making it sound closer to real recording equipment.

Kling 2.6:

The audio quality of 2.6 has been criticized by many users. It sounds somewhat muffled, and sometimes even “underwater.” Some comments directly state that Kling’s voice quality is “really bad.”

Visuals and Prompt Execution: Mixed Results

I found that Kling 2.6 is more rule‑abiding and can execute complex prompt instructions more accurately.

Veo 3.1, on the other hand, tends to add its own “vibes” and details. While the visuals are more aesthetically pleasing, it sometimes deviates from the original instructions, improvising freely as long as the overall atmosphere matches.

(For example, in the video, the female host added many extra actions and expressions.)

Pricing Comparison

Kling 2.6 costs about $0.056 per second, while Veo 3.1 Fast costs only $0.0125 per second, so Kling is roughly 4.5 times more expensive per second of video.
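The gap is easy to check with a few lines of Python. The two per-second prices are the ones quoted above; the constant and function names are my own, and the sketch ignores subscriptions, credit bundles, and failed generations.

```python
KLING_26_USD_PER_SEC = 0.056      # price quoted in this review
VEO_31_FAST_USD_PER_SEC = 0.0125  # price quoted in this review


def clip_cost(usd_per_sec, seconds):
    """Cost in USD of one clip at a flat per-second rate."""
    return round(usd_per_sec * seconds, 4)


ratio = KLING_26_USD_PER_SEC / VEO_31_FAST_USD_PER_SEC

for seconds in (5, 10):
    kling = clip_cost(KLING_26_USD_PER_SEC, seconds)
    veo = clip_cost(VEO_31_FAST_USD_PER_SEC, seconds)
    print(f"{seconds}s clip: Kling 2.6 ${kling:.3f} vs Veo 3.1 Fast ${veo:.4f}")

print(f"ratio: {ratio:.2f}x")  # ratio: 4.48x
```

For a 10‑second clip, that works out to about $0.56 on Kling 2.6 versus about $0.125 on Veo 3.1 Fast.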

In‑Depth Interpretation and Outlook

The release of Kling 2.6 marks the end of the “silent AI video” era. Its native audio‑video synchronized generation and HD output are significant technological milestones.

However, the model still has clear shortcomings in audio quality, visual naturalness, and fine‑grained functional control. Users also frequently complain about its high pricing and controversial auto‑renewal policies.

Overall, Kling 2.6 demonstrates enormous potential for AI filmmaking, but to reach truly “professional‑grade” production standards, substantial improvements are still needed in technical stability