How to Write Prompts That Make AI Music Videos Look Good

What separates an AI music video prompt that works from one that produces slop in 2026: structure, references, negative prompts, iteration.

Kevin Gabeci

Most AI video prompts you see online are bad. Not because the people writing them are bad, but because the format that works for image generation does not transfer cleanly to video. Video has time, motion, and continuity built in, and the prompt needs to encode all three. A good 2026 prompt is closer to a one paragraph shot list than a string of adjectives.

This piece is the version of the prompt structure that I run when the goal is a usable shot in a music video, not a viral output to post on Twitter. The two are different. Music video shots have to belong next to other shots. Most of what makes a prompt good is making the shot consistent with the rest of the video, not making it impressive on its own.

The seven element prompt

Every shot prompt I write has the same seven elements, in roughly the same order. You can skip elements when the model defaults are fine, but you cannot reorder them without losing quality on most current models.

  1. Shot size
  2. Lens and camera move
  3. Lighting and time of day
  4. Style anchor (reference image or named aesthetic)
  5. Subject and action
  6. Motion direction
  7. Negative prompt

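Walk through them in order and you have a prompt the model can actually parse. The steps below cover each element except subject and action, which is one plain sentence saying who is doing what, and the last step covers iterating on the result.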
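If you keep shot prompts in a script or a spreadsheet export, the seven elements above can be sketched as a small Python structure. This is a minimal sketch, not any tool's API: the class name, the field names, and the "Negative:" prefix are my own assumptions for illustration. It renders the elements in the listed order and skips any you leave empty.

```python
from dataclasses import dataclass, fields


@dataclass
class ShotPrompt:
    """Hypothetical container for the seven elements, declared in prompt order."""
    shot_size: str = ""
    lens_and_move: str = ""
    lighting: str = ""
    style_anchor: str = ""
    subject_action: str = ""
    motion: str = ""
    negative: str = ""

    def render(self) -> str:
        """Join the non-empty elements as sentences, in declaration order."""
        parts = []
        for f in fields(self):
            value = getattr(self, f.name).strip().rstrip(".")
            if not value:
                continue  # skipped element: the model default is used
            if f.name == "negative":
                value = "Negative: " + value
            parts.append(value + ".")
        return " ".join(parts)


shot = ShotPrompt(
    shot_size="Medium shot",
    lens_and_move="50mm natural lens, slow push in",
    lighting="3am, neon glow, light rain on pavement",
    style_anchor="35mm film grain, faded color palette",
    subject_action="A woman in a red coat walks past a closed record store",
    motion="Hair lightly blowing, raindrops catching the neon",
    negative="no text, no watermark, no deformed hands, no blur",
)
print(shot.render())
```

Because the order lives in the class definition, you cannot accidentally reorder elements when you edit a single scene.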

Step 1: Decide the shot size

Open the prompt with the framing: wide, medium, or close up. This one choice does more than any other wording in the prompt because the model composes the rest of the shot around it.

Wide shot establishes the world. Use it for the first shot of a scene, an establishing beat, or any moment where the subject is small inside their environment.

Medium shot shows action. Subject from waist up, surroundings still visible. This is the workhorse of music videos. Most of your shots should be medium.

Close up shows emotion. Face, hands, an object. Use sparingly. A music video that is all close ups feels claustrophobic.

Step 2: Specify the lens and camera move

After framing, tell the model what the camera is doing. Two parts:

  • Focal length feel. “24mm wide angle” gives you distortion and a sense of scale. “50mm natural” reads as documentary. “85mm portrait” gives you compressed background and shallow depth of field.
  • Camera move. “Static” if the camera does not move. “Slow push in” if the camera moves toward the subject. “Dolly left” or “dolly right” for lateral motion. “Handheld” for slight wobble that reads as natural.

Models trained on film and TV data recognize these descriptors and render them. Vague camera language (“dynamic shot”, “cinematic angle”) does the opposite. It tells the model nothing, and you get whatever default it had loaded.
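If you draft many prompts, a quick check for vague camera language can catch this before you spend a generation on it. The sketch below is only an illustration: the function name is hypothetical, the term lists contain just the phrases mentioned above, and the matching is a naive substring test, so extend or tighten it for your own vocabulary.

```python
# Phrases taken from the text above; both lists are starting points, not a standard.
VAGUE_CAMERA_TERMS = ("dynamic shot", "cinematic angle")
CAMERA_MOVES = ("static", "slow push in", "dolly left", "dolly right", "handheld")


def camera_language_issues(prompt: str) -> list[str]:
    """Return a list of problems with the camera wording in a prompt (empty if none)."""
    text = prompt.lower()
    issues = [f"vague camera term: {term!r}" for term in VAGUE_CAMERA_TERMS if term in text]
    if not any(move in text for move in CAMERA_MOVES):
        issues.append("no recognized camera move")
    return issues


print(camera_language_issues("Medium shot, 50mm natural lens, slow push in"))
print(camera_language_issues("A dynamic shot of a city at night"))
```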

Step 3: Set lighting and time of day

Lighting is mood. The same scene shot at golden hour and under fluorescent light reads as two different songs.

Lighting vocabulary the models recognize:

  • Golden hour, blue hour, magic hour
  • Harsh midday sun, overcast diffused
  • Neon glow, sodium streetlamp
  • Candlelight, firelight, lantern
  • Moonlight, starlight, no moon
  • Hard fluorescent, soft tungsten, mixed practical

Pair with a time of day so the model has a temporal anchor. “3am, neon glow, rain on pavement” gives the model enough to lock the entire palette.

The cinematic AI music video prompts piece has a longer catalog of lighting setups by genre.

Step 4: Anchor the style with references or a named aesthetic

If your tool supports image references, use them. The model is dramatically better at copying a reference than parsing your description of one. Two or three reference images that share a palette and a mood will pull the output toward that aesthetic without you having to write a single style adjective.

If you do not have references, name one aesthetic the model recognizes. One aesthetic, not five. Five compete and the model averages. Examples that work:

  • 35mm film grain, faded
  • Anime cel shading, Studio Ghibli warmth
  • Brutalist concrete, harsh shadows
  • Vaporwave, pink and teal, VHS artifacts
  • Dark academia, candlelit library
  • Wes Anderson symmetrical, pastel

One aesthetic, repeated identically across every scene prompt in the video. That is how you get visual continuity.
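One way to guarantee the anchor really is identical in every scene is to store it once and insert it programmatically. A minimal sketch, assuming you build prompts as plain strings; the function name and the two scene descriptions are made up for illustration, and the anchor sits in the middle so the element order from the list above is preserved.

```python
# One aesthetic for the whole video, stored once so it cannot drift between scenes.
STYLE_ANCHOR = "35mm film grain, faded"


def scene_prompt(camera_and_light: str, subject_and_motion: str,
                 style_anchor: str = STYLE_ANCHOR) -> str:
    """Place the shared style anchor between the camera/lighting part and the subject part."""
    return f"{camera_and_light}. {style_anchor}. {subject_and_motion}."


# Illustrative scenes: everything varies except the style anchor.
scenes = [
    ("Wide shot, 24mm wide angle, static. Blue hour, overcast diffused",
     "An empty street, leaves rustling in the trees"),
    ("Medium shot, 50mm natural, slow push in. 3am, neon glow",
     "A woman in a red coat walks past a closed record store, hair blowing in the wind"),
]
prompts = [scene_prompt(camera, subject) for camera, subject in scenes]
for p in prompts:
    print(p)
```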

Step 5: Add motion direction

A common rookie mistake is writing a prompt that describes a still image, then complaining that the resulting video is barely moving. Tell the model what should move.

  • Hair blowing in the wind
  • Rain falling diagonally, splashing on pavement
  • Slow zoom on her eyes
  • Steam rising from the coffee
  • Leaves rustling in the trees
  • Cars passing in the background

Motion direction is what separates an AI music video shot from an AI photograph that has been animated. Models default to minimal motion if you do not push them. Push them.

Step 6: Set a negative prompt

Negative prompts tell the model what to avoid. Most slop you see in AI video has the same root causes, and a negative prompt cleans them up.

A reusable negative prompt I keep in my notes:

no text, no watermark, no captions, no subtitles, no logos, no deformed hands, no extra fingers, no extra limbs, no blur, no low resolution, no cartoon look (when going for realistic), no plastic skin

Tune by genre. Anime style? Drop the cartoon line. Realistic horror? Drop the deformed hands line because deformed hands might be the point. Negative prompts are not a fixed list. They are a list you adjust to the output you keep getting.
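The genre tuning described above can be sketched as a base list plus a set of terms to drop. The names here are hypothetical and the genre keys are just the two examples from this section; the point is that the list is data you adjust, not a fixed string you paste.

```python
# Base terms from the reusable negative prompt above, without the "no" prefix.
BASE_NEGATIVES = [
    "text", "watermark", "captions", "subtitles", "logos",
    "deformed hands", "extra fingers", "extra limbs",
    "blur", "low resolution", "cartoon look", "plastic skin",
]

# Terms to drop per genre (only the two examples given in this section).
DROP_BY_GENRE = {
    "anime": {"cartoon look"},
    "realistic horror": {"deformed hands"},
}


def negative_prompt(genre: str = "") -> str:
    """Build the negative prompt, leaving out terms that the genre wants to keep."""
    dropped = DROP_BY_GENRE.get(genre, set())
    return ", ".join(f"no {term}" for term in BASE_NEGATIVES if term not in dropped)


print(negative_prompt("anime"))
print(negative_prompt("realistic horror"))
```

When a model keeps producing an artifact that is not on the list, add the term to BASE_NEGATIVES rather than editing individual scene prompts.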

Step 7: Iterate from a base seed

When a generation almost works, lock the seed and change one element at a time. This is the single most underused technique in AI video work.

If you change the prompt and the seed at the same time, you cannot tell whether the new output is different because of your edit or because of the new seed. Lock the seed, change one word, generate again, look at the diff. That is how you learn which words actually move the model.

Set a hard limit on iterations. Three to five generations per scene. If the fifth generation is not better than the third, the prompt is wrong, not your luck. Rewrite the prompt rather than regenerate again.
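The locked-seed routine can be planned before you touch the generator. The sketch below does not call any video model; it only produces the list of (seed, prompt) pairs you would submit, under the assumptions that your tool exposes a seed setting and that each run changes exactly one phrase relative to the base prompt. The function name and the default cap are illustrative.

```python
def iteration_plan(base_prompt: str, seed: int,
                   swaps: list[tuple[str, str]], max_runs: int = 5) -> list[tuple[int, str]]:
    """Return (seed, prompt) pairs: the base prompt first, then one single-phrase swap per run.

    Every run reuses the same seed, and every variant differs from the base
    in exactly one phrase, so any change in the output is due to that edit.
    """
    plan = [(seed, base_prompt)]
    for old, new in swaps:
        if len(plan) >= max_runs:
            break  # hard limit reached: rewrite the prompt instead of regenerating
        if old not in base_prompt:
            raise ValueError(f"phrase not found in base prompt: {old!r}")
        plan.append((seed, base_prompt.replace(old, new, 1)))
    return plan


base = "Medium shot, 50mm natural lens, slow push in. 3am, neon glow."
for run_seed, prompt in iteration_plan(
    base, seed=1234,
    swaps=[("slow push in", "handheld"), ("neon glow", "sodium streetlamp")],
):
    print(run_seed, prompt)
```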

Putting it together

A full prompt that has all seven elements:

Medium shot, 50mm natural lens, slow push in. 3am, neon glow, light rain on pavement. 35mm film grain, faded color palette. A woman in a red coat walks past a closed record store, neon sign flickering in the window. Hair lightly blowing, raindrops catching the neon. Negative: no text, no watermark, no deformed hands, no blur.

That prompt will not give you a perfect shot every time. It will give you a shot you can iterate from. That is the actual goal. Perfect on first generation is not how this works.
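A quick sanity check on length helps before you generate. The sketch below counts whitespace-separated words and compares the total against the 40 to 120 word range given in the FAQ at the end of this piece. The function names are hypothetical, and the thresholds are that rule of thumb, not limits published by any model vendor.

```python
EXAMPLE_PROMPT = (
    "Medium shot, 50mm natural lens, slow push in. "
    "3am, neon glow, light rain on pavement. "
    "35mm film grain, faded color palette. "
    "A woman in a red coat walks past a closed record store, "
    "neon sign flickering in the window. "
    "Hair lightly blowing, raindrops catching the neon. "
    "Negative: no text, no watermark, no deformed hands, no blur."
)


def word_count(prompt: str) -> int:
    """Count whitespace-separated words."""
    return len(prompt.split())


def length_check(prompt: str, low: int = 40, high: int = 120) -> str:
    """Classify a prompt as too short, too long, or within the target range."""
    n = word_count(prompt)
    if n < low:
        return f"{n} words: too short, the model will fill in defaults"
    if n > high:
        return f"{n} words: too long, the back half may be ignored"
    return f"{n} words: within range"


print(length_check(EXAMPLE_PROMPT))
```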

The Runway vs Pika vs Luma comparison covers how each model parses these elements differently, which helps you tune the prompt to the tool you are using.

A note on prompt overfitting

Once you write a prompt that works, you will be tempted to reuse it across every scene. Resist. The seven element structure stays the same. The content inside each element should change scene to scene to match what the song is doing.

A music video where every shot uses the same lighting and the same camera move is not a music video. It is a slideshow. Vary lighting across sections (verse, chorus, bridge). Vary camera moves across shots. Keep the style anchor and the negative prompt constant. Move everything else.
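If your shot list lives in a structured form, the rule above (constant style anchor and negative prompt, varied lighting and camera moves) can be checked mechanically. This is a rough sketch under my own assumptions: each shot is a dict with the four hypothetical keys shown, and "varies" simply means more than one distinct value appears across the list.

```python
def continuity_warnings(shots: list[dict[str, str]]) -> list[str]:
    """Flag shot lists that vary what should stay fixed, or fix what should vary."""
    warnings = []
    # These two should be identical in every shot.
    for key in ("style_anchor", "negative"):
        if len({shot[key] for shot in shots}) > 1:
            warnings.append(f"{key} should stay constant across shots")
    # These two should change somewhere in the video.
    for key in ("lighting", "camera_move"):
        if len(shots) > 1 and len({shot[key] for shot in shots}) == 1:
            warnings.append(f"{key} never varies; the video may read as a slideshow")
    return warnings


shots = [
    {"style_anchor": "35mm film grain, faded", "negative": "no text, no watermark",
     "lighting": "3am, neon glow", "camera_move": "slow push in"},
    {"style_anchor": "35mm film grain, faded", "negative": "no text, no watermark",
     "lighting": "3am, neon glow", "camera_move": "handheld"},
]
print(continuity_warnings(shots))
```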

If you want a working environment that handles the seven element prompt structure for you and lets you iterate scene by scene without leaving the project, try Melodex. The prompt format above is the same one the platform’s scene editor is built around, so the patterns transfer cleanly.

Frequently asked questions

How long should an AI video prompt be?
For 2026 image to video models, useful prompts run 40 to 120 words. Shorter than 40 and the model fills in random defaults. Longer than 120 and the model starts ignoring the back half. Aim for the dense middle.
Do negative prompts actually work?
Yes, on most current models, with caveats. Runway, Pika, and Luma all parse negative prompts, but each one weights them differently. Test one negative term at a time so you know which one actually moved the output.
Should I use one big prompt or multiple short ones?
Multiple short ones for a music video. Each scene gets its own prompt that shares a global style brief with the others. One big prompt produces a slideshow of unrelated shots.
How important are reference images?
Very. Reference images do more than any sentence in the prompt. If the model supports image to video or style reference, use it. The model is much better at copying a reference than parsing your description of it.
Why do my prompts produce text and watermarks?
Models trained on stock footage and YouTube data sometimes hallucinate watermarks, captions, and text overlays. Add 'no text, no watermark, no captions, no logos' to the negative prompt and regenerate.
Should I name a director or DP for style?
Naming a famous director gets you a stylistic ballpark but the output rarely lands cleanly. Naming a lighting style (golden hour, harsh fluorescent, neon glow) and a camera move (slow push in, handheld) gets you further than name dropping.
How many iterations until a scene works?
Set a hard limit of three to five generations per scene. If the fifth try is not better than the third, your prompt is wrong, not your luck. Rewrite, do not regenerate.
