
Soundscapes for Web Games: The Layer Beneath Your SFX - Part 1

Part 1 of 2. Part 2 is the Howler implementation: a ~150-line engine, a six-function bus adapter, and host bindings for React, Phaser, and three.js.

I’ve always had opinions about how web games sound. I think most game devs do, but we mostly keep them to ourselves because audio is the part of the stack nobody owns. So here’s mine: most web games sound flat, and it’s not because the samples are bad. It’s because the game treats audio as event SFX. A footstep here, a click there, a music loop in the background. The space between those events is silent, and silence is what makes a room feel empty before anything happens.

The fix is a layer underneath the events. I call it a Soundscape, and it’s what tells your ear “you are somewhere” before a single thing is interacted with.

Soundscape vs music vs SFX

These three get conflated constantly. They are not the same thing, and they really do not want the same treatment from your mixer.

| Layer | What it is | Triggered by | Wants |
| --- | --- | --- | --- |
| Music | A composed, looping track | Scene or beat | Its own bus, fade in/out, ducks rarely |
| SFX | One-shot reactions to gameplay | Player or game | Tight latency, no fades |
| Soundscape | Ambient bed plus randomized non-musical emitters | A timer, not you | A bus that ducks under dialog |

The Soundscape has two parts. The Bed is a looping ambience (room tone, wind, distant traffic). The Emitters are short non-musical samples scheduled on a timer with randomized pan, volume, and pitch (a creaking floorboard, a bird, a far-off door). Together they read as “this place exists.”
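
To make that concrete, here is roughly what a soundscape definition can look like. This is a minimal sketch; the shape and every name in it are mine for illustration, not the API from the Part 2 engine.

```ts
// Illustrative shape only -- these names are assumptions, not the Part 2 API.
interface EmitterDef {
  samples: string[];          // 3-6 short, non-musical one-shots
  delayMs: [number, number];  // randomized gap between fires
  pan: [number, number];      // e.g. [-0.4, 0.4]
  volume: [number, number];   // e.g. [0.7, 1.0]
  rate: [number, number];     // e.g. [0.95, 1.05]
}

interface SoundscapeDef {
  bed: { src: string; volume: number }; // looping room tone
  emitters: EmitterDef[];               // timer-scheduled one-shots
}

const abandonedHouse: SoundscapeDef = {
  bed: { src: "audio/room-tone.ogg", volume: 0.5 },
  emitters: [
    {
      samples: ["creak-1.ogg", "creak-2.ogg", "creak-3.ogg"],
      delayMs: [4000, 15000],
      pan: [-0.4, 0.4],
      volume: [0.7, 1.0],
      rate: [0.95, 1.05],
    },
  ],
};
```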

That’s the whole idea.

Why it matters more on the web than native

Native engines hand you a spatial mixer, an audio graph, bus routing, and priority systems for free. Unity has AudioMixerGroup. Unreal has Submixes and Sound Cues. You wire the graph in an editor and forget about it.

The web hands you AudioContext, a GainNode, and a polite suggestion to figure the rest out yourself. Phaser ships WebAudioSound; three.js ships PositionalAudio (and drei wraps it as <PositionalAudio> for r3f). Neither gives you scheduled non-musical ambience. Neither gives you a bus that can duck a category of sounds while a voice-over plays.

“Here’s our WebAudioSound class. We’ll give you play(), pause(), and a volume number. Go figure out the rest.”

That gap is why most web games never get past “music + SFX.” The work is not hard. Nothing in your engine of choice nudges you toward doing it.
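
For a sense of scale, here is the routing half of “the rest.” A bus is nothing more than a GainNode per category feeding a master; this sketch assumes only the Web Audio API, and the category names are mine.

```ts
// The routing layer you have to build yourself: one GainNode per
// category, all feeding a master. Plain Web Audio, no libraries.
const ctx = new AudioContext();

const master = ctx.createGain();
master.connect(ctx.destination);

const buses = {
  music: ctx.createGain(),
  sfx: ctx.createGain(),
  ambience: ctx.createGain(),
  voice: ctx.createGain(),
};
for (const bus of Object.values(buses)) bus.connect(master);

// Anything routed through buses.ambience can now be faded or ducked
// as a category, independent of music and SFX.
```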


So what does the layer actually look like in practice? Three contexts cover most cases.

What “good” looks like in three contexts

The same engine fits all three. What changes is how much you lean on each part.

2D Phaser games. Bed keyed to the scene. Emitters scheduled in scene update or via timers. The listener position rarely moves, so pan is decorative rather than spatial. The win here is variety: three or four samples per emitter with last-picked exclusion turns a static loop into something the ear cannot lock onto. Cheap, huge.
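
A sketch of that scheduling in a Phaser 3 scene. This is not the Part 2 engine; keys, ranges, and delays are illustrative, and the pan call assumes the Web Audio sound manager on Phaser 3.50+.

```ts
import Phaser from "phaser";

// One self-rescheduling emitter. Returns a cancel function so the
// scene can kill the timer on shutdown.
function startEmitter(scene: Phaser.Scene, keys: string[]) {
  let timer: Phaser.Time.TimerEvent;

  const fire = () => {
    // Plain random pick here; last-picked exclusion is covered below.
    const key = Phaser.Utils.Array.GetRandom(keys);
    const sound = scene.sound.add(key, {
      volume: Phaser.Math.FloatBetween(0.7, 1.0),
      rate: Phaser.Math.FloatBetween(0.95, 1.05),
    });
    if (sound instanceof Phaser.Sound.WebAudioSound) {
      sound.setPan(Phaser.Math.FloatBetween(-0.4, 0.4)); // decorative pan
    }
    sound.once("complete", () => sound.destroy()); // don't leak sounds
    sound.play();

    timer = scene.time.delayedCall(Phaser.Math.Between(4000, 15000), fire);
  };

  timer = scene.time.delayedCall(Phaser.Math.Between(4000, 15000), fire);
  return () => timer.remove(); // cancel on scene shutdown
}
```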

3D three.js or r3f games. Bed is still 2D. A room tone has no position; it is everywhere. Emitters can be either 2D-randomized or true positional via PositionalAudio for things that should pan as the camera turns. Ducking starts to matter more here because dialog and narration get buried under ambience faster in a 3D scene, where you also have music and footsteps and UI all competing.
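
For the positional case, three.js handles the panning once the emitter has a position. A hedged sketch, with camera and doorMesh standing in for your own scene objects:

```ts
import * as THREE from "three";

// Stand-ins for your own scene objects -- purely illustrative.
declare const camera: THREE.PerspectiveCamera;
declare const doorMesh: THREE.Mesh;

// One listener on the camera, one positional emitter on the mesh.
const listener = new THREE.AudioListener();
camera.add(listener);

const creak = new THREE.PositionalAudio(listener);
new THREE.AudioLoader().load("audio/creak.ogg", (buffer) => {
  creak.setBuffer(buffer);
  creak.setRefDistance(2); // distance falloff; tune per emitter
});
doorMesh.add(creak);

// On each randomized timer fire: vary, then play. Panning comes free
// as the camera turns, because the emitter has a position.
creak.setVolume(0.7 + Math.random() * 0.3);
creak.setPlaybackRate(0.95 + Math.random() * 0.1);
creak.play();
```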

Narrative scenes. Ducking is the single biggest win. Without it, voice-over fights ambience and you end up cranking the VO bus until ambience is a whisper. With a 250ms duck on the ambience bus, the VO sits naturally and the room comes back as soon as the line ends.
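
Here is that duck as a sketch, assuming the GainNode-per-bus layout from earlier. The 0.25 floor is a taste call, not a rule.

```ts
// Ramp a bus to a target gain over ms milliseconds (default 250).
function rampBusTo(ctx: AudioContext, bus: GainNode, target: number, ms = 250) {
  const now = ctx.currentTime;
  bus.gain.cancelScheduledValues(now);
  bus.gain.setValueAtTime(bus.gain.value, now); // anchor the ramp at the current value
  bus.gain.linearRampToValueAtTime(target, now + ms / 1000);
}

rampBusTo(ctx, buses.ambience, 0.25); // VO line starts: duck the bed
rampBusTo(ctx, buses.ambience, 1.0);  // VO line ends: the room comes back
```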

The room comes back. That’s the goal.

The three knobs that buy the most realism

Most of what makes a Soundscape feel alive comes down to three things. Skip any of them and it sounds like a tape loop.

A Sample Pool per emitter, with last-picked exclusion. Each emitter (say, “creaking floorboard”) holds three to six short samples. Each fire picks one at random, but never the one it just played. This is the single biggest perceptual upgrade per byte of work. The ear locks onto immediate repeats faster than it locks onto anything else.
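
The exclusion fits in a dozen lines. makePicker is a hypothetical helper, not from any library:

```ts
// Uniform pick over everything except the previous index.
function makePicker<T>(pool: T[]) {
  let last = -1; // nothing played yet
  return (): T => {
    let i: number;
    if (last === -1 || pool.length < 2) {
      i = Math.floor(Math.random() * pool.length);
    } else {
      i = Math.floor(Math.random() * (pool.length - 1));
      if (i >= last) i += 1; // skip the last-picked slot
    }
    last = i;
    return pool[i];
  };
}

const nextCreak = makePicker(["creak-1", "creak-2", "creak-3", "creak-4"]);
```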

Random pan, volume, and pitch within tight ranges. Pan ±0.4, volume 0.7 to 1.0, rate 0.95 to 1.05. Tight enough that nothing sounds broken; wide enough that no two fires are identical. Pitch variance is the easiest one to overdo. Past about ±5%, your samples start sounding chipmunky or sluggish. Ask me how I know.
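
Part 2 builds on Howler, so as a preview of the shape, here is per-fire variation using Howler's per-sound-id setters. The function names are mine, and the real engine wraps this up properly.

```ts
import { Howl } from "howler";

// stereo() lives in Howler's spatial plugin, which ships in the default build.
const rand = (min: number, max: number) => min + Math.random() * (max - min);

function fireWithVariation(howl: Howl) {
  const id = howl.play();           // per-call id: the setters below
  howl.stereo(rand(-0.4, 0.4), id); // affect this fire only
  howl.volume(rand(0.7, 1.0), id);
  howl.rate(rand(0.95, 1.05), id);  // stay inside ±5% pitch
}
```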

A Concurrency Cap. maxConcurrentEmitterInstances is the cheap insurance policy. A stretch of bad RNG will occasionally fire three emitters within 200ms; without a cap, that stretch craters your mix and clips the master. Cap it at three or four and the cap will fire maybe once a minute, inaudibly.
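
The cap itself is a counter and an early return. A sketch; only the knob name comes from the text above, the bookkeeping is illustrative.

```ts
import { Howl } from "howler";

const maxConcurrentEmitterInstances = 3;
let liveVoices = 0;

function tryFire(howl: Howl) {
  if (liveVoices >= maxConcurrentEmitterInstances) return; // drop it, inaudibly
  liveVoices++;
  const id = howl.play();
  howl.once("end", () => { liveVoices--; }, id); // free the slot when this fire ends
}
```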

Three knobs. That’s it.

Where’s the catch?

There is no free lunch, so:

Anti-patterns

A short list of things that look like they should work and do not.

Track your timers. Always has been.

What’s next

Part 2 walks through a pluggable implementation: one engine-agnostic class, a six-function bus adapter contract, and three host bindings (React, Phaser, imperative three.js). The engine is about 150 lines. What varies between projects is the bus underneath it and the lifecycle binding above it.

Both are smaller than you would guess.

Stay in touch

Don't miss out on new posts or project updates. Hit me up on X (Twitter) for updates, queries, or some good ol' tech talk.

Follow @zkMake
Written by Zubin