Sketching with your voice

Demo & sound examples

Vocal foley demo

We produced a short animation in which every sound effect is a vocal imitation synthesized by our system. All 11 sound effects in the video are available in the tables below, along with the original sounds that were imitated (the referents). The quiet siren is whispered (see the Whispered utterances table).

Audio clips

Recommended browsers: Chrome, Firefox, Safari.

The Full and Baseline methods are summarized in Section 3.4 of the paper.

Audio samples are top-1 predictions unless noted otherwise. Generated referents are looped twice in the audio clip for clarity.

Contents: Examples (hand-picked) | Random sample (masculine speaker) | Random sample (feminine speaker) | Whispering | Vocal enhancer model

Examples

Referent Name | Referent | Human | Full Method | Baseline Method
bell
racecar
door creak
horn section: not in VocalImitationSet
dripping water: not in VocalImitationSet

-

Random sample of 16 referents from VocalImitationSet (5,601 sounds)

Vocal tract model tuned to stereotypically masculine speaker

Referent Name | Referent | Human | Full Method | Baseline Method
Matches human
leaves rustling
cat hiss
tapping
glasses clinking
cat meow
microwave
siren
crow caw
motorboat
sawing
machine gun: rank-3 result
bees

Failure modes
organ: see paper for discussion of why our method does not support imitating music
string ensemble: see paper for discussion of why our method does not support imitating music
heartbeat: see paper for discussion of why our method does not match culturally established onomatopoeia ("ba-boom")

stomach grumble: humans tend to use the velar fricative [x] to imitate this, which is out of scope for our vocal tract model; the model nonetheless produces a qualitatively similar sound by other means

-

Random sample of 16 referents from VocalImitationSet (5,601 sounds)

Vocal tract model tuned to stereotypically feminine speaker

Referent Name | Referent | Human | Full Method | Baseline Method
Matches human
leaves rustling
cat hiss
tapping
glasses clinking
cat meow
microwave
siren
crow caw
motorboat
sawing
machine gun: rank-3 result
bees

Failure modes
organ: see paper for discussion of why our method does not support imitating music
string ensemble: see paper for discussion of why our method does not support imitating music
heartbeat: see paper for discussion of why our method does not match culturally established onomatopoeia ("ba-boom")

stomach grumble: humans tend to use the velar fricative [x] to imitate this, which is out of scope for our vocal tract model; the model nonetheless produces a qualitatively similar sound by other means

-

Whispered utterances

Referent Name | Referent | Full Method | Baseline Method
siren
string ensemble
leaves rustling: this imitation is typically unvoiced even without the whispering constraint; indeed, our method continues to produce the same utterance under the whispering condition

-

Effect of pretrained vocal enhancer model

Referent Name | Pre-cleanup model | Post-cleanup model
siren
motorboat