Sketching with your voice

We produced a short animation in which every sound effect is a vocal imitation synthesized by our system. All 11 sound effects in the video are available in the tables below, along with the original sounds that were imitated (the referents). The quiet siren is whispered (see the Whispering table).

Recommended browsers: Chrome, Firefox, Safari.

The Full and Baseline methods are summarized in Section 3.4 of the paper.

Audio samples are the top-1 prediction unless noted otherwise. Generated referents are looped twice in the audio clip for clarity.
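The two conventions above (choosing a prediction by rank, and looping a clip twice) can be illustrated with a minimal sketch. This is not the code used to build this page: the function names `loop_clip` and `pick_rank`, the plain-list representation of audio samples, and the assumption that a higher score means a better candidate are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple


def loop_clip(samples: Sequence[float], repeats: int = 2) -> List[float]:
    """Return the clip played back to back `repeats` times.

    Hypothetical helper: audio is modeled as a flat sequence of samples.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    return list(samples) * repeats


def pick_rank(candidates: Sequence[Tuple[str, float]], rank: int = 1) -> str:
    """Return the label of the candidate at a 1-based rank.

    Assumption: each candidate is (label, score) and a higher score is better.
    rank=1 is the top-1 prediction; rank=3 is a "rank-3 result".
    """
    if not 1 <= rank <= len(candidates):
        raise ValueError("rank out of range")
    ordered = sorted(candidates, key=lambda c: c[1], reverse=True)
    return ordered[rank - 1][0]


if __name__ == "__main__":
    clip = [0.1, -0.2, 0.3]
    print(len(loop_clip(clip)))  # twice the original length

    scored = [("imitation_a", 0.2), ("imitation_b", 0.9), ("imitation_c", 0.5)]
    print(pick_rank(scored))          # top-1 prediction
    print(pick_rank(scored, rank=3))  # rank-3 result
```

Under these assumptions, a table entry marked "rank-3 result" would correspond to `pick_rank(scored, rank=3)` rather than the default top-1 choice.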

Contents: Examples (hand-picked) | Random sample (masculine speaker) | Random sample (feminine speaker) | Whispering | Vocal enhancer model

Examples (hand-picked)

Referent Name	Referent	Human	Full Method	Baseline Method
bell
racecar
door creak
horn section	not in VocalImitationSet
dripping water	not in VocalImitationSet

Random sample (masculine speaker)

Referent Name	Referent	Human	Full Method	Baseline Method

Matches human
leaves rustling
cat hiss
tapping
glasses clinking
cat meow
microwave
siren
crow caw
motorboat
sawing
machine gun	rank-3 result
bees

Failure modes
organ	see paper for discussion of why our method does not support imitating music
string ensemble	see paper for discussion of why our method does not support imitating music
heartbeat	see paper for discussion of why our method does not match culturally established onomatopoeia ("ba-boom")
stomach grumble	humans tend to use the velar fricative [x] to imitate this, which is out of scope for our vocal tract model; the model nonetheless produces a qualitatively similar sound by other means

Random sample (feminine speaker)

Referent Name	Referent	Human	Full Method	Baseline Method

Matches human
leaves rustling
cat hiss
tapping
glasses clinking
cat meow
microwave
siren
crow caw
motorboat
sawing
machine gun	rank-3 result
bees

Failure modes
organ	see paper for discussion of why our method does not support imitating music
string ensemble	see paper for discussion of why our method does not support imitating music
heartbeat	see paper for discussion of why our method does not match culturally established onomatopoeia ("ba-boom")
stomach grumble	humans tend to use the velar fricative [x] to imitate this, which is out of scope for our vocal tract model; the model nonetheless produces a qualitatively similar sound by other means

Whispering

Referent Name	Referent	Full Method	Baseline Method
siren
string ensemble
leaves rustling	this imitation is typically unvoiced even without the whispering constraint; indeed, our method continues to produce the same utterance under the whispering condition

Vocal enhancer model

Referent Name	Pre-cleanup model	Post-cleanup model
siren
motorboat

Vocal foley demo

Audio clips

Examples
Random sample of 16 referents from VocalImitationSet (5,601 sounds), masculine speaker
Random sample of 16 referents from VocalImitationSet (5,601 sounds), feminine speaker
Whispered utterances
Effect of pretrained vocal enhancer model
