Demo & sound examples
We produced a short animation where all sound effects were vocal imitations synthesized by our system. All 11 sound effects in this video are available in the tables below, along with the original sounds that were imitated (referents). The quiet siren is whispered (last table).
Recommended browsers: Chrome, Firefox, Safari.
The Full and Baseline methods are summarized in Section 3.4 of the paper.
Audio samples are top-1 predictions unless otherwise noted. Generated referents are looped twice in each audio clip for clarity.
Referent Name | Referent | Human | Full Method | Baseline Method |
---|---|---|---|---|
bell | ||||
racecar | ||||
door creak | ||||
horn section | not in VocalImitationSet | |||
dripping water | not in VocalImitationSet |
Vocal tract model tuned to a stereotypically masculine speaker
Referent Name | Referent | Human | Full Method | Baseline Method |
---|---|---|---|---|
Matches human | ||||
leaves rustling | ||||
cat hiss | ||||
tapping | ||||
glasses clinking | ||||
cat meow | ||||
microwave | ||||
siren | ||||
crow caw | ||||
motorboat | ||||
sawing | ||||
machine gun (rank-3 result) | ||||
bees | ||||
Failure modes | ||||
organ (see paper for discussion of why our method does not support imitating music) | ||||
string ensemble (see paper for discussion of why our method does not support imitating music) | ||||
heartbeat (see paper for discussion of why our method does not match the culturally established onomatopoeia "ba-boom") | ||||
stomach grumble (humans tend to use the velar fricative [x] to imitate this, which is out of scope for our vocal tract model; the model nonetheless produces a qualitatively similar sound by other means) |
Vocal tract model tuned to a stereotypically feminine speaker
Referent Name | Referent | Human | Full Method | Baseline Method |
---|---|---|---|---|
Matches human | ||||
leaves rustling | ||||
cat hiss | ||||
tapping | ||||
glasses clinking | ||||
cat meow | ||||
microwave | ||||
siren | ||||
crow caw | ||||
motorboat | ||||
sawing | ||||
machine gun (rank-3 result) | ||||
bees | ||||
Failure modes | ||||
organ (see paper for discussion of why our method does not support imitating music) | ||||
string ensemble (see paper for discussion of why our method does not support imitating music) | ||||
heartbeat (see paper for discussion of why our method does not match the culturally established onomatopoeia "ba-boom") | ||||
stomach grumble (humans tend to use the velar fricative [x] to imitate this, which is out of scope for our vocal tract model; the model nonetheless produces a qualitatively similar sound by other means) |
Referent Name | Referent | Full Method | Baseline Method |
---|---|---|---|
siren | |||
string ensemble | |||
leaves rustling (this imitation is typically unvoiced even without the whispering constraint; indeed, our method continues to produce the same utterance under the whispering condition) |
Referent Name | Pre-cleanup model | Post-cleanup model |
---|---|---|
siren | ||
motorboat |