First things first: Where can I download this? — See the download-link below.
meSpeak.js (modulary enhanced speak.js) is a 100% client-side JavaScript text-to-speech library based on the speak.js project (see below).
meSpeak.js adds support for Webkit and Safari and introduces loadable voice modules. There is also no longer any need for an embedding HTML element.
Separating the code of the library from config-data and voice definitions should help future optimizations of the core part of speak.js.
All separated data has been compressed to base64-encoded strings from the original binary files to save some bandwidth (compared to JS-arrays of raw 8-bit data).
Browser requirements: Firefox, Chrome/Opera, Webkit, and Safari (MSIE11 is expected to be compliant).
meSpeak.js 2011-2015 by Norbert Landsteiner, mass:werk – media environments; http://www.masswerk.at/mespeak/
Important Changes:
v.1.1 adds support for the Web Audio API (AudioContext), which is now the preferred option for playback with the HTMLAudioElement as a fall-back.
Thanks to the new method of playback meSpeak.js was tested successfully with iOS/Safari (iOS 6).
Also starting with v.1.1, there is an option to export the raw speech-data rather than playing the sound (see: options, "rawdata").
v.1.2 adds volume control and the capacity to play back cached streams generated using the "rawdata" option.
v.1.5 adds an optional callback-argument to the methods meSpeak.speak() and meSpeak.play().
v 1.6 adds support for voice-variants (like female voices) and includes the required definitions files.
v 1.7 finally supports the complete set of usable espeak-options.
v 1.9 adds meSpeak.stop() in order to stop all sounds or one or more specific sound(s).
v 1.9.3 adds support for the complete Basic Latin and Latin-1 Supplement Unicode range (U+0000 .. U+00FF).
v 1.9.4 meSpeak now recovers gracefully and transparently from any internal FS errors (which might show up at the 80th call to meSpeak.speak()).
v 1.9.5 Added meSpeak.speakMultipart() in order to combine multiple voices.
v 1.9.7.1 Important Update: Fixes for changes in the Web Audio API in Apple Safari 9.x (iOS and Mac OS X).
Some real world examples (at masswerk.at):
• Explore client-side speech I/O with E.L.I.Z.A. Talking
• Celebrating meSpeak.js v.1.5: JavaScript Doing The JavaScript Rap (featuring MC meSpeak) (a heavy performance test)
meSpeak.loadConfig("mespeak_config.json");
meSpeak.loadVoice('en-us.json');
meSpeak.speak('hello world');
meSpeak.speak('hello world', { option1: value1, option2: value2 .. });
meSpeak.speak('hello world', { option1: value1, option2: value2 .. }, myCallback);

var id = meSpeak.speak('hello world');
meSpeak.stop(id);

meSpeak.speak( text [, { option1: value1, option2: value2 .. } [, callback ]] );

text: The string of text to be spoken.
      The text may contain line-breaks ("\n") and special characters.
      Default text-encoding is UTF-8 (see the option "utf16" for other).

options (eSpeak command-options):
* amplitude: How loud the voice will be (default: 100)
* pitch:     The voice pitch (default: 50)
* speed:     The speed at which to talk (words per minute) (default: 175)
* voice:     Which voice to use (default: last voice loaded or defaultVoice, see below)
* wordgap:   Additional gap between words in 10 ms units (default: 0)
* variant:   One of the variants to be found in the eSpeak-directory "~/espeak-data/voices/!v"
             Variants add some effects to the normally plain voice, e.g. notably a female tone.
             Valid values are:
               "f1", "f2", "f3", "f4", "f5" for female voices
               "m1", "m2", "m3", "m4", "m5", "m6", "m7" for male voices
               "croak", "klatt", "klatt2", "klatt3", "whisper", "whisperf" for other effects.
             (Using eSpeak, these would be appended to the "-v" option by "+" and the value.)
             Note: Try "f2" or "f5" for a female voice.
* linebreak: (Number) Line-break length, default value: 0.
* capitals:  (Number) Indicate words which begin with capital letters.
             1: Use a click sound to indicate when a word starts with a capital letter,
                or double click if word is all capitals.
             2: Speak the word "capital" before a word which begins with a capital letter.
             Other values: Increases the pitch for words which begin with a capital letter.
                The greater the value, the greater the increase in pitch. (eg.: 20)
* punct:     (Boolean or String) Speaks the names of punctuation characters when they are
             encountered in the text. If a string of characters is supplied, then only those
             listed punctuation characters are spoken, eg. { "punct": ".,;?" }.
* nostop:    (Boolean) Removes the end-of-sentence pause which normally occurs at the end of the text.
* utf16:     (Boolean) Indicates that the input is UTF-16, default: UTF-8.
* ssml:      (Boolean) Indicates that the text contains SSML (Speech Synthesis Markup Language)
             tags or other XML tags. (A small set of HTML is supported too.)

further options (meSpeak.js specific):
* volume:    Volume relative to the global volume (number, 0..1, default: 1)
             Note: the relative volume has no effect on the export using option 'rawdata'.
* rawdata:   Do not play, return data only.
             The type of the returned data is derived from the value (case-insensitive) of 'rawdata':
             - 'base64': returns a base64-encoded string.
             - 'mime':   returns a base64-encoded data-url (including the MIME-header).
                         (synonyms: 'data-url', 'data-uri', 'dataurl', 'datauri')
             - 'array':  returns a plain Array object with uint 8 bit data.
             - default (any other value): returns the generated wav-file as an ArrayBuffer (8-bit unsigned).
             Note: The value of 'rawdata' must evaluate to boolean 'true' in order to be recognized.
* log:       (Boolean) Logs the compiled eSpeak-command to the JS-console.

callback: An optional callback function to be called after the sound output ended.
          The callback will be called with a single boolean argument indicating success.
          If the resulting sound is stopped by meSpeak.stop(), the success-flag will be set to false.

Returns:
* if called with option rawdata: a stream in the requested format
  (or null, if the required resources have not loaded yet).
* default: a 32bit integer ID greater than 0 (or 0 on failure).
  The ID may be used to stop this sound by calling meSpeak.stop(<id>).

if (meSpeak.isVoiceLoaded('de')) meSpeak.setDefaultVoice('de');
// note: the default voice is always the last voice loaded

meSpeak.loadVoice('fr.json', userCallback);
// userCallback is an optional callback-handler. The callback will receive two arguments:
// * a boolean flag for success
// * either the id of the voice, or a reason for errors ('network error', 'data error', 'file error')

alert(meSpeak.getDefaultVoice()); // 'fr'

if (meSpeak.isConfigLoaded()) meSpeak.speak('Configuration data has been loaded.');
// note: any calls to speak() will be deferred, if no valid config-data has been loaded yet.

meSpeak.setVolume(0.5);

meSpeak.setVolume( volume [, id-list] );
Sets a volume level (0 <= v <= 1)
* if called with a single argument, the method sets the global playback-volume, any sounds
  currently playing will be updated immediately with respect to their relative volume (if specified).
* if called with more than a single argument, the method will set and adjust the relative volume
  of the sound(s) with corresponding ID(s).
Returns: the volume provided.

alert(meSpeak.getVolume()); // 0.5

meSpeak.getVolume( [id] );
Returns a volume level (0 <= v <= 1)
* if called without an argument, the method returns the global playback-volume.
* if called with an argument, the method will return the relative volume of the sound with
  the ID corresponding to the first argument.
  If no sound with a corresponding ID is found, the method will return 'undefined'.
var browserCanPlayWavFiles = meSpeak.canPlay(); // test for compatibility

// export speech-data as a stream (no playback):
var myUint8Array = meSpeak.speak('hello world', { 'rawdata': true }); // typed array
var base64String = meSpeak.speak('hello world', { 'rawdata': 'base64' });
var myDataUrl    = meSpeak.speak('hello world', { 'rawdata': 'data-url' });
var myArray      = meSpeak.speak('hello world', { 'rawdata': 'array' }); // simple array

// playing cached streams (any of the export formats):
meSpeak.play( stream [, relativeVolume [, callback]] );

var stream1 = meSpeak.speak('hello world', { 'rawdata': true });
var stream2 = meSpeak.speak('hello again', { 'rawdata': true });
var stream3 = meSpeak.speak('hello yet again', { 'rawdata': 'data-url' });
meSpeak.play(stream1);       // using global volume
meSpeak.play(stream2, 0.75); // 75% of global volume
meSpeak.play(stream3);       // v.1.4.2: play data-urls or base64-encoded

var id = meSpeak.play(stream1);
meSpeak.stop(id);

Arguments:
stream:   A stream in any of the formats returned by meSpeak.speak() with the "rawdata"-option.
volume:   (optional) Volume relative to the global volume (number, 0..1, default: 1)
callback: (optional) A callback function to be called after the sound output ended.
          The callback will be called with a single boolean argument indicating success.
          If the sound is stopped by meSpeak.stop(), the success-flag will be set to false.
          (See also: meSpeak.speak().)
Returns:  A 32bit integer ID greater than 0 (or 0 on failure).
          The ID may be used to stop this sound by calling meSpeak.stop(<id>).

meSpeak.stop( [<id-list>] );
Stops the sound(s) specified by the id-list.
If called without an argument, all sounds currently playing, processed, or queued are stopped.
Any callback(s) associated to the sound(s) will return false as the success-flag.

Arguments:
id-list: Any number of IDs returned by a call to meSpeak.speak() or meSpeak.play().
Returns: The number (integer) of sounds actually stopped.
Note on export formats, ArrayBuffer (typed array, default) vs. simple array:
The ArrayBuffer (8-bit unsigned) provides a stream ready to be played by the Web Audio API (as a value for a BufferSourceNode), while the plain array (JavaScript Array object) may be best for export (e.g. sending the data to Flash via Flash's ExternalInterface). The default raw format (ArrayBuffer) is the preferred format for caching streams to be played later by meSpeak by calling meSpeak.play(), since it provides the least overhead in processing.
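The caching pattern described in this note might be sketched as follows. This is only an illustration, not part of meSpeak.js: createSpeechCache() and its methods are hypothetical names, and the meSpeak object is passed in as `lib` so that the helper can also be tried with a stand-in object. It relies only on the documented behaviour of meSpeak.speak() with the "rawdata" option and of meSpeak.play().

```javascript
// Sketch (hypothetical helper): generate a stream once, replay it on demand.
// `lib` is expected to be the global meSpeak object.
function createSpeechCache(lib) {
  var streams = {}; // key (text + options) -> ArrayBuffer (default rawdata format)

  function keyFor(text, options) {
    return JSON.stringify([text, options || {}]);
  }

  // Generates the stream (no playback) unless it is cached already.
  // Returns true if a stream is available afterwards.
  function prepare(text, options) {
    var key = keyFor(text, options);
    if (!streams[key]) {
      var opts = Object.assign({}, options, { rawdata: true });
      var stream = lib.speak(text, opts);
      // speak() returns null if the required resources have not loaded yet
      if (stream) streams[key] = stream;
    }
    return !!streams[key];
  }

  // Plays the cached stream and returns the sound ID (0 if nothing was played).
  function play(text, options, relativeVolume, callback) {
    if (!prepare(text, options)) return 0;
    // the documented default for the relative volume is 1
    var volume = (relativeVolume === undefined) ? 1 : relativeVolume;
    return lib.play(streams[keyFor(text, options)], volume, callback);
  }

  return { prepare: prepare, play: play };
}
```

In a page with meSpeak.js loaded, this could be used as `var cache = createSpeechCache(meSpeak); cache.play('hello world', { speed: 150 }, 0.75);`. The ID returned by play() can be handed to meSpeak.stop() as usual.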
Using meSpeak.speakMultipart() you may mix multiple parts into a single utterance.
See the Multipart-Example for a demo.
The general form of meSpeak.speakMultipart() is analogous to meSpeak.speak(), but with an array of objects (the parts to be spoken) as the first argument (rather than a single text):
meSpeak.speakMultipart( <parts-array> [, <options-object> [, <callback-function> ]] );

meSpeak.speakMultipart(
  [
    { text: "text-1" [, <other options> ] },
    { text: "text-2" [, <other options> ] },
    ...
    { text: "text-n" [, <other options> ] }
  ],
  { option1: value1, option2: value2 .. },
  callback
);

Only the first argument is mandatory, any further arguments are optional.
The parts-array must contain at least one element (of type object).
For any other options refer to meSpeak.speak(). Any options supplied as the second argument will be used as defaults for the individual parts. (Same options provided with the individual parts will override these defaults.)
The method returns — like meSpeak.speak() — either an ID, or, if called with the "rawdata" option (in the general options / second argument), a stream-buffer representing the generated wav-file.
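As an illustration of the parts-array, here is a minimal sketch that builds the parts for a short dialogue from per-speaker options. buildParts() and the speakers table are hypothetical and not part of meSpeak.js; the options used ("variant", "pitch", "speed") are the ones documented for meSpeak.speak().

```javascript
// Hypothetical helper: turns [speaker, text] pairs into a parts-array
// suitable as the first argument of meSpeak.speakMultipart().
function buildParts(lines, speakers) {
  return lines.map(function (line) {
    // copy the speaker's options, so the speakers table is not modified
    var part = Object.assign({}, speakers[line[0]] || {});
    part.text = line[1];
    return part;
  });
}

var speakers = {
  alice: { variant: 'f2' },
  bob:   { variant: 'm3', pitch: 40 }
};

var parts = buildParts([
  ['alice', 'Hello Bob.'],
  ['bob',   'Hello Alice.']
], speakers);

// Only runs in a page where meSpeak.js, its config and a voice are loaded.
if (typeof meSpeak !== 'undefined') {
  // the second argument provides defaults for all parts
  meSpeak.speakMultipart(parts, { speed: 160 }, function (success) {
    console.log('multipart finished: ' + success);
  });
}
```

Options given with an individual part (here "variant" and "pitch") override the defaults in the second argument, as described above.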
iOS (currently supported only using Safari) provides a single audio-slot, playing only one sound at a time.
Thus, any concurrent calls to meSpeak.speak() or meSpeak.play() will stop any other sound playing.
Further, iOS reserves volume control to the user exclusively. Any attempt to change the volume by a script will remain without effect.
Please note that you still need a user-interaction at the very beginning of the chain of events in order to have a sound played by iOS.
The first set of options listed above corresponds directly to options of the espeak command. For details see the eSpeak command documentation.
The meSpeak.js-options and their espeak-counterparts are (meSpeak.speak() accepts both sets, but prefers the long form):
meSpeak.js | eSpeak |
amplitude | -a |
wordgap | -g |
pitch | -p |
speed | -s |
voice | -v |
variant | -v<voice>+<variant> |
utf16 | -b 4 (default: -b 1) |
linebreak | -l |
capitals | -k |
nostop | -z |
ssml | -m |
punct | --punct[="<characters>"] |
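To make the correspondence in the table concrete, the following sketch translates an options object into eSpeak-style command arguments. It is an illustration of the mapping only and not the code meSpeak.js uses internally; toEspeakArgs() is a hypothetical name, and the sketch simply ignores a "variant" given without a "voice" (where meSpeak.js would fall back to its default voice).

```javascript
// Long option name -> eSpeak flag for the plain "flag value" options in the table.
var ESPEAK_FLAGS = {
  amplitude: '-a', wordgap: '-g', pitch: '-p', speed: '-s',
  linebreak: '-l', capitals: '-k'
};

// Hypothetical helper: builds an argument list from meSpeak.js long-form options.
function toEspeakArgs(options) {
  var args = [];
  Object.keys(ESPEAK_FLAGS).forEach(function (name) {
    if (options[name] !== undefined) {
      args.push(ESPEAK_FLAGS[name], String(options[name]));
    }
  });
  if (options.voice) {
    // a variant is appended to the voice by "+"
    args.push('-v', options.variant ? options.voice + '+' + options.variant : options.voice);
  }
  // text-encoding: 4 for UTF-16, 1 (the default) for UTF-8
  args.push('-b', options.utf16 ? '4' : '1');
  if (options.nostop) args.push('-z');
  if (options.ssml) args.push('-m');
  if (options.punct === true) {
    args.push('--punct');
  } else if (typeof options.punct === 'string') {
    args.push('--punct="' + options.punct + '"');
  }
  return args;
}

console.log(toEspeakArgs({ speed: 150, voice: 'en', variant: 'f2', nostop: true }).join(' '));
```

The "log" option of meSpeak.speak() can be used to compare such a sketch against the command the library actually compiles.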
1) Config-data: "mespeak_config.json":
The config-file includes all data to configure the tone (e.g.: male or female) of the electronic voice.
{ "config": "<base64-encoded octet stream>", "phontab": "<base64-encoded octet stream>", "phonindex": "<base64-encoded octet stream>", "phondata": "<base64-encoded octet stream>", "intonations": "<base64-encoded octet stream>" }
Finally the JSON object may include an optional voice-object (see below), that will be set up together with the config-data:
{ ... "voice": { <voice-data> } }
2) Voice-data: "voice.json":
A voice-file includes the ids of the voice and the dictionary used by this voice, and the binary data of these two files.
{ "voice_id": "<voice-identifier>", "dict_id": "<dict-identifier>", "dict": "<base64-encoded octet stream>", "voice": "<base64-encoded octet stream>" }
Alternatively the value of "voice" may be a text-string, if an additional property "voice_encoding": "text" is provided.
This should allow for quick changes and testing:
{ "voice_id": "<voice-identifier>", "dict_id": "<dict-identifier>", "dict": "<base64-encoded octet stream>", "voice": "<text-string>", "voice_encoding": "text" }
Both config-data and voice-data may be loaded and switched on the fly to (re-)configure meSpeak.js.
For a guide to customizing languages and voices, see meSpeak – Voices & Languages.
In order to support Mbrola voices and other voices requiring a more flexible layout and/or additional data, there is also an extended voice format:
{ "voice_id": "<voice-identifier>", "voice": "<base64-encoded octet stream>", "files": [ { "path": "<rel-pathname>", "data": "<base64-encoded octet stream>" }, { "path": "<rel-pathname>", "data": "<text-string>", "encoding": "text" }, ... ] }
or (using a text-encoded voice-definition):
{ "voice_id": "<voice-identifier>", "voice": "<text-string>", "voice_encoding": "text", "files": [ { "path": "<rel-pathname>", "data": "<base64-encoded octet stream>" }, { "path": "<rel-pathname>", "data": "<text-string>", "encoding": "text" }, ... ] }
Only a valid voice-definition is required and optionally an array "files" which may be empty or contain any number of objects, containing a property "path" (relative file-path from the espeak-data-directory) and a property "data", containing the file (either as base64-encoded data or as plain text, if there is also an optional property "encoding": "text").
In order to facilitate the use of Mbrola voices, for any "voice_id" beginning with "mb/mb-" only the part following the initial "mb/" will be used as the internal identifier for the meSpeak.speak() method. (So any given voice_id "mb/mb-en1" will be translated to a voice "mb-en1" automatically. This applies to the speak-command only.)
Please don't ask for support on Mbrola voices (I don't have the faintest idea). Please refer to the Mbrola section of the eSpeak documentation for a guide to setting up the required files locally. It should be possible to load these into meSpeak.js using the "extended voice format", since you may put any additional payload into the files-array. Please mind that you will still require a text-to-phoneme translator as stated in the eSpeak documentation (this is out of the scope of meSpeak.js).
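The rules for the extended voice format can be sketched as a small sanity check to run on a voice object before writing it to a JSON file. Both functions are hypothetical helpers and not part of meSpeak.js; they only restate what is described above (a "voice_id" and a voice-definition, an optional "files" array of objects with "path" and "data", and the "mb/" prefix rule) and assume nothing about how the library itself validates its input.

```javascript
// Hypothetical helper: returns a list of problems found in an
// extended voice definition (an empty list, if none were found).
function checkExtendedVoice(def) {
  var problems = [];
  def = def || {};
  if (typeof def.voice_id !== 'string' || !def.voice_id) problems.push('missing "voice_id"');
  if (typeof def.voice !== 'string' || !def.voice) problems.push('missing "voice"');
  // "files" is optional and may be empty
  var files = (def.files === undefined) ? [] : def.files;
  if (!Array.isArray(files)) {
    problems.push('"files" must be an array');
    files = [];
  }
  files.forEach(function (file, i) {
    file = file || {};
    if (typeof file.path !== 'string' || !file.path) problems.push('files[' + i + ']: missing "path"');
    if (typeof file.data !== 'string') problems.push('files[' + i + ']: missing "data"');
  });
  return problems;
}

// Restates the Mbrola rule: "mb/mb-en1" is addressed as "mb-en1" in speak().
function speakVoiceId(voiceId) {
  return (voiceId.indexOf('mb/mb-') === 0) ? voiceId.slice(3) : voiceId;
}

console.log(speakVoiceId('mb/mb-en1')); // mb-en1
```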
In case that speak() is called before configuration and/or voice data has been loaded, the call will be deferred and executed after set up.
See this page for an example. You may reset the queue manually by calling
meSpeak.resetQueue();
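If you would rather wait for a voice explicitly than rely on deferred calls, the callback of meSpeak.loadVoice() can be used. The following is a minimal sketch: loadVoiceAndSpeak() is a hypothetical helper (not part of meSpeak.js), and the meSpeak object is passed in as `lib` so that the function can be tried with a stand-in object.

```javascript
// Hypothetical helper: loads a voice, then speaks the text using that voice.
// done(error, id) is called once the voice has loaded or failed to load.
function loadVoiceAndSpeak(lib, voiceUrl, text, options, done) {
  lib.loadVoice(voiceUrl, function (success, idOrReason) {
    if (!success) {
      // idOrReason is 'network error', 'data error' or 'file error'
      done(new Error('voice not loaded: ' + idOrReason), 0);
      return;
    }
    // on success, the second argument is the id of the voice
    var opts = Object.assign({}, options, { voice: idOrReason });
    // if the config-data is still missing, meSpeak.js defers this call itself
    var id = lib.speak(text, opts);
    done(null, id);
  });
}
```

In a page this might be called as `loadVoiceAndSpeak(meSpeak, 'fr.json', 'Bonjour', { speed: 160 }, function (err, id) { if (err) console.log(err.message); });`.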
There are now two separate parameters or options to control the volume of the spoken text: amplitude and volume.
While amplitude affects the generation of the sound stream by the TTS-algorithm, volume controls the playback volume of the browser. By the use of volume you can cache a generated stream and still provide an individual volume level at playback time. Please note that there is a global volume (controlled by setVolume()) and an individual volume level relative to the global one. Both default to 1 (max volume).
Please note that the Chinese voices only support Pinyin input (phonetic transcript like "zhong1guo2" for 中 + 国, China) for "zh", and a simple one-to-one translation from single Simplified Chinese characters or Jyutping romanised text for "zh-yue".
The eSpeak documentation provides the following notes:
*) zh (Mandarin Chinese):
This speaks Pinyin text and Chinese characters. There is only a simple one-to-one translation of Chinese characters to a single Pinyin pronunciation. There is no attempt yet at recognising different pronunciations of Chinese characters in context, or of recognising sequences of characters as "words". The eSpeak installation includes a basic set of Chinese characters. More are available in an additional data file for Mandarin Chinese at: http://espeak.sourceforge.net/data/.
**) zh-yue (Cantonese Chinese, Provisional):
Just a naive simple one-to-one translation from single Simplified Chinese characters to phonetic equivalents in Cantonese. There is limited attempt at disambiguation, grouping characters into words, or adjusting tones according to their surrounding syllables. This voice needs Chinese character to phonetic translation data, which is available as a separate download for Cantonese at: http://espeak.sourceforge.net/data/.
The voice can also read Jyutping romanised text.
For a simple zh-to-Pinyin translation in JavaScript see: http://www.masswerk.at/mespeak/zh-pinyin-translator.zip
(m)eSpeak internally produces wav-files, which are then played. Internet Explorer 10 supports typed arrays (which are required for the binary logic), but does not provide native playback of wav-files. To provide compatibility for this browser, you could try the experimental meSpeak Flash Fallback.
Download (all code under GPL): mespeak.zip
(v.1.9.7.1, last update: 2015-10-07 13:00 GMT)
/* Cross-Browser Web Audio API Playback With Chrome And Callbacks */

// alias the Web Audio API AudioContext-object
var aliasedAudioContext = window.AudioContext || window.webkitAudioContext;

// ugly user-agent-string sniffing
var isChrome = ((typeof navigator !== 'undefined') && navigator.userAgent &&
                navigator.userAgent.indexOf('Chrome') !== -1);
var chromeVersion = (isChrome)?
    parseInt( navigator.userAgent.replace(/^.*?\bChrome\/([0-9]+).*$/, '$1'), 10 ) : 0;

function playSound(streamBuffer, callback) {
    // set up a BufferSource-node
    var audioContext = new aliasedAudioContext();
    var source = audioContext.createBufferSource();
    source.connect(audioContext.destination);
    // since the ended-event isn't generally implemented,
    // we need to use the decodeAudioData()-method in order
    // to extract the duration to be used as a timeout-delay
    audioContext.decodeAudioData(streamBuffer, function(audioData) {
        // detect any implementation of the ended-event
        // Chrome added support for the ended-event lately,
        // but it's unreliable (doesn't fire every time)
        // so let's exclude it.
        if (!isChrome && source.onended !== undefined) {
            // we could also use "source.addEventListener('ended', callback, false)" here
            source.onended = callback;
        }
        else {
            var duration = audioData.duration;
            // convert to msecs
            // use a default of 1 sec, if we lack a valid duration
            var delay = (duration)? Math.ceil(duration * 1000) : 1000;
            setTimeout(callback, delay);
        }
        // finally assign the buffer
        source.buffer = audioData;
        // start playback for Chrome >= 32
        // please note that this would be without effect on iOS, since we're
        // inside an async callback and iOS requires direct user interaction
        if (chromeVersion >= 32) source.start(0);
    },
    function(error) { /* decoding-error-callback */ });

    // normal start of playback, this would be essentially autoplay
    // but is without any effect in Chrome 32
    // let's exclude Chrome 32 and higher to avoid any double calls anyway
    if (!isChrome || chromeVersion < 32) {
        if (source.start) {
            source.start(0);
        }
        else {
            source.noteOn(0);
        }
    }
}
speak.js is 100% clientside JavaScript. "speak.js" is a port of eSpeak, an open source speech synthesizer, which was compiled from C++ to JavaScript using Emscripten.
The project page and source code for this demo can be found here.
Note: There had been initially plans to merge this project with speak.js, but they somehow became stuck.
Browser requirements: