Request for Comments: 4463                          Cisco Systems, Inc.
Category: Informational                                       P. Monaco
                                                  Nuance Communications
                                                             B. Eberman
                                                       Speechworks Inc.
                                                             April 2006
A Media Resource Control Protocol (MRCP)
Developed by Cisco, Nuance, and Speechworks
Status of This Memo
This memo provides information for the Internet community. It does
not specify an Internet standard of any kind. Distribution of this
memo is unlimited.
Copyright Notice
Copyright (C) The Internet Society (2006).
IESG Note
This RFC is not a candidate for any level of Internet Standard. The
IETF disclaims any knowledge of the fitness of this RFC for any
purpose and in particular notes that the decision to publish is not
based on IETF review for such things as security, congestion control,
or inappropriate interaction with deployed protocols. The RFC Editor
has chosen to publish this document at its discretion. Readers of
this document should exercise caution in evaluating its value for
implementation and deployment. See RFC 3932 for more information.
Note that this document uses a MIME type ’application/mrcp’ which has
not been registered with the IANA, and is therefore not recognized as
a standard IETF MIME type. The historical value of this document as
an ancestor to ongoing standardization in this space, however, makes
the publication of this document meaningful.
Abstract
This document describes a Media Resource Control Protocol (MRCP) that
was developed jointly by Cisco Systems, Inc., Nuance Communications,
and Speechworks, Inc. It is published as an RFC as input for further
IETF development in this area.
MRCP controls media service resources like speech synthesizers,
recognizers, signal generators, signal detectors, fax servers, etc.,
over a network. This protocol is designed to work with streaming
protocols like RTSP (Real Time Streaming Protocol) or SIP (Session
Initiation Protocol), which help establish control connections to
external media streaming devices, and media delivery mechanisms like
RTP (Real-time Transport Protocol).
Table of Contents
1. Introduction
2. Architecture
   2.1. Resources and Services
   2.2. Server and Resource Addressing
3. MRCP Protocol Basics
   3.1. Establishing Control Session and Media Streams
   3.2. MRCP over RTSP
   3.3. Media Streams and RTP Ports
4. Notational Conventions
5. MRCP Specification
   5.1. Request
   5.2. Response
   5.3. Event
   5.4. Message Headers
6. Media Server
   6.1. Media Server Session
7. Speech Synthesizer Resource
   7.1. Synthesizer State Machine
   7.2. Synthesizer Methods
   7.3. Synthesizer Events
   7.4. Synthesizer Header Fields
   7.5. Synthesizer Message Body
   7.6. SET-PARAMS
   7.7. GET-PARAMS
   7.8. SPEAK
   7.9. STOP
   7.10. BARGE-IN-OCCURRED
   7.11. PAUSE
   7.12. RESUME
   7.13. CONTROL
   7.14. SPEAK-COMPLETE
   7.15. SPEECH-MARKER
8. Speech Recognizer Resource
   8.1. Recognizer State Machine
   8.2. Recognizer Methods
   8.3. Recognizer Events
   8.4. Recognizer Header Fields
   8.5. Recognizer Message Body
   8.6. SET-PARAMS
   8.7. GET-PARAMS
   8.8. DEFINE-GRAMMAR
   8.9. RECOGNIZE
   8.10. STOP
   8.11. GET-RESULT
   8.12. START-OF-SPEECH
   8.13. RECOGNITION-START-TIMERS
   8.14. RECOGNITION-COMPLETE
   8.15. DTMF Detection
9. Future Study
10. Security Considerations
11. RTSP-Based Examples
12. Informative References
Appendix A. ABNF Message Definitions
Appendix B. Acknowledgements
1. Introduction
The Media Resource Control Protocol (MRCP) is designed to provide a
mechanism for a client device requiring audio/video stream processing
to control processing resources on the network. These media
processing resources may be speech recognizers (a.k.a. Automatic-
Speech-Recognition (ASR) engines), speech synthesizers (a.k.a. Text-
To-Speech (TTS) engines), fax, signal detectors, etc. MRCP allows
implementation of distributed Interactive Voice Response platforms,
for example VoiceXML [6] interpreters. The MRCP protocol defines the
requests, responses, and events needed to control the media
processing resources. The MRCP protocol defines the state machine
for each resource and the required state transitions for each request
and server-generated event.
The MRCP protocol does not address how the control session is
established with the server and relies on the Real Time Streaming
Protocol (RTSP) [2] to establish and maintain the session. The
session control protocol is also responsible for establishing the
media connection from the client to the network server. The MRCP
protocol and its messaging are designed to be carried over RTSP or
another protocol as a MIME type, similar to the Session Description
Protocol (SDP) [5].
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [8].
2. Architecture
The system consists of a client that needs media streams generated or
processed and a server that has the resources or devices to process
or generate those streams. The client establishes a control session
with the server for media processing using a protocol such as RTSP.
That protocol also sets up the RTP stream between the client and the
server or another RTP endpoint. Each resource needed to process or
generate the stream is addressed by a URL. Once the session is
established, the client can use MRCP messages to control the media
resources and affect how they process or generate the media stream.
|--------------------|
||------------------||                   |----------------------|
|| Application Layer||                   ||--------------------||
||------------------||                   || TTS  | ASR  | Fax  ||
||  ASR/TTS API     ||                   ||Plugin|Plugin|Plugin||
||------------------||                   ||  on  |  on  |  on  ||
||    MRCP Core     ||                   || MRCP | MRCP | MRCP ||
||  Protocol Stack  ||                   ||--------------------||
||------------------||                   ||   RTSP Stack       ||
||   RTSP Stack     ||                   ||                    ||
||------------------||                   ||--------------------||
||   TCP/IP Stack   ||========IP=========||  TCP/IP Stack      ||
||------------------||                   ||--------------------||
|--------------------|                   |----------------------|
     MRCP client                         Real-time Streaming MRCP
                                               media server
2.1. Resources and Services
The server is set up to offer a certain set of resources and services
to the client. These resources are of three types.
Transmission Resources
These are resources that are capable of generating real-time streams,
like signal generators that generate tones and sounds of certain
frequencies and patterns, and speech synthesizers that generate
spoken audio streams, etc.
Reception Resources
These are resources that receive and process streaming data like
signal detectors and speech recognizers.
Dual Mode Resources
These are resources that both send and receive data like a fax
resource, capable of sending or receiving fax through a two-way RTP
stream.
2.2. Server and Resource Addressing
The server as a whole is addressed using a container URL, and the
individual resources the server has to offer are reached by
individual resource URLs within the container URL.
RTSP Example:
A media server or container URL like,
rtsp://mediaserver.com/media/
may contain one or more resource URLs of the form,
rtsp://mediaserver.com/media/speechrecognizer/
rtsp://mediaserver.com/media/speechsynthesizer/
rtsp://mediaserver.com/media/fax/
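The container and resource URL relationship above can be illustrated with a minimal sketch. The helper name resource_url is hypothetical and not part of MRCP; it only shows that each resource URL is the container URL extended by one path segment.

```python
def resource_url(container_url: str, resource_name: str) -> str:
    """Return the URL of a named resource inside a container URL.

    Hypothetical helper: MRCP does not define this function, only the
    container/resource URL layout that it reproduces.
    """
    return container_url.rstrip("/") + "/" + resource_name.strip("/") + "/"


container = "rtsp://mediaserver.com/media/"
for name in ("speechrecognizer", "speechsynthesizer", "fax"):
    print(resource_url(container, name))
```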
3. MRCP Protocol Basics
The message format for MRCP is text based, with mechanisms to carry
embedded binary data. This allows data like recognition grammars,
recognition results, synthesizer speech markup, etc., to be carried
in the MRCP message between the client and the server resource. The
protocol does not address session control management, media
management, reliable sequencing and delivery, or server or resource
addressing. These are left to a protocol like SIP or RTSP. MRCP
addresses the issue of controlling and communicating with the
resource processing the stream, and defines the requests, responses,
and events needed to do that.
3.1. Establishing Control Session and Media Streams
The control session between the client and the server is established
using a protocol like RTSP. This protocol will also set up the
appropriate RTP streams between the server and the client, allocating
ports and setting up transport parameters as needed. Each control
session is identified by a unique session-id. The format, usage, and
life cycle of the session-id are in accordance with the RTSP protocol.
The resources within the session are addressed by the individual
resource URLs.
The MRCP protocol is designed to work with and tunnel through another
protocol like RTSP, and augment its capabilities. MRCP relies on
RTSP headers for sequencing, reliability, and addressing to make sure
that messages get delivered reliably and in the correct order and to
the right resource. The MRCP messages are carried in the RTSP
message body. The media server delivers the MRCP message to the
appropriate resource or device by looking at the session-level
message headers and URL information. Another protocol, such as SIP
[4], could be used for tunneling MRCP messages.
3.2. MRCP over RTSP
RTSP supports both TCP and UDP mechanisms for the client to talk to
the server; the two are differentiated by the RTSP URL. All MRCP-based
media servers MUST support TCP for transport and MAY support UDP.
In RTSP, the ANNOUNCE method/response MUST be used to carry MRCP
request/responses between the client and the server. MRCP messages
MUST NOT be communicated in the RTSP SETUP or TEARDOWN messages.
Currently all RTSP messages are request/responses and there is no
support for asynchronous events in RTSP. This is because RTSP was
designed to work over TCP or UDP and, hence, could not assume
reliability in the underlying protocol. Hence, when using MRCP over
RTSP, an asynchronous event from the MRCP server is packaged in a
server-initiated ANNOUNCE method/response communication. A future
RTSP extension to send asynchronous events from the server to the
client would provide an alternate vehicle to carry such asynchronous
MRCP events from the server.
An RTSP session is created when an RTSP SETUP message is sent from
the client to a server and is addressed to a server URL or any one of
its resource URLs without specifying a session-id. The server will
establish a session context and will respond with a session-id to the
client. This sequence will also set up the RTP transport parameters
between the client and the server, and then the server will be ready
to receive or send media streams. If the client wants to attach an
additional resource to an existing session, the client should send
that session’s ID in the subsequent SETUP message.
When a media server implementing MRCP over RTSP receives a PLAY,
RECORD, or PAUSE RTSP method addressed to an MRCP resource URL, it
should respond with an RTSP 405 "Method Not Allowed" response. For these
resources, the only allowed RTSP methods are SETUP, TEARDOWN,
DESCRIBE, and ANNOUNCE.
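The method restriction above could be enforced with a check along the following lines. This is a sketch only: the function name rejection_for is hypothetical, and a real server would also echo the CSeq header and other required response fields.

```python
from typing import Optional

# RTSP methods that this document allows on an MRCP resource URL.
ALLOWED_RTSP_METHODS = frozenset({"SETUP", "TEARDOWN", "DESCRIBE", "ANNOUNCE"})


def rejection_for(method: str) -> Optional[str]:
    """Return an RTSP status line rejecting `method`, or None if allowed.

    Hypothetical helper for illustration; it covers only the status
    line, not a complete RTSP response.
    """
    if method in ALLOWED_RTSP_METHODS:
        return None
    return "RTSP/1.0 405 Method Not Allowed"
```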
Example 1:
C->S: ANNOUNCE rtsp://media.server.com/media/synthesizer RTSP/1.0
      CSeq:4
      Session:12345678
      Content-Type:application/mrcp
      Content-Length:223

      SPEAK 543257 MRCP/1.0
      Voice-gender:neutral
      Voice-category:teenager
      Prosody-volume:medium
      Content-Type:application/synthesis+ssml
      Content-Length:104

      <?xml version="1.0"?>
      <speak>
       <paragraph>
         <sentence>You have 4 new messages.</sentence>
         <sentence>The first is from <say-as
         type="name">Stephanie Williams</say-as>
         and arrived at <break/>
         <say-as type="time">3:45pm</say-as>.</sentence>
         <sentence>The subject is <prosody
         rate="-20%">ski trip</prosody></sentence>
       </paragraph>
      </speak>

S->C: RTSP/1.0 200 OK
      CSeq: 4
      Session:12345678
      RTP-Info:url=rtsp://media.server.com/media/synthesizer;
               seq=9810092;rtptime=3450012
      Content-Type:application/mrcp
      Content-Length:52

      MRCP/1.0 543257 200 IN-PROGRESS

S->C: ANNOUNCE rtsp://media.server.com/media/synthesizer RTSP/1.0
      CSeq:6
      Session:12345678
      Content-Type:application/mrcp
      Content-Length:123

      SPEAK-COMPLETE 543257 COMPLETE MRCP/1.0

C->S: RTSP/1.0 200 OK
      CSeq:6
For the sake of brevity, most examples from here on show only the
MRCP messages and do not show the RTSP message and headers in which
they are tunneled. RTSP messages, such as responses, that do not
carry an MRCP message are also left out.
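The tunneling used in this section can be sketched as a small function that wraps an already formatted MRCP message in an RTSP ANNOUNCE request. The function name wrap_in_announce is hypothetical, and the sketch assumes that Content-Length counts the octets of the encoded MRCP body; it is an illustration, not a complete RTSP client.

```python
def wrap_in_announce(url: str, cseq: int, session: str, mrcp_message: str) -> bytes:
    """Wrap an MRCP message in an RTSP ANNOUNCE request.

    Hypothetical helper: the RTSP headers are the ones used by the
    examples in this section, and Content-Length is assumed to be the
    octet length of the UTF-8 encoded MRCP message.
    """
    body = mrcp_message.encode("utf-8")
    head = (
        f"ANNOUNCE {url} RTSP/1.0\r\n"
        f"CSeq: {cseq}\r\n"
        f"Session: {session}\r\n"
        "Content-Type: application/mrcp\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"  # empty line ends the RTSP headers
    )
    return head.encode("ascii") + body


request = wrap_in_announce(
    "rtsp://media.server.com/media/synthesizer",
    5,
    "12345678",
    "STOP 543258 MRCP/1.0\r\n\r\n",
)
print(request.decode("utf-8"))
```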
3.3. Media Streams and RTP Ports
A single set of RTP/RTCP ports is negotiated and shared between the
MRCP client and server when multiple media processing resources, such
as automatic speech recognition (ASR) engines and text to speech
(TTS) engines, are used for a single session. The individual
resource instances allocated on the server under a common session
identifier will feed from/to that single RTP stream.
The client can send multiple media streams towards the server,
differentiated by using different synchronization source (SSRC)
identifier values. Similarly, the server can use multiple SSRC
identifier values to differentiate media streams originating from the
individual transmission resource URLs if more than one exists. The
individual resources may, on the other
hand, work together to send just one stream to the client. This is
up to the implementation of the media server.
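As an illustration of telling streams apart by SSRC, the sketch below reads the SSRC from the fixed 12-octet RTP header and groups packets by it. The function names are hypothetical, and header extensions, CSRC lists, and RTCP are deliberately ignored.

```python
import struct
from collections import defaultdict


def rtp_ssrc(packet: bytes) -> int:
    """Return the SSRC field of the fixed 12-octet RTP header."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the fixed header")
    # Octets 0-1: flags and payload type; 2-3: sequence number;
    # 4-7: timestamp; 8-11: SSRC. All fields are in network byte order.
    _flags, _seq, _timestamp, ssrc = struct.unpack("!HHII", packet[:12])
    return ssrc


def group_by_ssrc(packets):
    """Group RTP packets into per-stream lists keyed by SSRC."""
    streams = defaultdict(list)
    for packet in packets:
        streams[rtp_ssrc(packet)].append(packet)
    return dict(streams)
```

A receiver that shares one RTP port among several resources could use such a grouping to hand each stream to the resource it belongs to; how SSRC values map to resources is left to the implementation.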
4. Notational Conventions
Since many of the definitions and syntax are identical to HTTP/1.1,
this specification only points to the section where they are defined
rather than copying it. For brevity, [HX.Y] refers to Section X.Y of
the current HTTP/1.1 specification (RFC 2616 [1]).
All the mechanisms specified in this document are described in both
prose and an augmented Backus-Naur form (ABNF) similar to that used
in [H2.1]. ABNF is described in detail in RFC 4234 [3].
The ABNF provided along with the descriptive text is informative in
nature and may not be complete. The complete message format in ABNF
form is provided in Appendix A and is the normative format
definition.
5. MRCP Specification
The MRCP PDU is textual using an ISO 10646 character set in the UTF-8
encoding (RFC 3629 [12]) to allow many different languages to be
represented. However, to assist in compact representations, MRCP
also allows other character sets such as ISO 8859-1 to be used when
desired. The MRCP protocol headers and field names use only the
US-ASCII subset of UTF-8. Internationalization only applies to
certain fields like grammar, results, speech markup, etc., and not to
MRCP as a whole.
Lines are terminated by CRLF, but receivers SHOULD be prepared to
also interpret CR and LF by themselves as line terminators. Also,
some parameters in the PDU may contain binary data or a record
spanning multiple lines. Such fields have a length value associated
with the parameter, which indicates the number of octets immediately
following the parameter.
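A tolerant receiver for the line-termination rule above might split header text as follows. This is a sketch with a hypothetical function name; it is meant for the textual header portion only, since length-prefixed data must be read by octet count rather than by line.

```python
import re

# CRLF must be tried first so that it is not split into CR and LF.
_LINE_END = re.compile(r"\r\n|\r|\n")


def split_lines(text: str) -> list:
    """Split text on CRLF, bare CR, or bare LF."""
    return _LINE_END.split(text)


print(split_lines("SPEAK 543257 MRCP/1.0\r\nVoice-gender:neutral\nProsody-volume:medium"))
```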
The whole MRCP PDU is encoded in the body of the session level
message as a MIME entity of type application/mrcp. The individual
MRCP messages do not have addressing information regarding which
resource the request/response are to/from. Instead, the MRCP message
relies on the header of the session level message carrying it to
deliver the request to the appropriate resource, or to figure out who
the response or event is from.
The MRCP message set consists of requests from the client to the
server, responses from the server to the client and asynchronous
events from the server to the client. All these messages consist of
a start-line, one or more header fields (also known as "headers"), an
empty line (i.e., a line with nothing preceding the CRLF) indicating
the end of the header fields, and an optional message body.
generic-message = start-line
                  message-header
                  CRLF
                  [ message-body ]
message-body    = *OCTET
start-line      = request-line / status-line / event-line
The message-body contains resource-specific and message-specific data
that needs to be carried between the client and server as a MIME
entity. The information contained here and the actual MIME-types
used to carry the data are specified later when addressing the
specific messages.
If a message contains data in the message body, the header fields
will contain content-headers indicating the MIME-type and encoding of
the data in the message body.
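The generic message layout can be illustrated with a minimal parser that separates the start-line, the header fields, and the body. The sketch makes several assumptions that go beyond the text: it accepts only CRLF line endings, it lowercases header names for lookup, and it uses Content-Length to bound the body. The function name parse_message is hypothetical.

```python
def parse_message(data: bytes):
    """Split an MRCP message into (start_line, headers, body).

    Sketch under stated assumptions: strict CRLF line endings, header
    names lowercased for lookup, body bounded by Content-Length.
    """
    head, separator, rest = data.partition(b"\r\n\r\n")
    if not separator:
        raise ValueError("no empty line terminating the header fields")
    lines = head.decode("utf-8").split("\r\n")
    start_line, header_lines = lines[0], lines[1:]
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    length = int(headers.get("content-length", 0))
    return start_line, headers, rest[:length]
```

The returned start-line can then be matched against the request, response, or event line formats defined in the following sections.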
5.1. Request
An MRCP request consists of a Request line followed by zero or more
parameters as part of the message headers and an optional message
body containing data specific to the request message.
The Request message from a client to the server includes, within the
first line, the method to be applied, a method tag for that request,
and the version of protocol in use.
request-line = method-name SP request-id SP
               mrcp-version CRLF
The request-id field is a unique identifier created by the client and
sent to the server. The server resource should use this identifier
in its response to this request. If the request does not complete
with the response, future asynchronous events associated with this
request MUST carry the request-id.
request-id = 1*DIGIT
The method-name field identifies the specific request that the client
is making to the server. Each resource supports a certain list of
requests or methods that can be issued to it, and will be addressed
in later sections.
method-name = synthesizer-method
            / recognizer-method
The mrcp-version field is the MRCP protocol version that is being
used by the client.
mrcp-version = "MRCP" "/" 1*DIGIT "." 1*DIGIT
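The request-line grammar above can be sketched as a formatter and a parser. The function names are hypothetical, and the pattern used for the method name (uppercase letters and hyphens) is an assumption based on the method names listed in this document; Appendix A remains the normative definition.

```python
import re

# method-name SP request-id SP mrcp-version, without the trailing CRLF.
# The method-name pattern is an assumption, not taken from the ABNF.
_REQUEST_LINE = re.compile(r"([A-Z-]+) (\d+) (MRCP/\d+\.\d+)")


def format_request_line(method: str, request_id: int, version: str = "MRCP/1.0") -> str:
    """Build a request-line, including the terminating CRLF."""
    return f"{method} {request_id} {version}\r\n"


def parse_request_line(line: str):
    """Return (method_name, request_id, mrcp_version) for a request-line."""
    match = _REQUEST_LINE.fullmatch(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a request-line: {line!r}")
    return match.group(1), int(match.group(2)), match.group(3)
```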
5.2. Response
After receiving and interpreting the request message, the server
resource responds with an MRCP response message. It consists of a
status line optionally followed by a message body.
response-line = mrcp-version SP request-id SP status-code SP
                request-state CRLF
The mrcp-version field used here is similar to the one used in the
Request Line and indicates the version of MRCP protocol running on
the server.
The request-id used in the response MUST match the one sent in the
corresponding request message.
The status-code field is a 3-digit code representing the success or
failure or other status of the request.
The request-state field indicates if the job initiated by the Request
is PENDING, IN-PROGRESS, or COMPLETE. The COMPLETE status means that
the Request was processed to completion and that there will be no
more events from that resource to the client with that request-id.
The PENDING status means that the job has been placed on a queue and
will be processed in first-in-first-out order. The IN-PROGRESS
status means that the request is being processed and is not yet
complete. A PENDING or IN-PROGRESS status indicates that further
Event messages will be delivered with that request-id.
request-state = "COMPLETE"
              / "IN-PROGRESS"
              / "PENDING"
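The response line and the meaning of the request-state values can be sketched as follows. The function names are hypothetical; the sketch checks only the syntax shown above and does not interpret individual status codes.

```python
import re

# mrcp-version SP request-id SP status-code SP request-state
_RESPONSE_LINE = re.compile(
    r"(MRCP/\d+\.\d+) (\d+) (\d{3}) (COMPLETE|IN-PROGRESS|PENDING)"
)


def parse_response_line(line: str):
    """Return (mrcp_version, request_id, status_code, request_state)."""
    match = _RESPONSE_LINE.fullmatch(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a response line: {line!r}")
    version, request_id, status_code, state = match.groups()
    return version, int(request_id), int(status_code), state


def expects_events(request_state: str) -> bool:
    """True if further events will be delivered with this request-id."""
    return request_state in ("PENDING", "IN-PROGRESS")
```

A client could use expects_events to decide whether to keep the request-id in its table of outstanding requests after the response arrives.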
5.2.1. Status Codes
The status codes are classified under the Success(2XX) codes and the
