Build Smarter Engagements with Audio Connector for Vonage Video API

This article was updated on August 15, 2023

Let's look at how Vonage Audio Connector enables raw audio access from live Vonage video sessions, unlocking numerous possibilities for enhanced customer experiences.

Image of a tablet with a doctor on screen and a consultation transcript with text bubbles on the right

What’s the Benefit of Accessing Raw Media?

In the post-pandemic era, organizations are increasingly focused on enhancing customer experiences through innovative technology. Customers will determine if they’ll do repeat business with your company based on their communications experiences. According to the latest Vonage Global Customer Engagement Report, 75% of customers won’t make a repeat purchase after a poor experience, while 58% will spread the word about good experiences to family and friends.

Increasingly, organizations are turning to artificial intelligence capabilities that promote greater customer engagement with features like live captions, transcriptions, translations, sentiment analysis, post-call summaries, and gaze and posture detection.

To embark on the AI journey, companies first need to secure access to raw media, such as audio, video, or text. Obtaining this access can be challenging and can require significant development effort, but it is crucial to unlocking the full potential of AI analysis and processing.


Simplifying Access to Media

At Vonage, we streamline media capture. We grant customers access to raw media and provide them with the option to send it to their preferred services for analysis or processing. Here are the three most common types of media you may want to access.

Audio Access: Post-call audio access can enable a variety of features such as transcription, translation, sentiment analysis, summary, Electronic Health Records (EHR), and media intelligence. Real-time audio access enables live captions, live translation, live sentiment analysis, noise suppression, and echo cancellation. 

Video Access: Post-call video access can enable video analysis for surveillance, video editing, education and training, and possible legal evidence. Real-time video access enables remote monitoring and live object/action detection. 

Text Access: Access to text media can be used for Q&A or for sharing links or documents during a conference call. This information can be stored or processed in real time or after the call to build search and indexing and media intelligence.


Introducing Vonage Audio Connector

Vonage Audio Connector enables customers to natively extract raw audio streams from live video sessions and send them to speech recognition services, such as AWS Transcribe, Google Speech to Text, and Azure Speech to Text. This enables real-time and offline processing of audio streams, allowing customers to build captions, transcriptions, translations, search and indexing, content moderation, media intelligence, Electronic Health Records (EHR), sentiment analysis, and more.

Audio Connector illustration: audio streams flow from a Vonage Video client through the Vonage Video media router to a WebSocket server.

What is Unique About the Audio Connector Native Audio Capture?

Vonage customers can already get access to raw audio from WebRTC clients. Here is a blog post explaining how to build Live Captions by capturing the audio on the client side. With Audio Connector, you can now capture audio from the media router itself.

What’s the advantage of accessing raw audio from the server side instead of the client side? We’re glad you asked! Client side audio capture needs development effort on every platform, while server side audio capture needs it only on the server. With client side capture, the audio stream must also be sent both to the CPaaS servers and to the audio processing service, which uses twice the bandwidth. Server side audio capture reduces the burden on the client by forwarding the audio from the server, so it avoids using extra audio bandwidth on the client side.

Client side solutions also can’t capture audio from SIP dial-in participants, but server side audio capture can collect that media. 
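
The bandwidth point above can be made concrete with a little arithmetic. The sketch below compares one client's audio uplink under the two approaches. The 40 kbps bitrate is an illustrative assumption, not a Vonage figure, and the sketch assumes both copies of the stream are sent at the same bitrate.

```python
def client_uplink_kbps(audio_kbps: float, client_side_capture: bool) -> float:
    """Audio uplink used by one client.

    With client side capture the client sends its audio twice: once to the
    CPaaS servers and once to the audio processing service. With server side
    capture it sends the audio once and the server forwards a copy.
    """
    copies = 2 if client_side_capture else 1
    return audio_kbps * copies


# Assumed bitrate, for illustration only.
AUDIO_KBPS = 40.0

client_side = client_uplink_kbps(AUDIO_KBPS, client_side_capture=True)
server_side = client_uplink_kbps(AUDIO_KBPS, client_side_capture=False)
print(f"client side capture: {client_side} kbps uplink")
print(f"server side capture: {server_side} kbps uplink")
```

Whatever the real bitrate is, the ratio stays the same under these assumptions: the client's audio uplink doubles when it has to feed the processing service itself.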


Product Features | Client Side Audio Capture | Server Side Audio Capture
Device support | Development effort needed for each device platform (SDK) | Native support for all devices
Audio traffic handling | Audio stream is created on the client side and is sent to the CPaaS servers AND the audio processing service; this results in an increased burden on the client | Audio stream is created on the client side and is sent to the CPaaS servers; servers forward to the audio processing service
Bandwidth usage | |
Working with firewalls | Challenging; third-party service might be blocked on the client side | Third-party services are connected from the server side
Cost of ownership | Cost of the audio processing service | Cost of the audio processing service and server utilization

Using Audio Connector

With Audio Connector, Vonage customers simply set a WebSocket URL for the video session and decide if they want to send a single audio stream per WebSocket or multiple audio streams per WebSocket. 

The single stream allows the customer application to identify the speaker. This is important in healthcare conversations when it’s necessary to differentiate between doctor and patient audio. If identifying the speaker is not a concern, customers can send multiple streams per WebSocket connection.  

Once you have Audio Connector, your application will be able to:

  • Set a preferred WS(S) URL to which the audio streams are sent
  • Set single or multiple streams per WS
  • Identify the speakers (with single stream only)
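
The choices in the list above might be modeled in application code along these lines. This is a minimal sketch, not the Audio Connector API: `plan_connections`, `ConnectionPlan`, the example URL, and the stream IDs are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional
from urllib.parse import urlparse


@dataclass
class ConnectionPlan:
    """One WebSocket connection and the audio streams it would carry."""
    websocket_url: str
    stream_ids: List[str]
    speaker: Optional[str]  # known only when the connection carries one stream


def plan_connections(
    websocket_url: str,
    speakers_by_stream: Dict[str, str],
    single_stream: bool,
) -> List[ConnectionPlan]:
    """Hypothetical helper: decide which streams go over which connection."""
    scheme = urlparse(websocket_url).scheme
    if scheme not in ("ws", "wss"):
        raise ValueError(f"expected a ws:// or wss:// URL, got {websocket_url!r}")

    if single_stream:
        # One connection per stream, so each connection maps to one speaker.
        return [
            ConnectionPlan(websocket_url, [stream_id], speaker)
            for stream_id, speaker in speakers_by_stream.items()
        ]
    # All streams share one connection, so no single speaker can be attached.
    return [ConnectionPlan(websocket_url, list(speakers_by_stream), None)]


streams = {"stream-1": "doctor", "stream-2": "patient"}  # made-up stream IDs
per_speaker = plan_connections("wss://example.com/audio", streams, single_stream=True)
mixed = plan_connections("wss://example.com/audio", streams, single_stream=False)
```

In this sketch, `per_speaker` holds one plan per participant with the speaker attached, while `mixed` holds a single plan whose `speaker` is `None`, mirroring the trade-off between the two modes.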

Once the stream is captured, your application can then connect to any third-party conversational AI provider. One of the major providers can fulfill most use cases, but smaller providers can also serve specialty use cases.  
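
One way to keep the provider choice open is to hide it behind a small interface. The sketch below assumes the audio arrives in the application as chunks of bytes. `SpeechProvider`, `EchoProvider`, and `forward_audio` are hypothetical names, and the stand-in provider only reports chunk sizes rather than calling a real speech service.

```python
from typing import Iterable, List, Protocol


class SpeechProvider(Protocol):
    """Hypothetical interface that a provider adapter would implement."""

    def process(self, speaker: str, chunk: bytes) -> str: ...


class EchoProvider:
    """Stand-in provider: reports what it received instead of transcribing."""

    def process(self, speaker: str, chunk: bytes) -> str:
        return f"{speaker}: {len(chunk)} bytes"


def forward_audio(
    provider: SpeechProvider, speaker: str, chunks: Iterable[bytes]
) -> List[str]:
    """Send each non-empty audio chunk to the provider and collect the results."""
    return [provider.process(speaker, chunk) for chunk in chunks if chunk]


results = forward_audio(EchoProvider(), "doctor", [b"\x00" * 320, b"", b"\x00" * 160])
print(results)
```

Swapping providers then means writing another class with the same `process` method, leaving the code that receives the audio unchanged.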


Audio Connector Use Cases

Audio Connector has a wide variety of use cases. Here are some representative examples.

Use Case | Description
Live captions | Provide automated live captions in a video call
Transcription | Create a live/offline record of the conversations
Electronic Health Records (EHR) | Build EHRs based on the doctor’s speech
Translation | Live/offline translations for accessibility and comprehension
Search & index | Save keywords for indexing and searching the content
Content moderation | Monitor conversations for obscene or unacceptable content
Media intelligence | Extract important action points or summarize meetings
Sentiment analysis | Live/offline analysis of the reactions of the speakers

Get started with Vonage Audio Connector

Vonage Audio Connector enables raw audio access from live Vonage video sessions, unlocking numerous possibilities for enhanced customer experiences. With Audio Connector, you can level up your customer interactions while reducing developer workload.

Try Vonage Audio Connector


By Binoy Chemmagate, Senior Product Manager, Vonage Video API

Binoy is a Senior Product Manager for Vonage Video API, with over 10 years of telecom and real-time communication experience as an engineer, researcher, and product manager. Binoy leads the AI initiatives and works towards bringing AI insights to every customer interaction. He lives in Helsinki, Finland, and loves to do product mentoring in his free time.
