Microsoft: Studio9

Studio9 transformed Microsoft's fragmented voice synthesis tool into an intuitive platform that democratized high-quality audio production for developers. This project bridged complex AI technology with accessible design, reducing voice tuning time by 40% while establishing the UX foundation for Azure's voice services.

 

Role | UX Designer, Research Collaboration

Company | Microsoft

Timeline | 2017

Focus Areas | Voice Technology UX, Enterprise Tool Design, Workflow Optimization

Impact | 40% reduction in tuning time, 65% improvement in user task completion rate, adopted by 200+ internal & external developers, became foundation for Azure Cognitive Services voice tooling

 

Project Overview

Studio9 represents an early intersection of AI and design tooling: a sophisticated text-to-speech tuning platform that transforms raw text into high-quality audio through SSML (Speech Synthesis Markup Language) controls. While the underlying voice synthesis technology was powerful, the existing interface created significant friction for both internal Microsoft teams and third-party developers trying to produce professional-quality audio content.

This 3-month redesign transformed Studio9 from a fragmented, expert-only tool into an intuitive, scalable platform that democratized high-quality voice production. The project required deep collaboration with voice researchers and PMs to understand both the technical constraints and user workflows—experience that would later inform my approach to AI-assisted design tools.

The challenge wasn't just visual—it was systemic. How do you make complex voice tuning accessible without losing precision? How do you design for both 5-minute quick edits and 3-hour detailed production sessions? Most importantly, how do you create an interface that grows with Microsoft's expanding voice AI capabilities?

 

Background & Strategic Context

The Voice Technology Opportunity

In 2017, voice interfaces were rapidly gaining traction, but high-quality speech synthesis remained complex and time-intensive. Microsoft had developed sophisticated SSML-based voice models, but the tooling to harness them was fragmented and intimidating.

The Business Challenge:

  • Internal teams struggled with lengthy voice content production cycles

  • External developers faced steep learning curves, limiting platform adoption

  • Existing tools required expert-level SSML knowledge, creating bottlenecks

  • No unified workflow for iterative voice tuning and quality control

Target Users & Use Cases

Through stakeholder interviews and usage analytics, we identified three primary user segments:

First-Party Developers (60% of usage)

  • Quick prototyping for Cortana and Office voice features

  • Batch processing for accessibility features

  • Rapid iteration on voice UI concepts

Third-Party Developers (35% of usage)

  • Custom voice applications and experiences

  • Educational content and audiobook production

  • Voice branding and commercial audio projects

Voice Researchers (5% of usage)

  • Model testing and validation

  • Advanced SSML experimentation

  • Quality benchmarking across voice types

Each segment had distinct needs: speed vs. precision, guided workflows vs. expert controls, individual tasks vs. collaborative production.

 

Research

The existing Studio9 platform had been in use for three years, accumulating both user habits and systemic problems. Rather than immediately jumping to solutions, we invested significant time understanding the real friction points through multiple research methods:

Direct User Observation

  • Shadowed 8 frequent users during typical editing sessions

  • Recorded screen interactions to identify micro-inefficiencies

  • Mapped actual workflow patterns vs. intended user journeys

Comparative Analysis

  • Audited similar professional audio tools (Pro Tools, Audacity, Adobe Audition)

  • Analyzed Microsoft's own Office suite for interaction patterns and visual consistency

  • Studied emerging voice platforms for workflow innovations

Technical Constraint Mapping

  • Collaborated with voice engineering teams to understand SSML processing limitations

  • Identified opportunities where UX improvements could unlock technical capabilities

  • Documented performance bottlenecks that affected user experience

 

Problem Definition

① Flat Visual Hierarchy
All functions appeared equally important, making it difficult for users to understand workflow sequence or feature priority. Critical controls were visually identical to secondary options.

② Unclear Mental Model
Users couldn't predict what would happen when they made changes. The relationship between SSML parameters and audio output wasn't intuitive, leading to trial-and-error workflows.

③ Fragmented Information Architecture
Related functions were scattered across different areas. Users constantly switched contexts, losing track of their progress and breaking concentration during detailed tuning work.

④ Limited Scalability for Long Content
The interface was optimized for short phrases, but broke down when users needed to work with paragraphs or longer content. Real-world use cases demanded better content management.

 

Design Strategy & Approach

Reframing the Problem

Initially, stakeholders framed this as a "visual refresh" project. Through research, I reframed it as a workflow optimization challenge—the interface needed to match how people actually think about voice tuning, not just expose technical parameters.

This shift changed the scope of the work. Instead of reorganizing existing features, we redesigned the entire mental model around three core user activities:

  1. Content Input & Structure - Getting text ready for voice processing

  2. Voice Selection & Tuning - Choosing and customizing voice characteristics

  3. Quality Control & Export - Testing, refining, and outputting final audio

Design Principles

Progressive Disclosure
Surface complexity only when needed. New users see streamlined workflows; expert users can access advanced controls without friction.

Context Preservation
Users should never lose track of their progress or have to rebuild mental context after switching between functions.

Immediate Feedback
Every adjustment should provide clear audio and visual feedback, making the relationship between controls and output transparent.

Flexible Workflow Support
The interface adapts to both quick 5-minute edits and detailed 3-hour production sessions without compromising either experience.

 

Design Solution & Key Features

Unified Workspace Architecture

The redesigned Studio9 organizes all functionality around a central editing canvas, with contextual panels that appear based on user actions. This eliminates the fragmented "popup hell" of the original interface.

Primary Editing Zone

  • Large, scrollable text area optimized for long-form content

  • Inline SSML controls that appear contextually as users select text segments

  • Real-time waveform visualization showing voice characteristics

  • Integrated playback controls with segment-specific preview

Contextual Control Panels

  • Voice Library Panel: Organized by language, gender, and age, with audio previews

  • Tuning Panel: Visual sliders and controls for pitch, speed, and emphasis

  • Export Panel: Batch processing options and quality settings

Workflow-Driven Features

Smart Voice Recommendations
Instead of showing all available voices, the system suggests appropriate options based on content analysis and user history. Advanced users can still access the full library.

Visual SSML Editing
Complex SSML tags are represented as visual controls and inline annotations. Users can work visually while the system generates correct markup behind the scenes.
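To make the idea of generating markup from visual edits concrete, here is a minimal sketch in Python. It is not Studio9's actual implementation: the Annotation class and the to_ssml function are hypothetical names invented for illustration, and the sketch assumes annotated ranges never overlap. The speak, prosody, and emphasis elements and their attributes come from the W3C SSML specification.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape, quoteattr


@dataclass
class Annotation:
    """A visual edit applied to a character range of the plain text (hypothetical model)."""
    start: int
    end: int
    tag: str     # SSML element name, e.g. "prosody" or "emphasis"
    attrs: dict  # SSML attributes, e.g. {"pitch": "+10%"}


def to_ssml(text, annotations, lang="en-US"):
    """Wrap annotated ranges in SSML elements. Assumes ranges do not overlap."""
    parts = []
    cursor = 0
    for ann in sorted(annotations, key=lambda a: a.start):
        # Plain text between the previous annotation and this one.
        parts.append(escape(text[cursor:ann.start]))
        attr_str = "".join(f" {k}={quoteattr(v)}" for k, v in ann.attrs.items())
        inner = escape(text[ann.start:ann.end])
        parts.append(f"<{ann.tag}{attr_str}>{inner}</{ann.tag}>")
        cursor = ann.end
    parts.append(escape(text[cursor:]))
    body = "".join(parts)
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>{body}</speak>"
    )
```

In this sketch, selecting the word "world" in "Hello world" and applying strong emphasis would be stored as Annotation(6, 11, "emphasis", {"level": "strong"}), and the user would never see the resulting tags unless they asked for them.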

Batch Processing Pipeline
For users working with multiple files or long content, we introduced queue-based processing with progress tracking and error handling.
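A queue with progress tracking and per-item error handling can be sketched roughly as follows. This is an illustration only, not the production pipeline: run_batch, BatchResult, and the synthesize callback are hypothetical names, and a real system would likely process jobs asynchronously rather than in a simple loop.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class BatchResult:
    done: dict = field(default_factory=dict)    # job id -> synthesized audio
    failed: dict = field(default_factory=dict)  # job id -> error message


def run_batch(jobs, synthesize, on_progress=None):
    """Process (job_id, ssml) pairs in FIFO order.

    `synthesize` is a hypothetical callable that turns SSML into audio bytes.
    A failing job is recorded and the queue keeps going.
    """
    queue = deque(jobs)
    total = len(queue)
    result = BatchResult()
    while queue:
        job_id, ssml = queue.popleft()
        try:
            result.done[job_id] = synthesize(ssml)
        except Exception as exc:
            result.failed[job_id] = str(exc)
        if on_progress:
            # Report (completed, total) so the UI can draw a progress bar.
            on_progress(total - len(queue), total)
    return result
```

The design point is that one bad file should surface as an entry in the failed list that the user can retry, instead of aborting a long-running batch.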

Collaborative Review System
Teams can share tuning sessions with stakeholders, collecting feedback directly on specific audio segments without requiring full Studio9 access.

Technical Innovation Through UX

Working closely with the voice engineering team, we identified several areas where better UX could unlock existing technical capabilities:

Predictive Caching
The interface now pre-processes likely voice combinations, dramatically reducing wait times during iterative tuning.
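The general pattern behind this kind of caching can be sketched as below. It is a simplified illustration, not the engineering team's implementation: PreviewCache and the synthesize callable are hypothetical, the prefetch runs synchronously for clarity, and how "likely" voices are chosen is left to the caller.

```python
class PreviewCache:
    """Caches synthesized previews keyed by (text, voice)."""

    def __init__(self, synthesize):
        # Hypothetical callable: (text, voice) -> audio bytes.
        self._synthesize = synthesize
        self._cache = {}

    def prefetch(self, text, likely_voices):
        """Render previews ahead of time for voices the user may try next."""
        for voice in likely_voices:
            self.get(text, voice)

    def get(self, text, voice):
        """Return a cached preview, synthesizing only on a cache miss."""
        key = (text, voice)
        if key not in self._cache:
            self._cache[key] = self._synthesize(text, voice)
        return self._cache[key]
```

When the user then switches to a prefetched voice, the preview is served from the cache and no synthesis call is needed.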

Intelligent Defaults
Instead of requiring users to configure every parameter, the system learns from successful tuning sessions and applies appropriate starting points for similar content.
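One simple way to derive starting points from past sessions is to take the median of each parameter across successful sessions for the same kind of content. The sketch below assumes this approach purely for illustration; the actual learning method is not described here, and suggest_defaults, the session record shape, and the content type labels are all hypothetical.

```python
from statistics import median


def suggest_defaults(sessions, content_type, fallback):
    """Suggest starting values from past successful sessions of the same content type.

    Each session is assumed to be a dict with "content_type", "successful",
    and "settings" keys. Parameters with no history keep their fallback value.
    """
    matching = [
        s["settings"]
        for s in sessions
        if s["content_type"] == content_type and s["successful"]
    ]
    defaults = {}
    for param, fallback_value in fallback.items():
        values = [settings[param] for settings in matching if param in settings]
        defaults[param] = median(values) if values else fallback_value
    return defaults
```

The median is used here instead of the mean so that a single extreme session does not skew the suggested starting point.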
