Microsoft: Studio9
Studio9 transformed Microsoft's fragmented voice synthesis tool into an intuitive platform that democratized high-quality audio production for developers. This project bridged complex AI technology with accessible design, reducing production time by 40% while establishing the UX foundation for Azure's voice services.
Role | UX Designer, Research Collaboration
Company | Microsoft
Timeline | 2017
Focus Areas | Voice Technology UX, Enterprise Tool Design, Workflow Optimization
Impact | 40% reduction in tuning time, 65% improvement in user task completion rate, adopted by 200+ internal & external developers, became foundation for Azure Cognitive Services voice tooling
Project Overview
Studio9 represents an early intersection of AI and design tooling—a sophisticated text-to-speech tuning platform that transforms raw text into high-quality audio through SSML controls. While the underlying voice synthesis technology was powerful, the existing interface created significant friction for both internal Microsoft teams and third-party developers trying to produce professional-quality audio content.
This 3-month redesign transformed Studio9 from a fragmented, expert-only tool into an intuitive, scalable platform that democratized high-quality voice production. The project required deep collaboration with voice researchers and PMs to understand both the technical constraints and user workflows—experience that would later inform my approach to AI-assisted design tools.
The challenge wasn't just visual—it was systemic. How do you make complex voice tuning accessible without losing precision? How do you design for both 5-minute quick edits and 3-hour detailed production sessions? Most importantly, how do you create an interface that grows with Microsoft's expanding voice AI capabilities?
Background & Strategic Context
The Voice Technology Opportunity
In 2017, voice interfaces were rapidly gaining traction, but high-quality speech synthesis remained complex and time-intensive. Microsoft had developed sophisticated SSML-based voice models, but the tooling to harness them was fragmented and intimidating.
The Business Challenge:
Internal teams struggled with lengthy voice content production cycles
External developers faced steep learning curves, limiting platform adoption
Existing tools required expert-level SSML knowledge, creating bottlenecks
No unified workflow for iterative voice tuning and quality control
Target Users & Use Cases
Through stakeholder interviews and usage analytics, we identified three primary user segments:
First-Party Developers (60% of usage)
Quick prototyping for Cortana and Office voice features
Batch processing for accessibility features
Rapid iteration on voice UI concepts
Third-Party Developers (35% of usage)
Custom voice applications and experiences
Educational content and audiobook production
Voice branding and commercial audio projects
Voice Researchers (5% of usage)
Model testing and validation
Advanced SSML experimentation
Quality benchmarking across voice types
Each segment had distinct needs: speed vs. precision, guided workflows vs. expert controls, individual tasks vs. collaborative production.
Research
The existing Studio9 platform had been in use for three years, accumulating both user habits and systemic problems. Rather than immediately jumping to solutions, we invested significant time understanding the real friction points through multiple research methods:
Direct User Observation
Shadowed 8 frequent users during typical editing sessions
Recorded screen interactions to identify micro-inefficiencies
Mapped actual workflow patterns vs. intended user journeys
Comparative Analysis
Audited similar professional audio tools (Pro Tools, Audacity, Adobe Audition)
Analyzed Microsoft's own Office suite for interaction patterns and visual consistency
Studied emerging voice platforms for workflow innovations
Technical Constraint Mapping
Collaborated with voice engineering teams to understand SSML processing limitations
Identified opportunities where UX improvements could unlock technical capabilities
Documented performance bottlenecks that affected user experience
Problem Definition
① Flat Visual Hierarchy
All functions appeared equally important, making it difficult for users to understand workflow sequence or feature priority. Critical controls were visually identical to secondary options.
② Unclear Mental Model
Users couldn't predict what would happen when they made changes. The relationship between SSML parameters and audio output wasn't intuitive, leading to trial-and-error workflows.
③ Fragmented Information Architecture
Related functions were scattered across different areas. Users constantly switched contexts, losing track of their progress and breaking concentration during detailed tuning work.
④ Limited Scalability for Long Content
The interface was optimized for short phrases, but broke down when users needed to work with paragraphs or longer content. Real-world use cases demanded better content management.
Design Strategy & Approach
Reframing the Problem
Initially, stakeholders framed this as a "visual refresh" project. Through research, I reframed it as a workflow optimization challenge—the interface needed to match how people actually think about voice tuning, not just expose technical parameters.
This shift changed everything. Instead of reorganizing existing features, we redesigned the entire mental model around three core user activities:
Content Input & Structure - Getting text ready for voice processing
Voice Selection & Tuning - Choosing and customizing voice characteristics
Quality Control & Export - Testing, refining, and outputting final audio
Design Principles
Progressive Disclosure
Surface complexity only when needed. New users see streamlined workflows; expert users can access advanced controls without friction.
Context Preservation
Users should never lose track of their progress or have to rebuild mental context after switching between functions.
Immediate Feedback
Every adjustment should provide clear audio and visual feedback, making the relationship between controls and output transparent.
Flexible Workflow Support
The interface adapts to both quick 5-minute edits and detailed 3-hour production sessions without compromising either experience.
Design Solution & Key Features
Unified Workspace Architecture
The redesigned Studio9 organizes all functionality around a central editing canvas, with contextual panels that appear based on user actions. This eliminates the fragmented "popup hell" of the original interface.
Primary Editing Zone
Large, scrollable text area optimized for long-form content
Inline SSML controls that appear contextually as users select text segments
Real-time waveform visualization showing voice characteristics
Integrated playback controls with segment-specific preview
Contextual Control Panels
Voice Library Panel: Organized by language, gender, age with audio previews
Tuning Panel: Visual sliders and controls for pitch, speed, emphasis
Export Panel: Batch processing options and quality settings
Workflow-Driven Features
Smart Voice Recommendations
Instead of showing all available voices, the system suggests appropriate options based on content analysis and user history. Advanced users can still access the full library.
Visual SSML Editing
Complex SSML tags are represented as visual controls and inline annotations. Users can work visually while the system generates correct markup behind the scenes.
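As an illustration, markup of the following kind is what a visual prosody or emphasis control would generate behind the scenes. This is a generic SSML sketch per the W3C specification; the voice name and parameter values are hypothetical examples, not actual Studio9 output:

```xml
<!-- Illustrative SSML; voice name and values are hypothetical -->
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-ExampleVoice">
    <prosody rate="slow" pitch="+2st">
      Welcome to the quarterly report.
    </prosody>
    <break time="400ms"/>
    <emphasis level="strong">Revenue grew twelve percent.</emphasis>
  </voice>
</speak>
```

A slider labeled "Pitch" that emits a relative `+2st` adjustment is far easier to reason about than hand-editing the nested tags above, which is precisely the gap the visual editor closes.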
Batch Processing Pipeline
For users working with multiple files or long content, we introduced queue-based processing with progress tracking and error handling.
Collaborative Review System
Teams can share tuning sessions with stakeholders, collecting feedback directly on specific audio segments without requiring full Studio9 access.
Technical Innovation Through UX
Working closely with the voice engineering team, we identified several areas where better UX could unlock existing technical capabilities:
Predictive Caching
The interface now pre-processes likely voice combinations, dramatically reducing wait times during iterative tuning.
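Conceptually, this amounts to a cache keyed by (text, voice, parameters) plus a prefetch step for combinations the user is likely to try next. The sketch below assumes a generic `synthesize` callable and illustrative parameter tuples; it is not the production design:

```python
from collections import OrderedDict

class SynthesisCache:
    """LRU cache keyed by (text, voice, params) so repeated previews are instant."""

    def __init__(self, synthesize, capacity: int = 128):
        self._synthesize = synthesize  # hypothetical expensive TTS call
        self._cache: OrderedDict = OrderedDict()
        self._capacity = capacity

    def get(self, text: str, voice: str, params: tuple) -> bytes:
        key = (text, voice, params)
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as recently used
            return self._cache[key]
        audio = self._synthesize(text, voice, params)
        self._cache[key] = audio
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict least recently used
        return audio

    def prefetch(self, text: str, voice: str, likely_params: list) -> None:
        """Warm the cache with parameter combinations the user may try next."""
        for params in likely_params:
            self.get(text, voice, params)
```

With the cache warmed during idle time, the preview a user requests after nudging a slider can often be served without a synthesis round-trip.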
Intelligent Defaults
Instead of requiring users to configure every parameter, the system learns from successful tuning sessions and applies appropriate starting points for similar content.
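One simple way to realize this idea is to aggregate parameters from past successful sessions per content category and use the medians as starting points. The sketch below is an assumption about the mechanism, not a description of the shipped system; the category labels and parameter names are illustrative:

```python
from collections import defaultdict
from statistics import median

class DefaultsModel:
    """Suggest starting tuning parameters learned from successful sessions.

    Categories and parameter names are illustrative placeholders.
    """

    def __init__(self):
        self._history = defaultdict(list)  # category -> list of param dicts

    def record_success(self, category: str, params: dict) -> None:
        self._history[category].append(params)

    def suggest(self, category: str, fallback: dict) -> dict:
        sessions = self._history.get(category)
        if not sessions:
            return dict(fallback)  # nothing learned yet: use static defaults
        # median of each parameter across successful sessions
        return {k: median(s[k] for s in sessions) for k in fallback}
```

A new "news narration" project would then open with pitch and rate values centered on what worked before, rather than factory defaults the user has to re-tune every time.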