Natural Language Voice Ordering

Every QSR chain in America is replacing humans with kiosks. They’re discovering a problem they didn’t expect: kiosks are slower.

Draft · POS Kiosk · Phase 1

Version: 0.1

Date: April 2026

Owner: Product

Why Current Kiosks Fail

A skilled cashier interprets “cheeseburger, no onions, add bacon” in under five seconds. A kiosk makes the customer find Burgers → Cheeseburger → Customize → Toppings → uncheck Onions → Add-ons → Bacon. That’s a navigation problem masquerading as a technology solution.

Kiosk menus are organized around restaurant operations—item categories, modifier trees, upsell flows—not around how customers actually order. Customers speak in natural shorthand. They reference items by nickname, bundle customizations into a single phrase, and expect the system to resolve ambiguity the way a person would.

The result: kiosk transactions take longer than counter transactions for any order above baseline complexity. This undercuts the core value proposition of kiosk deployment (throughput, labor cost reduction) and degrades customer satisfaction.

Root Causes

Menu taxonomy reflects kitchen logic, not customer mental models
Modifier selection requires multiple taps per customization
No recovery path for customers who can’t find an item
Upsell prompts interrupt flow rather than integrate naturally
No channel for customers to express a complete order at once

02 — OPPORTUNITY

The Hypothesis

An optional natural language voice interface—layered on top of the existing touch UI—lets customers order the way they talk. The system handles disambiguation through guided dialogue, not menu navigation. Customers who prefer touch keep it. Customers who want speed get it.

~40%: est. reduction in order time for complex orders

3–5×: more modifiers captured per voice order vs. touch

↑ ATV (average transaction value): natural upsell via conversational prompt

These are directional targets pending pilot data. The primary success metric is order completion time, not revenue: if voice ordering isn’t faster, the feature has failed its core purpose.

03 — SCOPE

Goals & Non-Goals

✓ In Scope

Voice-initiated order entry at kiosk
NLP resolution of item names, nicknames, and combos
Guided dialogue for required modifiers (size, protein, etc.)
Inline upsell via conversational prompt
Graceful fallback to touch UI at any point
Order confirmation screen before payment
Accessibility mode (optional voice for customers with motor impairments)

✗ Out of Scope (v1)

Multi-language support beyond English
Loyalty program integration via voice
Payment via voice
Drive-thru speaker integration
Mobile app or web ordering
Custom wake word training per brand
Real-time inventory awareness

04 — USER STORIES

Core Use Cases

As a…	I want to…	So that…	Priority
Customer	Say my full order at once	I don’t navigate menus at all	P1
Customer	Use item nicknames (“large fry,” “double double”)	I order the way I actually talk	P1
Customer	Be prompted only for missing required info	The system doesn’t ask what I already said	P1
Customer	Switch to touch at any point	I’m not locked into voice if it’s not working	P1
Customer	Hear a natural upsell prompt	I can add items without restarting	P2
Operator	Configure item aliases and nicknames	Voice matches our brand language	P2
Operator	Disable voice mode per kiosk	High-noise environments can opt out	P2
Operator	Review voice order transcripts and error rates	I can identify failure patterns and improve	P3

05 — FUNCTIONAL REQUIREMENTS

Feature Specification

5.1 Activation

Voice mode is opt-in; initiated by tapping a microphone button on the home screen
System greets with a single short prompt: “What can I get for you?”
Wake-word activation (e.g., “Hey [brand]”) is a v2 consideration

5.2 Intent Resolution

NLP layer resolves spoken input to menu items using semantic matching (not keyword matching)
Handles item aliases, common nicknames, and partial matches
Handles bundle orders (“number 3” or combo names)
Confidence threshold: items below threshold surface a disambiguation prompt, not an error
Disambiguation prompt offers 2–3 options max, displayed on screen and spoken
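
The threshold behavior in 5.2 can be sketched as follows. This is a minimal illustration, not the production design. It assumes an upstream semantic matcher already returns scored candidates. The names `Candidate`, `Resolution`, and `resolve`, the score scale, and the 0.80 threshold are all hypothetical; this PRD does not fix a threshold value.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.80  # assumed value; to be tuned against the test corpus
MAX_OPTIONS = 3              # 5.2: disambiguation offers 2-3 options max


@dataclass(frozen=True)
class Candidate:
    item_id: str
    score: float  # matcher confidence in [0, 1] (assumed scale)


@dataclass(frozen=True)
class Resolution:
    kind: str                  # "resolved" or "disambiguate"
    item_ids: tuple[str, ...]  # one item if resolved, else the options to offer


def resolve(candidates: list[Candidate]) -> Resolution:
    """Pick the top match, or fall back to a short disambiguation list."""
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    if ranked and ranked[0].score >= CONFIDENCE_THRESHOLD:
        return Resolution("resolved", (ranked[0].item_id,))
    # Below threshold: surface options on screen and by voice, not an error.
    options = tuple(c.item_id for c in ranked[:MAX_OPTIONS])
    return Resolution("disambiguate", options)
```

An empty option list (no candidates at all) would count as a failed resolution attempt under 5.6 rather than a disambiguation prompt.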

5.3 Guided Modifier Collection

System tracks which required modifiers are missing after initial utterance
Asks for missing info in a single compound question when the missing modifiers are closely related: “What size, and would you like that with cheese?”
Never re-asks for information already provided
Optional modifiers are not requested unless customer signals interest
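
One way to track missing required modifiers and build the follow-up question is sketched below. The menu data, the function names, and the cap of two modifiers per question are assumptions for illustration; real required-modifier definitions would come from the POS menu sync.

```python
from typing import Optional

# Hypothetical menu data: required modifiers per item, in the order to ask.
REQUIRED_MODIFIERS = {
    "combo": ["size", "drink"],
    "fries": ["size"],
    "cheeseburger": [],
}
MAX_PER_QUESTION = 2  # assumed cap so compound questions stay short


def missing_modifiers(item_id: str, provided: dict[str, str]) -> list[str]:
    """Required modifiers the customer has not supplied yet."""
    return [m for m in REQUIRED_MODIFIERS.get(item_id, []) if m not in provided]


def follow_up_prompt(item_id: str, provided: dict[str, str]) -> Optional[str]:
    """Build one question covering the missing modifiers, or None if complete."""
    missing = missing_modifiers(item_id, provided)
    if not missing:
        return None  # nothing to ask; never re-ask what was already said
    asked = missing[:MAX_PER_QUESTION]
    return f"What {' and '.join(asked)} would you like?"
```

Optional modifiers never appear in `REQUIRED_MODIFIERS`, so this sketch cannot prompt for them.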

5.4 Upsell Integration

One upsell prompt per transaction, offered after order is complete but before confirmation
Prompt is contextual: based on order content, not rotation logic (“Want to add a drink to that?”)
Customer can accept via voice or decline; no friction either way
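
The upsell rule (one prompt, chosen from order content) could look like the sketch below. The category names, the rule order, and the side prompt wording are hypothetical; only the drink prompt appears in this PRD.

```python
from typing import Optional

# Hypothetical rules: offer the first category that is missing from the cart.
UPSELL_RULES = [
    ("drink", "Want to add a drink to that?"),
    ("side", "Want to add a side to that?"),
]


def upsell_prompt(cart_categories: set[str], already_offered: bool) -> Optional[str]:
    """Return a contextual upsell prompt, or None if no prompt should be made."""
    if already_offered:
        return None  # 5.4: one upsell prompt per transaction
    for category, prompt in UPSELL_RULES:
        if category not in cart_categories:
            return prompt
    return None  # cart already covers every upsell category
```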

5.5 Confirmation & Handoff

System reads back complete order before confirmation
Customer confirms via voice (“Yes,” “That’s right”) or taps confirm on screen
Confirmed order passes to existing POS cart; payment flow is unchanged
Customer can request corrections before confirmation; system updates incrementally

5.6 Fallback & Error Handling

After two failed resolution attempts on any utterance, system surfaces touch UI for that item
Voice session remains active for remaining items
If microphone input drops or times out, system defaults to touch mode and saves current cart state
All failures are logged with session ID for analytics
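
The fallback rules above can be sketched as a small session object. The class, method names, return values, and logger name are assumptions for illustration. The log lines carry only the session ID and attempt count, in line with the privacy requirement that audio is not persisted.

```python
import logging

log = logging.getLogger("voice_session")  # logger name is an assumption
MAX_ATTEMPTS = 2  # 5.6: two failed resolution attempts on an utterance


class VoiceSession:
    def __init__(self, session_id: str) -> None:
        self.session_id = session_id
        self.cart: list[str] = []
        self.mode = "voice"
        self._failures = 0

    def on_resolved(self, item_id: str) -> None:
        """Add a resolved item and reset the failure counter."""
        self.cart.append(item_id)
        self._failures = 0

    def on_failed_resolution(self) -> str:
        """Return the next step: retry by voice, or touch for this item only."""
        self._failures += 1
        log.warning("resolution_failed session=%s attempt=%d",
                    self.session_id, self._failures)
        if self._failures >= MAX_ATTEMPTS:
            self._failures = 0  # voice stays active for the remaining items
            return "touch_for_item"
        return "retry_voice"

    def on_mic_lost(self) -> list[str]:
        """Mic drop or timeout: switch to touch and return the saved cart."""
        log.warning("mic_lost session=%s", self.session_id)
        self.mode = "touch"
        return list(self.cart)
```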

06 — UX PRINCIPLES

Design Constraints

The voice interface should feel faster than touch, not more clever. Every design decision is tested against one question: does this make ordering faster for the customer?

No menus in voice mode. The customer should never hear “say 1 for Burgers, say 2 for Sandwiches.”
One question at a time. Compound questions only when modifiers are closely related.
Visible progress. Screen shows cart building in real time as voice is processed.
Silence is not failure. Brief pauses don’t terminate the session; system waits for natural turn completion.
Zero dead ends. Any failure state has a graceful exit back to touch without losing cart.
Latency budget: 1.5s. Time from the end of the customer utterance to the system reply must be under 1.5 seconds at P95. Anything above 2 seconds feels broken.

07 — TECHNICAL REQUIREMENTS

System Constraints

Requirement	Specification
Speech-to-Text	On-device preferred for latency; cloud fallback acceptable. Must handle ambient restaurant noise (65–80 dB environments).
NLP Engine	LLM-backed intent + entity extraction, fine-tuned on menu corpus. Menu data synced from POS at publish time.
Latency	End-to-end voice response <1.5s P95. Touch-to-voice handoff <200ms.
Offline Mode	Core ordering must function if cloud NLP is unreachable; degrade to touch gracefully.
Audio Hardware	Directional microphone array required. Speaker for TTS response. Noise cancellation mandatory.
Privacy	Audio not persisted beyond session. Transcripts anonymized before logging. No biometric data captured.
Accessibility	WCAG 2.1 AA for visual elements. Voice modality as accessibility enhancement, not replacement.
POS Integration	Voice-built cart passes to existing cart/payment API unchanged. No new payment flows in v1.
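
For reference, the P95 latency requirement can be checked against logged samples as sketched below. The function names are hypothetical, and the nearest-rank percentile method is an assumption; this PRD does not say which percentile definition the analytics pipeline uses.

```python
LATENCY_BUDGET_MS = 1500  # end-to-end voice response budget at P95


def p95_ms(samples_ms: list[int]) -> int:
    """Nearest-rank 95th percentile of latency samples in milliseconds."""
    if not samples_ms:
        raise ValueError("no latency samples")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def within_budget(samples_ms: list[int]) -> bool:
    """True if the P95 latency is under the 1.5s budget."""
    return p95_ms(samples_ms) < LATENCY_BUDGET_MS
```

With 20 samples the rank is 19, so two slow responses out of 20 are enough to fail the budget.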

08 — SUCCESS METRICS

How We’ll Know It’s Working

Primary

Order completion time — voice vs. touch, controlled for order complexity. Target: voice ≤ touch at baseline; voice < touch for orders with 2+ modifiers.
Voice completion rate — % of voice sessions that result in a confirmed order without abandonment. Target: ≥75% in month 1, ≥85% by month 3.
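
"Controlled for order complexity" could be implemented by comparing voice and touch within complexity buckets, as in the sketch below. The bucket definition (baseline means fewer than two modifiers), the use of the median, and the function names are assumptions, not decisions recorded in this PRD.

```python
from collections import defaultdict
from statistics import median


def complexity_bucket(modifier_count: int) -> str:
    """Assumed split: fewer than two modifiers counts as baseline."""
    return "2+ modifiers" if modifier_count >= 2 else "baseline"


def median_times(
    orders: list[tuple[str, int, float]],
) -> dict[tuple[str, str], float]:
    """Median completion time in seconds, keyed by (bucket, mode).

    Each order is (mode, modifier_count, seconds); mode is "voice" or "touch".
    """
    groups = defaultdict(list)
    for mode, modifier_count, seconds in orders:
        groups[(complexity_bucket(modifier_count), mode)].append(seconds)
    return {key: median(values) for key, values in groups.items()}
```

The primary target then reads as two comparisons: voice at or below touch in the baseline bucket, and voice strictly below touch in the 2+ modifiers bucket.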

Secondary

Intent resolution accuracy (% of utterances resolved on first attempt). Target: ≥88%.
Modifier capture rate — voice vs. touch. Hypothesis: voice captures more optional modifiers.
Average transaction value, voice vs. touch cohorts.
Customer satisfaction score delta (voice vs. touch session, via post-order prompt).

Kill Switch Criteria

Voice completion rate <60% at 30-day pilot review → pause and reassess
P95 latency >2.5s in production → revert to touch-only pending fix
Any privacy/data incident → immediate suspension pending audit
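
The kill switch criteria map directly to a check like the one sketched here. The function name and action labels are hypothetical; the thresholds are the ones listed above, with the completion rate assumed to be measured at the 30-day pilot review.

```python
def kill_switch_actions(
    completion_rate: float,
    p95_latency_s: float,
    privacy_incident: bool,
) -> list[str]:
    """Actions triggered by the kill switch criteria; empty means continue."""
    actions = []
    if privacy_incident:
        actions.append("suspend_pending_audit")
    if p95_latency_s > 2.5:
        actions.append("revert_to_touch_only")
    if completion_rate < 0.60:  # measured at the 30-day pilot review
        actions.append("pause_and_reassess")
    return actions
```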

09 — PHASING

Rollout Plan

Phase	Scope	Duration	Gate
Alpha	Internal testing; simulated ordering with staff. Hardware validation.	4 weeks	Latency <1.5s P95; accuracy ≥85% on test corpus
Pilot	3–5 live locations. Limited daypart (lunch peak). Voice opt-in only.	6 weeks	Completion rate ≥75%; no P1 incidents
Limited GA	25% of estate. Full daypart coverage. A/B test voice vs. touch default.	8 weeks	Completion rate ≥82%; ATV neutral or positive
Full GA	Full rollout. Operator dashboard live. Multi-language scoping begins.	Ongoing	Sustained completion rate ≥85%

10 — RISKS & MITIGATIONS

Known Risks

Risk	Likelihood	Impact	Mitigation
Ambient noise degrades ASR accuracy	High	High	Directional mic array; noise floor testing required in Alpha; location-level disable flag
Customers distrust or avoid voice mode	Medium	Medium	Touch always available; voice is opt-in; staff briefed on coaching customers
Menu alias coverage gaps cause frequent disambiguation	Medium	High	Operator alias configuration tool; auto-learn common failed utterances post-pilot
Latency exceeds budget on cloud NLP path	Medium	High	On-device NLP for common items; cloud for edge cases; hard 2s timeout with touch fallback
Accessibility misuse (voice replaces touch for customers who need touch)	Low	Medium	Touch always present; voice never replaces, only augments

11 — OPEN QUESTIONS

Decisions Pending

Wake word vs. button activation. Button is safer for v1, but wake word creates a more natural experience. Decision needed before Alpha hardware spec is locked.
TTS voice selection. Brand-consistent synthetic voice vs. neutral. Requires brand/marketing sign-off.
Transcript retention policy. Legal/privacy review needed. Current assumption: zero retention beyond session.
Operator alias tooling. Build custom CMS or surface through existing menu management system? Engineering scoping required.
A/B test design for pilot. Voice opt-in vs. voice-default? Need agreement with analytics team before pilot launch.
Multi-language sequencing. Spanish is the obvious v2 target. Scoping timeline TBD post-GA data.

PRD — Natural Language Voice Ordering — v0.1 DRAFT April 2026 · Confidential