Sarcasm-as-a-Service: Five Years Later

Rory Preddy, Author, Travelling

TL;DR: Demo app: https://aka.ms/ttsazure, Repo: https://github.com/roryp/ttsazure

I am 4 foot 1, and I have achondroplasia (dwarfism). Following spinal surgery, I was in a coma for three months and woke up unable to speak. I longed to tell my family that I loved them. That experience changed everything. A voice is not just sound; it is identity, emotion, and memory.

As a Developer Advocate at Microsoft, my role blends equal parts influencer and engineer, with a constant supply of questions from younger generations. When I first presented this talk in 2019, my career took off—audiences laughed at my demos, my jokes, and occasionally at me. Many assumed I was joking when I promised to revisit the project. Yet here I am, in a country where perpetual load shedding inspired this talk’s subtitle: “The power is out again. Fantastic.”

It has been five years since I first delivered my talk Sarcasm-as-a-Service and launched a demo that explored whether machines could be taught to use sarcasm. I would like to say that I now fully understand sarcasm; however, that would be untrue. What I have truly learned is that sarcasm remains messy, emotional, and deeply human. My task continues to be demonstrating how machines might help us experience and convey it.

Image, “Sarcasm is hard to detect”

Why Does Sarcasm Matter?

Since my first attempt at teaching machines sarcasm, I have spent considerable time advocating for AI accessibility and reflecting on human communication. Sarcasm is far more than a source of cheap humor; it functions as an encrypted form of communication. Through word choice, tone, and context, sarcasm encodes meaning that only those attuned to it can fully appreciate.

Inclusive design, https://accessibility.blog.gov.uk

For individuals with disabilities—particularly those who struggle to read facial expressions or rely on assistive technologies—sarcasm can be one of the few expressive tools available. In my work, I often share the stories of Ron, an 82-year-old learning technology for the first time; Ashley, a developer who reads code with a screen reader; and Claudia, who depends on a screen magnifier. They represent just a fraction of the billion people worldwide living with disabilities. As a developer of tools that run on 95% of the world’s desktops, I owe them more than token gestures—I owe them technology that speaks their language, including its sarcasm.

Secret Service Sarcasm Detector: https://web.archive.org/web/20140604004533/https://www.fbo.gov/?s=opportunity&mode=form&id=8aaf9a50dd4558899b0df22abc31d30e&tab=core&_cview=0

Sarcasm is hard to grasp; even governments have grappled with its complexity. In 2014, the U.S. Secret Service sought a social-media analytics system capable of detecting sarcastic posts. The rationale was clear: misreading sarcasm in a threat could result in catastrophic consequences. If the federal government struggles to distinguish sarcasm from seriousness, imagine the challenge for a smart speaker.

Opportunities and Risks

Sarcasm can be powerful, but it is not always appropriate. In my daily work, I encounter many potential applications: injecting humor into developer advocacy posts, adding wit to customer support systems, designing code review bots with a playful tone, or bringing comic relief to training modules. Yet sarcasm is also easy to misuse. Machines without context can transform support interactions into unintentional insults.

Tay AI, https://en.wikipedia.org/wiki/Tay_(chatbot)

The most infamous example is Microsoft’s Tay, a Twitter bot that quickly began parroting offensive content. That episode highlighted the critical importance of ethical guardrails in large language models.

Revisiting the Original Demo

DeepMoji, https://www.media.mit.edu/projects/deepmoji/overview/

In 2019, I built a sarcastic bot using the tools of the time: vector space models, recurrent neural networks (RNNs), and the DeepMoji dataset. DeepMoji uses a Twitter corpus to pair text with emojis in order to classify sarcasm. The results were more comedic than rigorous.

One enduring example was the phrase: “The electricity is off again. Oh joy!”—a sentiment instantly familiar to South Africans. The dataset paired such statements with eye-rolling emojis, but the machine never understood the underlying despair. Context was absent; sentiment was reduced to emoji counts.

Why is sarcasm so hard to grasp? Largely because it is often confused with irony or insult. Irony occurs when outcomes are opposite to expectations—for instance, a billboard promising “Complete home repairs” placed on a derelict shack. Sarcasm is biting irony, derived from the Greek sarkasmós, meaning “to tear flesh” or “sneer.” It delivers harsh criticism cloaked in pleasant words.

Machines likely struggle because sarcasm operates on multiple layers: the literal text, cultural and situational context, and tone. For example, if my wife suggests lunges to stay in shape and I reply, “That’s a big step forward,” the pun is obvious in text, but the sarcasm comes alive through tone—something a screen reader or machine cannot capture.

Prosody and Brain Chemistry

Prosody dimensions

Capturing sarcasm requires modeling prosody—the “music” of speech. Prosody includes pitch, duration, loudness, and timbre, which together shape intonation and stress. A slight rise in pitch may signal excitement, a pause may convey doubt, and timbre may reflect joy or sadness. These subtle cues help listeners decode sarcasm.
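These prosody dimensions map almost directly onto speech-synthesis markup. As a minimal sketch (the voice name and attribute values here are illustrative, not taken from the demo), SSML’s `<prosody>` element lets you nudge pitch, rate (duration), and volume (loudness) around a phrase:

```java
// Sketch: expressing prosody dimensions as an SSML <prosody> element.
// Voice name and attribute values are illustrative choices, not the demo's.
public class ProsodySketch {
    static String sarcasticSsml(String text) {
        return """
            <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
              <voice name="en-US-JennyNeural">
                <prosody pitch="+10%" rate="-15%" volume="soft">{text}</prosody>
              </voice>
            </speak>
            """.replace("{text}", text);
    }

    public static void main(String[] args) {
        // A slower, slightly higher, softer delivery of a flat statement
        System.out.println(sarcasticSsml("Oh joy, the power is out again."));
    }
}
```

Timbre is the one dimension markup cannot reach; it lives in the voice model itself.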

Why does this matter? Because laughter is not merely sound but chemistry. A well-timed sarcastic remark triggers a neurochemical response: reduced cortisol (stress), alongside dopamine, oxytocin, and endorphins that foster trust and bonding. Machines must therefore do more than imitate words; they must convincingly stimulate this human chemistry.

Recent Advances and the New Sarcastic Soundboard

Fast-forward to 2025: the past five years have brought remarkable progress in large language models and text-to-speech systems. Models such as Azure OpenAI’s GPT-Audio can now generate expressive speech. Building on this, I developed a new sarcastic soundboard using Azure Container Apps 👉 https://aka.ms/ttsazure (see Appendix A for a technical deep dive).

The web app allows users to enter text, choose a “vibe” (Excited, Calm, Sad, Sarcastic), and select from premium voices. The backend, built with Spring Boot on Java 21, integrates with Azure OpenAI’s GPT-Audio model. It supports multiple formats, streaming, and secure authentication through Azure Managed Identity. Safety features include rate limits, logging, and safeguards against misuse.

Conclusion

Microsoft Pilot with Brian Jeansonne, https://youtu.be/5FWwM1S8RfE

Are we there yet? The answer is both yes and no. Machines still lack true understanding; they miss cultural nuance, context, and the unspoken smirk. Today’s models can detect sarcasm, generate witty retorts, and mimic prosody convincingly enough to trigger human responses. They can restore voices once thought lost.

Microsoft’s pilot with Team Gleason helped Brian Jeansonne, living with ALS, explore what an Azure custom neural voice can do to bring more of his personality into everyday conversations. In one case, hearing Brian’s preserved voice say “Hey, Christy, I love you” moved me to tears. This is not merely technology for humor—it is technology for dignity, connection, and humanity.

As I conclude, the power goes out again—oh joy. Five years later, I remain cautiously optimistic. Machines may never truly be sarcastic, but they can certainly help us communicate and preserve what makes us human. If that isn’t fantastic, I don’t know what is.

Appendix A – Demo Deep Dive

Let’s dive into my demo 👉 a modern, interactive text-to-speech application built with Spring Boot and Azure OpenAI’s GPT-Audio model. It can transform any text into natural-sounding speech with multiple voice options, customizable styles, and advanced tone guidance.

1) Azure Developer CLI (azd) foundation

First, let’s get everything ready by cloning the GitHub repo:

git clone https://github.com/roryp/ttsazure.git
cd ttsazure

Next, we’ll use the Azure Developer CLI (azd), which orchestrates the demo app’s end-to-end deployment from a single command:

azd up

azd reads the project’s azure.yaml and coordinates both infrastructure provisioning and app deployment.
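For orientation, an azd project file for a containerized Java app might look roughly like this (a minimal sketch; the service name and path here are illustrative, not copied from the repo):

```yaml
# Minimal sketch of an azd project file for a containerized Java service.
# Service name and project path are illustrative assumptions.
name: ttsazure
services:
  web:
    project: .
    language: java
    host: containerapp
```

From this file, azd knows which Bicep templates to provision and which service to build and deploy.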

2) Infrastructure provisioning (infra/main.bicep)

Digging deeper, we see azd uses a single Bicep template to create all the cloud resources required by the demo:

// Core infrastructure components
resource managedIdentity 'Microsoft.ManagedIdentity/userAssignedIdentities@2023-01-31'
resource openAiAccount 'Microsoft.CognitiveServices/accounts@2024-10-01' 
resource containerRegistry 'Microsoft.ContainerRegistry/registries@2023-07-01'
resource containerAppsEnvironment 'Microsoft.App/managedEnvironments@2024-03-01'
resource containerApp 'Microsoft.App/containerApps@2024-03-01'

Key Infrastructure Components created:

  1. Azure OpenAI Service with the GPT-Audio model
  2. Azure Container Registry for secure image storage
  3. Azure Container Apps environment for serverless container hosting
  4. Managed Identity for keyless authentication
  5. RBAC Role Assignments for secure resource access

Region & security

  • Deploys to East US 2 for model availability.
  • Enforces role-based access control boundaries across all resources.

3) Zero-Secret Security Architecture

With the infrastructure in place, the application authenticates without API keys using Azure AD and Managed Identity. The service layer acquires an access token at runtime and forwards it to Azure OpenAI:

@Service
public class OpenAIService {
    private final TokenCredential credential;
    private final HttpClient httpClient = HttpClient.newHttpClient();
    private String endpoint; // Azure OpenAI endpoint, set from app configuration

    public OpenAIService() {
        this.credential = new DefaultAzureCredentialBuilder().build();
    }

    public byte[] generateSpeech(String text, String voice, String style, String format)
            throws IOException, InterruptedException {
        // Get an access token for Azure Cognitive Services via Managed Identity
        TokenRequestContext tokenRequestContext = new TokenRequestContext()
                .addScopes("https://cognitiveservices.azure.com/.default");
        AccessToken token = credential.getToken(tokenRequestContext).block();

        // Use the token in the Authorization header (request body shown in section 5.2)
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(endpoint))
                .header("Authorization", "Bearer " + token.getToken())
                .build();
        return httpClient.send(request, HttpResponse.BodyHandlers.ofByteArray()).body();
    }
}

4) Vibe System Architecture

On top of raw TTS, the demo layers expressive voice “vibes.” VibeService.java loads 12 predefined styles from vibes.json—including “Sarcastic.”

{
  "name": "Sarcastic",
  "description": "Voice Affect: Dry, pointed, subtly mocking; embody sarcasm and irony.\n\nTone: Sardonic, witty, slightly condescending; convey clever mockery.\n\nPacing: Normal with strategic emphasis on ironic points and subtle pauses.\n\nEmotion: Cleverly sarcastic; express wit and subtle mockery.",
  "script": "Oh sure, because that's exactly what everyone was thinking..."
}
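At its core, the vibe lookup can be as simple as a name-keyed registry. Here is a minimal sketch of that idea (class, record, and method names are my own, not necessarily those in VibeService.java; two vibes are inlined rather than loaded from vibes.json to keep the example self-contained):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a vibe registry. The demo loads 12 styles from vibes.json;
// here two entries are hard-coded so the example stands alone.
public class VibeRegistry {
    public record Vibe(String name, String instructions, String sampleScript) {}

    private final Map<String, Vibe> vibes = new LinkedHashMap<>();

    public VibeRegistry() {
        register(new Vibe("Sarcastic",
                "Voice Affect: Dry, pointed, subtly mocking; embody sarcasm and irony.",
                "Oh sure, because that's exactly what everyone was thinking..."));
        register(new Vibe("Excited",
                "Voice Affect: Energetic and enthusiastic.",
                "This is going to be amazing!"));
    }

    private void register(Vibe vibe) {
        // Case-insensitive keys so "sarcastic" and "Sarcastic" both resolve
        vibes.put(vibe.name().toLowerCase(), vibe);
    }

    // Returns the vibe for a name, or null if unknown.
    public Vibe find(String name) {
        return name == null ? null : vibes.get(name.toLowerCase());
    }
}
```

The vibe’s `instructions` string is what ultimately gets passed to the model as style guidance (section 5.2).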

5) Audio Generation Pipeline

From request to playback, the flow runs through a single controller – TtsController.java.

5.1 Input Validation and Rate Limiting

@PostMapping("/tts")
public String generateTts(@RequestParam String text, @RequestParam String voice, 
                         @RequestParam String style, @RequestParam String format) {
    // Validate input length (max 4,000 characters)
    if (text.length() > 4000) {
        text = text.substring(0, 4000);
    }
    
    // Rate limiting check (10 requests/minute, 100/hour, 50k characters/hour)
    if (!rateLimitService.isAllowed(clientId, text.length())) {
        // Return rate limit error
    }
}

5.2 Azure OpenAI synthesis

The OpenAIService.java handles the actual speech synthesis:

public byte[] generateSpeech(String text, String voice, String style, String format) {
    // Prepare request body
    Map<String, Object> requestBody = new HashMap<>();
    requestBody.put("model", "gpt-audio");
    requestBody.put("input", processTextForSpeech(text));
    requestBody.put("voice", voice); // 11 available voices
    requestBody.put("response_format", format); // mp3, wav, opus
    
    if (style != null && !style.trim().isEmpty()) {
        requestBody.put("instructions", style.trim()); // Vibe instructions
    }
    
    // Make API call to Azure OpenAI
    String url = String.format("%s/openai/v1/chat/completions", endpoint);
    
    HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(url))
            .header("Authorization", "Bearer " + token.getToken())
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(requestBodyJson))
            .build();
    
    return httpClient.send(request, HttpResponse.BodyHandlers.ofByteArray()).body();
}

5.3 Audio Storage and Retrieval

Generated audio is stored via AudioStore.java and streamed back through the controller:

@GetMapping("/audio/{id}")
public ResponseEntity<byte[]> streamAudio(@PathVariable String id, 
                                         @RequestParam boolean download,
                                         @RequestParam String format) {
    byte[] audioData = audioStore.retrieve(id);
    
    HttpHeaders headers = new HttpHeaders();
    headers.setContentType(MediaType.valueOf("audio/mpeg")); // MP3 format
    headers.set("Accept-Ranges", "bytes");
    headers.set("Cache-Control", "no-cache, no-store, must-revalidate");
    
    if (download) {
        headers.setContentDispositionFormData("attachment", "audio_" + id + ".mp3");
    }
    
    return ResponseEntity.ok().headers(headers).body(audioData);
}
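The store behind this endpoint can be very small. A minimal in-memory version might look like the sketch below (my own names; the repo’s AudioStore.java may add expiry or eviction, which is omitted here):

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of an in-memory audio store keyed by random IDs.
// No eviction policy is shown; a real store would bound its size.
public class InMemoryAudioStore {
    private final Map<String, byte[]> byId = new ConcurrentHashMap<>();

    // Saves the audio bytes and returns the generated ID.
    public String store(byte[] audioData) {
        String id = UUID.randomUUID().toString();
        byId.put(id, audioData.clone()); // defensive copy
        return id;
    }

    // Returns the stored bytes, or null for an unknown ID.
    public byte[] retrieve(String id) {
        return byId.get(id);
    }

    public int size() {
        return byId.size();
    }
}
```

Returning an ID instead of raw bytes keeps the `/tts` response small and lets the `<audio>` element fetch and range-request the clip separately.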

6) End-to-end: creating the sarcastic MP3

With the pipeline wired, generating a sarcastic clip is a straight path from UI to audio.

Screenshot of the demo

6.1 Frontend interaction (index.html)

  • Choose a voice: Alloy, Ash, Ballad, Coral, Echo, Fable, Nova, Onyx, Sage, Shimmer, Verse.
  • Pick a vibe: click Sarcastic to auto-populate style and sample text.
  • Edit text (≤ 4,000 characters).
  • Select format (MP3 by default).
  • Generate: click Generate Voice.

6.2 Backend Processing

In the backend, the TtsController.java now generates the sarcastic speech audio using the chosen style, stores it, and returns a unique audio ID to the frontend.

// TtsController processes the request
@PostMapping("/tts")
public String generateTts(@RequestParam String text, @RequestParam String voice,
                          @RequestParam String style, @RequestParam String format,
                          Model model) throws Exception {

    // Generate speech with the sarcastic style instructions
    byte[] audioData = openAIService.generateSpeech(text, voice,
        "Voice Affect: Dry, pointed, subtly mocking...", "mp3");

    // Store the audio under a unique ID
    String audioId = audioStore.store(audioData);

    // Return success with the audio ID for the frontend
    model.addAttribute("audioId", audioId);
    model.addAttribute("success", "Voice generated successfully!");
    return "index"; // render the page template with the new audio ID
}

6.3 Playback & download

The index.html now renders audio controls with download capability:

<audio controls autoplay>
    <source th:src="@{'/audio/' + ${audioId} + '?format=mp3'}" type="audio/mpeg">
</audio>
<a th:href="@{'/audio/' + ${audioId} + '?download=true&format=mp3'}" 
   download="sarcastic-audio.mp3" class="download-btn">
   Download MP3
</a>

7) Rate Limiting and Performance

One of the concerns with AI models is abuse, so I added rate limiting through RateLimitService.java:

  • 10 requests per minute per IP address
  • 100 requests per hour per IP address
  • 50,000 characters per hour per IP address
  • Configurable limits via environment variables
  • Graceful degradation with clear error messages

8) Container Apps Deployment Architecture

To handle scaling, I edited the main.bicep config to deploy the app to Azure Container Apps with health probes and autoscaling:

resource containerApp 'Microsoft.App/containerApps@2024-03-01' = {
  properties: {
    template: {
      containers: [{
        name: 'ttsazure'
        resources: {
          cpu: json('2.0')
          memory: '4Gi'
        }
        probes: [
          {
            type: 'Liveness'
            httpGet: { path: '/health', port: 8080 }
          },
          {
            type: 'Readiness' 
            httpGet: { path: '/health', port: 8080 }
          }
        ]
      }]
      scale: {
        minReplicas: 1    // Always-on for instant response
        maxReplicas: 3    // Auto-scaling under load
        rules: [{
          name: 'http-scaling'
          http: { metadata: { concurrentRequests: '10' }}
        }]
      }
    }
  }
}

9) Monitoring and Observability

A lightweight health endpoint surfaces key runtime signals:

@GetMapping("/health")
@ResponseBody
public Map<String, Object> health() {
    return Map.of(
        "status", "UP",
        "audioStoreSize", audioStore.size(),
        "rateLimitConfig", rateLimitProperties,
        "timestamp", Instant.now()
    );
}

10) The Final Product: Sarcastic MP3

The result is a downloadable, expressive MP3—in this case, a perfectly dry sarcastic read—generated securely (no secrets), reliably (autoscaled), and repeatably (azd up). The included sarcastic.mp3 in the repository demonstrates the end-to-end system in action. Deploy it yourself or give it a try at https://aka.ms/ttsazure.
