Video showing the IBM Cloud AI Assistant user experience in the side-panel design component.
Measuring trust in AI
foundational · tactical
TL;DR: IBM Cloud's Documentation team proposed building the platform's first GenAI experience: a platform-wide AI Assistant. I led the foundational and tactical research to understand how users think about and trust GenAI, and to evaluate early concepts and later-stage designs. These findings gave the team the confidence to ship a trustworthy experience that cut task completion times by more than 70% and surfaced 10 usability issues before launch.
Product: IBM Cloud Documentation
Role: Lead UX Researcher
Topics: AI Assistants; RAG; AI Evaluation; Trust in AI
Methods: Desk Research, Usability Tests
Impact & Influence
70%+ reduction in user task completion times
10 usability issues discovered prior to launch
Helped secure executive buy-in for the project
Developed a user trust measurement for AI tools
-
In late 2023, IBM Cloud's Documentation (Docs) team proposed building a platform-wide AI Assistant (AIA). The potential was huge: a successful AIA could drastically decrease user task completion times in the docs space and lay the foundations for more advanced, agentic experiences. But as the first GenAI experience on the platform, the stakes were high.
If the AIA’s answers were untrustworthy, it could hurt not just the usage and adoption of this specific AI tool, but the adoption of all AI-infused experiences across IBM Cloud.
Recognizing the potential and stakes, the lead PM brought me in to lead foundational and tactical research. My goal was to define what a useful, usable, and, most importantly, trustworthy AI experience should be for our users.
-
GenAI behaves differently from non-AI tools. Traditional software, barring any errors, behaves deterministically: press “Submit” a hundred times and it triggers the same hard-coded action in the system every time.
On a fundamental level, GenAI doesn’t work like this. It’s probabilistic, not deterministic. Each time a GenAI system is asked to do something, the output is the result of statistical computation within the system, so the same request can produce different results. Because of this probabilistic nature, trust in the AI system’s abilities and intentions shapes usage.
Therefore, I needed to measure not only the AI Assistant's usability but also its trustworthiness. The challenge was to find a measurement that was (a) lightweight enough to use alongside standard usability scales and (b) nuanced enough to capture the multifaceted nature of trust.
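To make the deterministic-versus-probabilistic contrast above concrete, here is a minimal, purely hypothetical Python sketch. The function names, canned answers, and weights are invented for illustration only and are not part of IBM Cloud or the AI Assistant.

```python
import random

# Hypothetical illustration (not IBM Cloud code): the same request handled
# deterministically vs. probabilistically.

def submit_form(payload: dict) -> str:
    # Traditional software: barring errors, the same input always triggers
    # the same hard-coded action and returns the same result.
    return f"Created resource '{payload['name']}'"

def genai_answer(prompt: str) -> str:
    # GenAI: the response is sampled from a probability distribution over
    # possible outputs, so the same prompt can yield different answers.
    candidate_answers = [
        "Open the catalog, choose a service, and click Create.",
        "From the Resource list, select Create resource and follow the steps.",
        "You can also create the resource from the CLI.",
    ]
    sampling_weights = [0.5, 0.3, 0.2]  # stand-in for model probabilities
    return random.choices(candidate_answers, weights=sampling_weights)[0]

print(submit_form({"name": "demo"}))                # identical on every run
print(genai_answer("How do I create a resource?"))  # may vary between runs
```

Because users cannot rely on identical outputs every time, their willingness to act on an answer rests on how much they trust the system, which is why trust needed to be measured alongside usability.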
-
Objective 1: Understand Human-AI Interaction (HAX); define mental models, usage patterns, enterprise needs and expectations for GenAI, and principles for building GenAI experiences.
Key Question: How do people think about and use AI tools?
Key Question: How do they want to use them?
Key Question: How do risk and other factors influence usage of AI tools?
Key Question: What are enterprises’ needs for GenAI?
-
Objective 2: Evaluate (test) what we built and refine it.
Key Question: Can users use the new IBM Cloud AI Assistant as intended—i.e., to answer “What-Is” questions and complete “How-To” questions?
Key Question: Do IBM Cloud users’ mental models and usage patterns map to foundational research findings?
-
My research was carried out in two phases:
A discovery phase to build a foundational understanding of GenAI, UX for AI, and GenAI enterprise needs.
An evaluation phase to test the AI Assistant prototype.
-
Method: Desk Research
Why: To better understand the domain and build the foundational understanding needed to evaluate IBM Cloud’s AI Assistant. (OBJ 1)
What: Sources reviewed and triangulated included public and proprietary surveys (e.g., Gallup and C-Suite reports), academic papers, and industry and thought-leader blogs on GenAI.
KEY INSIGHT: Everything Hinges on Trust (OBJ 1)
In a domain that is marked by volatility, the one constant we can be sure of is the central role trust plays in GenAI adoption and usage.
With too little trust in the AI, users miss out on its potential benefits; with too much trust, users over-rely on the AI—e.g., blindly accepting its output as truth without double-checking it—with potentially negative consequences. Teams building GenAI solutions need to strike a middle ground, or “calibrated trust.”
-
Method: Usability Test
Why: Once the lead developer had a working prototype, I tested the solution in the product’s dev environment to see whether users could use the AI Assistant as intended—i.e., to answer “What-Is” questions and complete “How-To” questions. (OBJ 2)
Who: 8 cloud users of varying technical backgrounds (e.g., Admins and Developers)
What: Remote, 60-minute think-aloud sessions in which participants were asked to complete tasks using IBM Cloud’s AI Assistant.
KEY INSIGHT: From Bot to Butler: Cloud’s AI Assistant must follow users around until dismissed
A key finding in this research was that participants couldn’t complete “How-To” questions—a critical usability issue that prevented intended use.
The reason was that the AI Assistant, which opened in a side panel, closed as soon as participants navigated away from the starting page. This meant they either (a) had to remember all the steps, (b) had to reopen the AI Assistant and ask the question again because there was no session history (another usability issue), or (c) gave up on completing the task.
Completing tasks often required users to traverse multiple pages within the IBM Cloud platform. The AI Assistant needs to be omnipresent while the user works through a task, showing the steps as they complete them. It should close only when the user dismisses it, much like someone would dismiss a butler.
-
From this work, I…
Learned how to move fast in ambiguous spaces like agentic AI and AI SaaS.
Honed my ability to communicate with and manage relationships across different kinds of stakeholders, from ICs to leads to directors to executives.
Learned that some research recommendations, even if not initially adopted or accepted, can gain traction over time, for example by continuing to provide supporting evidence or examples after the study.