How Forward-Deployed Engineers Run Customer Discovery at AI Companies in 2026

TL;DR

Forward-deployed engineers (FDEs) at Anthropic, OpenAI, Palantir, Databricks, and Cohere run customer discovery as a core part of the job — not a hand-off to product managers. The FDE discovery loop is a four-week cycle: Week 0 stakeholder map and eval-customer split, Week 1-2 async customer interviews at scale, Week 3-4 eval co-design with customer subject-matter experts, and a continuous post-launch feedback layer used as a retention mechanism. The under-documented half of the FDE job is conversational research: surfacing latent requirements the executive sponsor doesn't know to articulate. A 2024 a16z survey of 70+ enterprise AI buyers reported enterprise generative-AI spend tripled year over year while buyer confidence in vendor discovery dropped — the gap is now closed by FDEs running structured customer research alongside code. This guide covers what to ask end-users vs admins vs executive sponsors, how to design evals from interview data, five FDE discovery anti-patterns, and the 2026 tool stack — including Perspective AI as the conversational research layer.

Why FDEs Run Customer Discovery (and Why the Customer's PM Can't)

FDEs run customer discovery because the customer's own PM rarely has the technical depth to translate fuzzy executive sponsor goals into eval-ready engineering tasks. The forward-deployed model — pioneered at Palantir, copied by Anthropic, OpenAI, Databricks, and Cohere — exists because shipping enterprise AI without embedded technical discovery produces a demo that works in a deck and a pilot that dies in production. The customer's PM owns roadmap politics; the FDE owns finding out what the system actually has to do.

The executive sponsor signs the contract on a vision ("automate underwriting", "deflect 60% of tier-1 tickets"). Their PM translates it into Jira tickets. Both miss the latent requirements — regulatory edge cases, legacy CSV quirks, team norms like "we never auto-decline a claim under $2k" — that determine whether the system survives launch. The FDE finds those. For background see why the solutions engineer role is being replaced by forward-deployed AI engineers and Palantir's forward-deployed engineering playbook Anthropic and OpenAI are copying.

LLM applications also fail in ways traditional SaaS does not: hallucinations on uncommon prompts, retrieval misses on poorly indexed corpora, refusals on legitimate enterprise jargon. None of that surfaces in a slide deck — it surfaces in interviews with the analyst who uses the system 40 hours a week. Skipping that step is the most common reason enterprise AI pilots stall, a pattern documented in our 2026 AI customer interview report covering 500 hours of moderated sessions.

Week 0: The Stakeholder Map and the Eval-Customer Split

Week 0 starts before any code is written and produces two deliverables: a stakeholder map and an eval-customer split. The map names every human who can kill, delay, or expand the deployment. The split decides which of those humans the system will be evaluated by (eval owners) and which it will be evaluated for (end customers whose workflow it transforms).

Build the stakeholder map with five named slots:

  • Executive sponsor — VP or C-level who signed the SOW. Owns budget and political narrative.
  • Eval owner — senior IC who will judge whether outputs are "good." Often a domain expert: senior underwriter, tenured analyst, head of legal ops.
  • End users — people who will run the system 5+ hours a week. Frequently not the eval owner.
  • Admin / IT counterpart — responsible for data access, SSO, audit logs, deployment.
  • Skeptic — the named human inside the customer org who thinks this project will fail. Talking to them early is the highest-leverage hour of Week 0.

Make the eval-customer split explicit: the eval owner co-designs the rubric (Week 3-4), end users generate the interview corpus (Week 1-2), and the skeptic gets the same rigor — not less. As Databricks' FDE team running customer research across its $62B data lakehouse footprint shows, the eval owner is rarely the loudest voice at kickoff. FDEs who only listen to the executive sponsor build for the wrong rubric.
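
A minimal sketch of both Week 0 deliverables as data. The schema, the names, and the role taxonomy below are illustrative assumptions, not a prescribed format:

```python
from dataclasses import dataclass

@dataclass
class Stakeholder:
    name: str             # illustrative names throughout
    role: str             # "sponsor" | "eval_owner" | "end_user" | "admin" | "skeptic"
    judges_outputs: bool  # True = eval-owner side of the split

stakeholder_map = [
    Stakeholder("VP Claims Ops", "sponsor", judges_outputs=False),
    Stakeholder("Senior Underwriter", "eval_owner", judges_outputs=True),
    Stakeholder("Tier-1 Claims Analyst", "end_user", judges_outputs=False),
    Stakeholder("IT Platform Lead", "admin", judges_outputs=False),
    # The skeptic helps build the adversarial eval set in Week 3-4.
    Stakeholder("Staff Engineer (skeptic)", "skeptic", judges_outputs=True),
]

# The eval-customer split falls straight out of the map: eval owners
# co-design the rubric in Week 3-4; everyone else feeds the Week 1-2
# interview corpus.
eval_owners = [s for s in stakeholder_map if s.judges_outputs]
interview_pool = [s for s in stakeholder_map if not s.judges_outputs]
```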

Week 1-2: Async Customer Interviews at Scale

In weeks 1-2, the FDE runs structured async interviews with 15–40 end users to surface real workflow texture. Scheduling 40 synchronous 45-minute calls in two weeks is a non-starter at most enterprise customers, and a survey form returns flat, schema-coerced data that misses the "it depends" cases AI systems live or die on.

The high-leverage move is async, AI-moderated customer interviews. Tools like Perspective AI let one FDE send an interview link to every end user, get conversational depth on every response, and have synthesis auto-extracted into themes — without scheduling a calendar event. Each interview behaves like a 20-minute 1:1 with an experienced researcher: the AI follows up on vague answers, asks "why" and "can you give me an example," and adapts the next question to what the respondent already said. We unpack this in the playbook for AI-moderated customer interviews and the broader shift in the discovery call is dead — what AI conversations replaced it with.

The interview structure FDEs use in Week 1-2 has four parts:

  1. Current workflow walkthrough — "Walk me through the last time you did X." Specific, recent, concrete. Never hypothetical.
  2. Frustration mining — "What's the part of this you'd pay someone to do for you?" Surfaces high-willingness-to-pay automation candidates.
  3. Edge cases and exceptions — "When does the standard process not apply?" This is where latent eval requirements live.
  4. Tool stack and constraint mapping — "What systems do you touch in a typical day? Which can't change?"

Run with 15–40 end users. Variance matters more than volume — recruit across tenure, seniority, geography, and team. The analytical playbook is in our guide to customer feedback analysis as an AI-first workflow that cuts synthesis from weeks to hours.
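
The four-part structure can be encoded as a guide that an async interview tool (or a homegrown script) executes consistently across all 15–40 respondents. This is a sketch of the shape, not any specific tool's API:

```python
# Week 1-2 interview guide; section names track the four parts above.
# "probes" are the follow-ups a moderator — human or AI — falls back on
# when a first answer stays vague or hypothetical.
INTERVIEW_GUIDE = [
    {
        "section": "workflow_walkthrough",
        "opener": "Walk me through the last time you did X.",
        "probes": ["What happened right before that?",
                   "Can you give me a concrete example?"],
    },
    {
        "section": "frustration_mining",
        "opener": "What's the part of this you'd pay someone to do for you?",
        "probes": ["Why that part?", "How often does it come up?"],
    },
    {
        "section": "edge_cases",
        "opener": "When does the standard process not apply?",
        "probes": ["What did you do the last time that happened?"],
    },
    {
        "section": "tools_and_constraints",
        "opener": "What systems do you touch in a typical day? Which can't change?",
        "probes": ["Who owns that constraint?"],
    },
]
```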

Week 3-4: Eval Co-Design with Customer SMEs

Weeks 3-4 are when interview data becomes evals. The FDE turns Week 1-2 themes, edge cases, and exception patterns into a structured evaluation rubric co-designed with the eval owner. Eval co-design distinguishes an FDE deployment from a generic pilot — and it is the step most teams skip.

The eval set has three layers:

  • Golden examples — 40–80 input/output pairs the eval owner explicitly approves. Used for offline regression testing.
  • Edge-case set — 20–50 inputs derived from interview themes (the "weird Tuesday" cases). Requirements the executive sponsor didn't know to mention.
  • Adversarial set — 10–20 prompts designed to break the system. Often built with the skeptic from the stakeholder map.

Co-design means the customer's SME sits with the FDE — async or sync — and labels examples together. Anthropic's guidance on building evals for production LLM applications emphasizes the same pattern: domain experts owning the rubric beats engineers guessing. The output is a CI-runnable eval suite and a human-rated benchmark, both grounded in the interview corpus. For how customer-side SMEs participate without becoming a bottleneck, see feature prioritization using AI customer research to rank the roadmap.
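
As a concrete illustration of "CI-runnable," here is a minimal sketch of an eval gate over the three layers above. The JSONL layout, file names, pass thresholds, and the generate() stand-in are all assumptions, and exact-match scoring is a simplification — rubric-graded layers usually need human or LLM-as-judge rating instead:

```python
import json

def generate(prompt: str) -> str:
    """Stand-in for a call to the deployed system — replace with the real client."""
    raise NotImplementedError

def run_layer(path: str, min_pass_rate: float) -> None:
    # One JSONL row per labeled example, traced back to its origin:
    # {"input": "...", "expected": "...", "source": "interview theme #12"}
    with open(path) as f:
        cases = [json.loads(line) for line in f]
    passed = sum(1 for c in cases
                 if generate(c["input"]).strip() == c["expected"].strip())
    rate = passed / len(cases)
    assert rate >= min_pass_rate, f"{path}: {rate:.0%} < {min_pass_rate:.0%}"

# Thresholds are set with the eval owner, not by the FDE alone.
run_layer("golden.jsonl", min_pass_rate=0.95)       # 40-80 SME-approved pairs
run_layer("edge_cases.jsonl", min_pass_rate=0.80)   # "weird Tuesday" inputs
run_layer("adversarial.jsonl", min_pass_rate=0.60)  # skeptic's failure model
```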

Post-Launch: Continuous Feedback Loops as Retention

Post-launch is where most FDE deployments quietly decay; the best teams treat continuous customer feedback as a retention mechanism. Three feedback loops run in parallel after go-live (trigger logic for the first is sketched after the list):

  1. In-product micro-interviews — short conversational prompts triggered after specific actions (a human override of an AI decision, a thumbs-down, the third retry on the same query). Async, opt-in, conversational — not a 5-star widget.
  2. Monthly cohort interviews — re-interview a rolling sample of the original Week 1-2 end users to detect workflow drift, new edge cases, and emerging skepticism.
  3. Eval owner debriefs — quarterly 30-minute syncs with the eval owner to re-rate a sample of production outputs and re-baseline the rubric.
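
The first loop is mostly plumbing. A sketch of the trigger layer, where the event names and the ask_micro_interview() hook are hypothetical rather than any specific product's SDK:

```python
TRIGGER_EVENTS = {"human_override", "thumbs_down"}
RETRY_THRESHOLD = 3  # "the third retry on the same query"

def ask_micro_interview(user_id: str, opener: str) -> None:
    """Stand-in for sending an async, opt-in conversational prompt."""
    print(f"[micro-interview -> {user_id}] {opener}")

def on_product_event(event: str, user_id: str, retry_count: int = 0) -> None:
    # Fire a short conversational prompt right after a moment of friction —
    # not a 5-star widget. Openers would vary by event type in practice.
    if event in TRIGGER_EVENTS or (event == "retry" and retry_count >= RETRY_THRESHOLD):
        ask_micro_interview(user_id, opener="What did the AI miss just now?")
```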

Continuous discovery isn't new — Teresa Torres has been writing about it for years — but FDE teams are operationalizing it in 2026 in a way most PM orgs never managed, because the same conversational infrastructure that powers Week 1-2 also powers post-launch micro-interviews. See how Teresa Torres' continuous discovery framework operationalizes with AI conversations and our 2026 continuous discovery report on always-on research.

What to Ask End-Users vs Admins vs Executive Sponsors

Different stakeholders answer different questions; asking a sponsor about CSV quirks is as useless as asking an end user about budget. Cheat sheet:

  • Executive sponsor — ask about strategic outcomes, success metrics, political constraints, and who else is invested in this working or failing. Don't ask "How does the current workflow work?" — they don't know it in enough detail.
  • End users — ask for last-time-I-did-X walkthroughs, exceptions, frustrations, tool stack, and what makes a "good" output to them. Don't ask "What ROI do you expect?" — not their job.
  • Eval owner / SME — ask what separates a great output from a passable one, where junior people get it wrong, and which edge cases matter. Don't ask them to grade hypothetical outputs — give them real ones.
  • Admin / IT — ask about data access patterns, audit requirements, SSO, deployment topology, and retention policies. Skip workflow questions — they don't use the product.
  • Skeptic — ask what would have to be true for this to actually work, and where similar projects have failed inside this org. Don't try to convert them — extract their failure model.

Asking everyone the same script is a common mistake — scripts should be persona-specific. Our stakeholder interview template and user research interview template are the right shapes for Week 0 and Week 1-2.

Five FDE Discovery Anti-Patterns (and How to Avoid Them)

Five anti-patterns show up in nearly every failing FDE deployment.

  1. Only talking to the executive sponsor. The sponsor signs the contract but doesn't know the workflow. Interview at least 10 end users before writing production code.
  2. Treating the customer's PM as the source of truth. The customer's PM has the same blind spots an internal PM does — they're synthesizing other people's accounts. Get to the source.
  3. Synchronous-only interviews. 45-minute Zoom calls don't scale past 8–10 customers. By interview 20, the FDE is exhausted and data quality drops. Async conversational interviews fix this.
  4. Skipping the skeptic. Every org has one. They surface failure modes your project plan didn't account for. Talk to them in Week 0, not the post-mortem.
  5. Eval co-design as an afterthought. Building evals after launch means evaluating against the FDE's guess of "good" instead of the SME's definition. Co-design starts no later than Week 3.

Without eval co-design, FDE work collapses into vibes — and vibes don't survive a renewal.

Tools the Best FDEs Use for Customer Discovery in 2026

The 2026 FDE discovery stack has three layers; the conversational research layer is the new addition most teams underinvest in.

The full stack comparison is in our 2026 best tools for forward-deployed engineers stack comparison. For how to structure the function itself see why every AI startup needs a forward-deployed engineering function and how to build a forward-deployed engineering function — the founder playbook.

Frequently Asked Questions

What is forward-deployed engineer customer research?

Forward-deployed engineer customer research is the structured discovery work that embedded enterprise AI engineers do directly with a customer's end users, admins, and subject-matter experts to define system requirements, design evaluations, and operate continuous feedback loops. Unlike traditional solutions-engineering scoping, FDE customer research treats interviews and eval co-design as engineering inputs — equivalent in rigor to schema design — because LLM applications fail on edge cases that only domain experts know to surface.

How is FDE discovery different from a PM's customer research?

FDE discovery is technical, embedded, and eval-oriented; PM customer research is roadmap- and prioritization-oriented. An FDE runs discovery to define what the system has to do correctly on the long tail — regulatory, legacy-data, and edge-case constraints that only show up in production. A PM runs discovery to decide what to build next. The outputs differ: a PM produces a prioritized backlog; an FDE produces a labeled eval set and a co-designed rubric.

How many customer interviews should an FDE run before writing production code?

A typical FDE runs 15–40 async customer interviews in the first two weeks, plus targeted synchronous sessions with the eval owner and skeptic. The number depends on workflow diversity: a single-team deployment may need 15, a multi-region or multi-LOB deployment may need 40+. Async AI-moderated interviews make the upper bound tractable — synchronous-only programs stall around interview 8–10.

What's the relationship between customer interviews and LLM evals?

Customer interviews are the raw material for LLM evals. Themes and exception patterns surfaced in Week 1-2 become labeled examples in the eval set: golden examples (approved by the SME), edge cases (the "weird Tuesday" inputs), and adversarial cases (drawn from the skeptic's failure model). Without an interview corpus, evals collapse into engineering guesses; with it, the eval owner can co-design a rubric that reflects the customer's definition of "good."

Which companies have the most mature FDE customer-discovery practices?

Palantir invented the modern FDE model; Anthropic's applied AI engineers, OpenAI's forward-deployed team, Databricks' field engineering, and Cohere's deployed AI engineering are the most-cited 2026 examples. See our deep-dives on Anthropic's applied AI engineers and the forward-deployed Claude enterprise function, OpenAI's forward-deployed engineering team and customer-embedded AI, and Cohere's forward-deployed strategy of building enterprise LLMs with customers.

Can a single FDE realistically run discovery, build, and post-launch loops alone?

A single FDE can run a small-to-medium deployment alone with async conversational research infrastructure handling interview scale; without it, they can't. The bottleneck is not coding — it's the synchronous time required to interview, synthesize, and re-interview. Async AI-moderated interviews compress that from 40 hours/week to about 4 hours/week of human attention, which is what makes single-FDE deployments viable in 2026.

Conclusion

Forward-deployed engineer customer research is what separates AI deployments that survive renewal from pilots that die in production. The four-week loop — Week 0 stakeholder map, Week 1-2 async interviews, Week 3-4 eval co-design, post-launch continuous feedback — is the standard playbook at Palantir, Anthropic, OpenAI, Databricks, and Cohere. The unlock that made it tractable in 2026 is conversational research infrastructure: async AI-moderated interviews that let one FDE run discovery across 40 end users without scheduling a Zoom call.

If you're standing up an FDE function or trying to make the discovery half scale, Perspective AI is the conversational research layer FDE teams use to compress interview programs from months into weeks. Start a research project or explore the platform built for product teams.
