Building AI Products Without a Data Science Team
The era of needing a PhD to ship an AI feature is over. The era of needing product judgment is just beginning.
Two years ago, I was advising a mid-stage enterprise SaaS company that wanted to add intelligent document classification to their product. The data science team estimated three months to collect training data, build a custom model, and deploy it. The project stalled in prioritization for two quarters because leadership could not justify the headcount.
Last month, a product manager on that same team shipped the feature in two weeks. She used the OpenAI API, wrote a prompt, built an evaluation set of 200 documents, and iterated until accuracy hit 94 percent. No ML engineers. No training pipeline. No GPU clusters.
That moment crystallized something I have been watching across the industry for the past year. The bottleneck for AI products has shifted — from data science capability to product judgment.
What Changed
The short answer is foundation models. GPT-3 arrived in 2020 and hinted at what was possible. ChatGPT launched in November 2022 and made it undeniable. Now, with GPT-3.5 and GPT-4 available through OpenAI's API, product teams can access state-of-the-art language understanding with an HTTP request.
Before foundation models, building an AI feature meant assembling a pipeline: collect labeled data, choose a model architecture, train it, evaluate it, deploy it, monitor it. Each step required specialized expertise. Most product teams could not do this without hiring a data science team or contracting with an ML consultancy.
Now, that entire pipeline collapses into a few components: an API call, a well-crafted prompt, and a good evaluation framework. The model is already trained. The infrastructure is already managed. What remains is deciding what to build and how to evaluate whether it works.
This is not a small shift. In my experience, it is the most significant change in who can build intelligent software since cloud computing democratized infrastructure.
A Concrete Example: TaskFlow
Let me make this concrete with a fictional company I will call TaskFlow — a project management SaaS for mid-market teams, roughly 500 customers, 30 employees, no data scientists on staff.
TaskFlow's product team identified two pain points from customer interviews. First, project managers spent 20 minutes per day writing status summaries from task updates. Second, incoming support tickets were manually triaged and routed, which added latency and errors.
Here is what they built, and how.
AI Status Summaries. TaskFlow's backend already aggregated task updates into a structured feed. The engineering team wrote a function that collected the past week of updates for a project, formatted them into a prompt, and sent them to the OpenAI API with instructions to produce a concise status summary. The PM wrote the prompt, tested it against 50 real projects, and refined the wording until the summaries matched what a human PM would write. Total development time: one engineer, one PM, eight days.
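The flow described above can be sketched in a few lines of Python. This is a minimal illustration, not TaskFlow's actual code: the function name, prompt wording, and project fields are invented for this sketch, and the commented-out API call uses the 2023-era `openai.ChatCompletion.create` style, which may differ from the SDK you are using.

```python
def build_summary_messages(project_name, updates):
    """Format a week of task updates into a chat-style prompt for summarization."""
    update_lines = "\n".join(f"- {u}" for u in updates)
    user_prompt = (
        f"Project: {project_name}\n\n"
        "Task updates from the past week:\n"
        f"{update_lines}\n\n"
        "Write a concise status summary (3-4 sentences) covering progress, "
        "blockers, and next steps, in the tone a human PM would use."
    )
    return [
        {"role": "system",
         "content": "You write weekly status summaries for project stakeholders."},
        {"role": "user", "content": user_prompt},
    ]

# The messages would then go to the chat API (requires the openai package
# and an API key; shown for illustration only):
# response = openai.ChatCompletion.create(model="gpt-4", messages=messages)
# summary = response["choices"][0]["message"]["content"]
```

Keeping the prompt construction in a plain function like this is what lets a PM iterate on the wording against 50 real projects without touching the rest of the pipeline.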
Ticket Classification. For support ticket routing, the team sent each incoming ticket to GPT-3.5 with a prompt that included the five support categories and asked the model to classify the ticket. They built a simple evaluation harness that compared model classifications against 300 historically labeled tickets. First-pass accuracy was 82 percent. After refining the prompt with examples and edge case instructions, they reached 91 percent — better than the manual process, which they measured at 85 percent consistency.
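An evaluation harness like the one described needs surprisingly little code. The sketch below is a generic version under my own naming, not TaskFlow's implementation: the classifier is passed in as a callable, so in production it would wrap the chat API, while in tests any stub works. Returning the misclassified examples, not just the accuracy number, is what makes prompt iteration possible.

```python
def build_classification_prompt(ticket_text, categories):
    """Ask the model to pick exactly one of the support categories."""
    cat_list = "\n".join(f"- {c}" for c in categories)
    return (
        "Classify the support ticket into exactly one category. "
        "Reply with the category name only.\n\n"
        f"Categories:\n{cat_list}\n\n"
        f"Ticket:\n{ticket_text}"
    )

def evaluate(classify_fn, labeled_tickets):
    """Compare model classifications against historically labeled tickets.

    classify_fn: callable(ticket_text) -> category string (e.g. an API call).
    labeled_tickets: list of (ticket_text, true_category) pairs.
    Returns (accuracy, misclassified examples) for prompt iteration.
    """
    errors = []
    for text, label in labeled_tickets:
        predicted = classify_fn(text)
        if predicted.strip().lower() != label.strip().lower():
            errors.append((text, label, predicted))
    accuracy = 1 - len(errors) / len(labeled_tickets)
    return accuracy, errors
```

Run this after every prompt change against the same 300 labeled tickets and you know immediately whether the change moved you toward 91 percent or away from it.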
Neither feature required training a model. Neither required a data scientist. Both required a product manager who understood the problem deeply enough to write a good prompt and evaluate the output critically.
What You Still Need
The absence of a data science team does not mean the absence of rigor. In fact, what I have found is that the skills required shift rather than disappear.
Product judgment matters more than ever. When anyone can add an AI feature in a week, the question is no longer "can we build this?" but "should we build this, and what does good look like?" The PM must define what success means before writing a single prompt.
Evaluation discipline is the new core competency. Without a labeled test set and a clear accuracy target, teams ship AI features that feel impressive in demos but fail unpredictably in production. TaskFlow's ticket classification would have launched at 82 percent accuracy without their evaluation harness — workable, but not better than the manual process it replaced.
Prompt engineering is a real skill, not a buzzword. The difference between a naive prompt and a well-structured one can be 15 to 20 percentage points of accuracy. This is not about tricks. It is about understanding what context the model needs to do its job, just as a good manager understands what context a new hire needs.
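To make the naive-versus-structured contrast concrete, here is a sketch of the two styles side by side. The category names, few-shot examples, and the edge-case rule are all invented for illustration; the point is the shape of the structured prompt, not its specific wording.

```python
def naive_prompt(ticket):
    """The kind of prompt that leaves accuracy on the table."""
    return f"What category is this support ticket? {ticket}"

def structured_prompt(ticket, categories, examples):
    """Give the model the context a new hire would need: a role, the explicit
    options, a few worked examples, and rules for ambiguous cases."""
    parts = [
        "You route support tickets. Choose exactly one category from the "
        "list and reply with the category name only.",
        "Categories: " + ", ".join(categories),
        "Examples:",
    ]
    parts += [f'Ticket: "{t}" -> {c}' for t, c in examples]
    # An edge-case rule (illustrative): tell the model how to break ties.
    parts.append("If a ticket mentions both a bug and a payment, prefer Billing.")
    parts.append(f'Ticket: "{ticket}" ->')
    return "\n".join(parts)
```

The structured version is longer, but every added section corresponds to context a human triager would also need before classifying reliably.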
User testing remains essential. AI outputs are probabilistic. They will sometimes be wrong. The product must be designed so that users can catch errors, provide feedback, and override the model gracefully. TaskFlow added a "this summary is inaccurate" button that logged corrections — both for product improvement and for building a future evaluation set.
The New Build vs. Buy Decision
Not every AI feature should be an API call. In my experience, the decision framework looks roughly like this:
Use a foundation model API when the task is well-served by general language understanding, your data is not highly specialized, and speed to market matters. This covers a surprising range: summarization, classification, extraction, Q&A, translation, and content generation. For most product teams starting out, this is the right default.
Fine-tune a model when your domain has specialized vocabulary or patterns that the base model handles poorly, and you have at least a few hundred high-quality examples. Fine-tuning OpenAI models is now straightforward and does not require ML infrastructure expertise. But it does require a good dataset, which means you need evaluation discipline first.
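For teams curious what "a good dataset" looks like in practice: as of late 2023, OpenAI's fine-tuning for chat models accepts a JSONL file where each line is a complete chat example. The helper below is a hypothetical sketch that converts labeled tickets into that shape; verify the exact format against the current OpenAI documentation, since it has changed before.

```python
import json

def to_finetune_jsonl(labeled_tickets, categories, path):
    """Write labeled (ticket, category) pairs as a JSONL fine-tuning file.

    Each line is one training example in chat format: a system instruction,
    the ticket as the user turn, and the correct label as the assistant turn.
    """
    system = "Classify the support ticket into one of: " + ", ".join(categories)
    with open(path, "w") as f:
        for text, label in labeled_tickets:
            record = {"messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": text},
                {"role": "assistant", "content": label},
            ]}
            f.write(json.dumps(record) + "\n")
```

Notice that the input here is exactly the labeled evaluation set you should already have, which is why evaluation discipline comes before fine-tuning, not after.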
Build a custom model when you have a truly unique data advantage, the task is narrow and well-defined, latency or cost constraints rule out API calls, or your data cannot leave your infrastructure for regulatory reasons. This is where you need data science expertise, and for most product teams in 2023, this is not where you should start.
The mistake I see most often is teams jumping to fine-tuning or custom models before they have exhausted what a well-prompted API call can do. Start with the simplest approach. Measure it. Only increase complexity when you have evidence that the simpler approach is insufficient.
Common Mistakes in the API Era
Having advised several teams through their first AI features over the past six months, I see a few patterns worth calling out.
Treating the model as magic. Teams ship a feature, see it work impressively in a demo, and assume it will work everywhere. It will not. Language models are stochastic. They will confidently produce wrong answers. Every AI feature needs a failure mode — what happens when the model is wrong, and how does the user recover.
Skipping evaluation. This is the most common and most damaging mistake. Without an evaluation set, you cannot measure whether your prompt changes make things better or worse. You are flying blind. Building even a small evaluation set of 100 to 200 labeled examples transforms your development process from guesswork to engineering.
Ignoring cost and latency. GPT-4 is remarkable but slow and expensive at scale. GPT-3.5 is faster and cheaper but less capable. The right model depends on your use case, volume, and latency requirements. TaskFlow used GPT-4 for status summaries (low volume, high quality bar) and GPT-3.5 for ticket classification (high volume, acceptable at lower capability). This kind of model selection is a product decision, not a technical one.
No fallback for failure. When the API is down, when the model returns nonsense, when the response does not parse — your feature needs graceful degradation. TaskFlow's status summary feature fell back to a structured bullet list of raw updates when the API was unavailable. Not as polished, but functional.
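The degradation pattern is simple to express in code. This is a generic sketch rather than TaskFlow's implementation: the model call is injected as a callable so that an outage, a timeout, or an unparseable response all collapse into one recovery path.

```python
def project_summary(updates, summarize_fn):
    """Return a model-written summary, degrading to a raw bullet list on failure.

    summarize_fn: callable(updates) -> summary string (e.g. wrapping the chat
    API). Any exception or empty result triggers the fallback.
    """
    try:
        summary = summarize_fn(updates)
        if summary and summary.strip():
            return summary
    except Exception:
        pass  # API down, timeout, unparseable response, etc.
    # Fallback: less polished than a prose summary, but always functional.
    return "Status updates this week:\n" + "\n".join(f"- {u}" for u in updates)
```

The key design choice is that the caller never sees an error state: the feature's worst case is a plainer output, not a broken one.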
Why Product Managers Are the New Bottleneck
Here is what I find most interesting about this moment. For the past decade, the scarce resource for AI products was technical talent — data scientists, ML engineers, people who could train and deploy models. Companies hoarded this talent, built centralized AI teams, and created long queues of product teams waiting for ML resources.
Foundation models have largely dissolved that bottleneck. The new scarce resource is product managers who can think clearly about AI capabilities: what to build, how to evaluate it, when the model is good enough, how to design for failure, and how to iterate based on user feedback.
This is not a lower bar. It is a different bar. In my experience, the product managers who excel in this new environment share a few traits. They are comfortable with probabilistic systems — outputs that are right 90 percent of the time, not 100 percent. They think in evaluation frameworks, not feature checklists. They design user experiences that accommodate model errors rather than assuming perfection.
The companies that will build the best AI products in the next few years are not necessarily the ones with the largest data science teams. They are the ones with product leaders who understand what these models can and cannot do, and who have the judgment to ship features that create real value rather than just technical novelty.
If you are a product leader without a data science team, this is your moment. The tools are accessible. The barrier is no longer technical capability. It is product clarity — knowing what problem you are solving, how you will know it is working, and what you will do when it is not.
What do you think? I would love to hear your perspective — feel free to reach out.
Founder, BusinessOfAI.com
Product management executive with 15+ years building enterprise software. Created 8 major products generating $2B+ in incremental revenue.