Our team at Kiseki Labs attended the 2025 AI Engineer Summit in New York from February 20-22. Now in its third year, this summit has established itself as the premier technical conference worldwide for AI engineers and leadership to connect, share knowledge, and advance the field collectively. This year's theme, "Agents at Work," placed special emphasis on production implementation stories, particularly in Finance—a sector where reliable AI is non-negotiable.
The summit brought together the industry's leading AI Agent builders from companies like OpenAI, Anthropic, LinkedIn, Bloomberg, and Jane Street, who shared their real-world experiences, challenges, and solutions as we move deeper into what many are calling "The Year of Agents."
Drawing from our notes and observations across the summit, we've distilled eight key themes that reveal where the AI landscape is heading and what organisations need to prioritise in their AI strategy.
Positioned between Software and ML Engineering, AI Engineering is carving out its own identity. As Swyx (Shawn Wang), the co-organiser of the conference and co-host of the popular Latent Space podcast, highlighted in his keynote, "AI Engineering will emerge as a separate discipline" with its own methodologies and best practices. He observed that the field is currently "torn between software and ML engineering," but rapidly establishing its unique characteristics.
The skills required blend traditional software engineering principles with machine learning expertise, but with unique additions. Xiaofeng Wang, LinkedIn's Engineering Manager for Generative AI Foundations, shared how their teams are restructuring around this new discipline, noting that "anyone can be an AI Engineer if they can improve system behaviour." Their hiring strategy now focuses on engineering skills over ML expertise, valuing diverse backgrounds and critical thinking.
This evolution is creating new career paths and organisational structures. Companies like Jane Street shared how they're now balancing engineering talent with domain experts in a collaborative environment, focusing on "potential with growth mindset instead of pure experience." This is particularly important for specialised domains, as Jane Street demonstrated by building AI capabilities for OCaml—a functional programming language they use extensively but which has minimal presence in public repositories. To succeed with this niche technology, they created hybrid teams combining traditional developers with AI specialists, and established new roles specifically focused on the intersection of domain expertise and AI capabilities. The result wasn't just better technology, but a fundamental restructuring of how engineering teams operate and collaborate.
Several speakers shared their assessments of real-world AI implementations: what looks impressive in demos frequently disappoints in production. The AI Snake Oil team provided eye-opening examples of high-profile legal AI tools that regularly hallucinate and deliver incorrect information.
The gap between what AI can theoretically do versus what it can reliably do remains significant. "Capability does not mean reliability" was a point repeatedly emphasised by speakers throughout the summit, highlighting this crucial distinction. The team from Writer noted that in domain-specific tasks, models "have a propensity to give wrong answers and proceed to reply when they should refuse to answer."
This realisation is shifting the industry focus from pushing boundaries to ensuring dependability. The new mantra appears to be "Being an AI Engineer is being a reliability engineer," highlighting how fundamental this concern has become. Strong benchmark performance does not guarantee real-world success, as critical reviews of highly publicised tools have shown.
Douwe Kiela, CEO and Co-Founder of Contextual AI, emphasised that "the gap between pilot and production is always larger than you expect." This reality check was a recurring theme across presentations from companies with agents in production. Drawing from his experience deploying complex AI systems, Kiela noted that while it's relatively easy to build an impressive demo, the real challenge lies in productionising these systems for real-world use.
Speakers stressed the importance of designing for production from day one rather than treating it as an afterthought. Sierra's approach of viewing "every agent as a product" exemplifies this production-first mindset. Their team focuses on customer experience from the outset and has developed extensive QA processes, including reviewing every conversation to continuously improve their agents.
The challenges of production extend beyond technical considerations to include team structure and organisational approaches. Bloomberg's presentation highlighted the importance of flexible team structures during early AI agent development. They emphasised that whilst organisations are still learning what works, they should maintain adaptable team configurations and system architectures. Their practical advice included implementing certain capabilities—such as guardrails and safety checks—as shared services across all agents rather than building them individually for each use case. This approach demonstrates how production readiness requires thoughtful organisational design alongside technical implementation.
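A minimal sketch of that shared-service idea, assuming a simple in-process design; the `GuardrailService` class and the `no_pii` check below are our illustration, not Bloomberg's actual implementation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GuardrailResult:
    passed: bool
    reason: str = ""

# A guardrail is any callable that inspects an agent's output.
Guardrail = Callable[[str], GuardrailResult]

def no_pii(output: str) -> GuardrailResult:
    # Illustrative check: flag outputs containing long digit runs that
    # could be account numbers.
    if any(tok.isdigit() and len(tok) >= 8 for tok in output.split()):
        return GuardrailResult(False, "possible account number in output")
    return GuardrailResult(True)

class GuardrailService:
    """Written once and shared by every agent, not rebuilt per use case."""

    def __init__(self, checks: list[Guardrail]):
        self.checks = checks

    def validate(self, output: str) -> GuardrailResult:
        for check in self.checks:
            result = check(output)
            if not result.passed:
                return result
        return GuardrailResult(True)

# All agents call the same instance, so safety logic stays consistent.
guardrails = GuardrailService(checks=[no_pii])
print(guardrails.validate("Transfer complete for account 12345678"))
```

The design choice is the point: new agents get guardrails for free by depending on the shared service, rather than each team re-implementing checks.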
As inference costs remain significant at scale, optimisation strategies like fine-tuning and distillation are becoming critical. Method and OpenPipe explained how they tackle a complex challenge: processing unstructured financial data from hundreds of sources into actionable information for banks. Before AI, this required offshore contractors manually calling banks and gathering information, an inefficient process prone to errors.
Method's journey provided a cautionary tale — their first month with GPT-4 agents resulted in a $70K bill, forcing them to rapidly rethink their approach. Their partnership with OpenPipe led to an innovative solution: distilling knowledge from expensive GPT-4 and o3-mini models into a much smaller Llama 3.1 8B model. This approach dramatically reduced costs while maintaining acceptable quality thresholds.
The "inference trifecta" of quality, cost, and latency requires careful balancing. OpenPipe's approach of measuring error rates per model against business-specific thresholds offers a practical framework for making these trade-offs. Their process involves consulting directly with business stakeholders to establish acceptable performance levels for each metric based on the specific use case, rather than applying arbitrary technical standards.
The results demonstrated how economic constraints drive innovation. The Method team achieved a remarkable scale — 500M agents with just two engineers. Their advice was pragmatic: "Don't buy your own GPUs" and "fine-tuning should be a last resort, to be explored when other options with off-the-shelf models fail."
Evaluation frameworks have evolved from nice-to-have to business-critical. As emphasised by both OpenAI and Anthropic, "Evals are your company's competitive advantage" in a market where everyone has access to similar models. Anthropic recommended designing "tests aligned with real-world use cases and validating with user-centric metrics."
Bloomberg's philosophy of building "antifragile systems" with robust guardrails reflects a growing consensus. Their approach assumes agent outputs will be wrong and implements verification at each step. For Bloomberg, certain qualities are non-negotiable: "precision, comprehensiveness, and speed." Their guardrails check agentic steps after each execution, rather than trying to prevent errors through increasingly complex prompts.
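A minimal sketch of that check-after-execution pattern, complementing the shared-service idea above; `run_step` and `verify_step` stand in for whatever executes and verifies a step, and the retry logic is our illustration rather than Bloomberg's implementation:

```python
MAX_RETRIES = 2

def run_with_verification(steps, run_step, verify_step):
    # Assume outputs will be wrong: verify each step after execution and
    # retry with feedback, instead of hardening prompts up front.
    results = []
    for step in steps:
        reason = "unknown"
        for _ in range(MAX_RETRIES + 1):
            output = run_step(step)
            ok, reason = verify_step(step, output)
            if ok:
                results.append(output)
                break
            step = {**step, "feedback": reason}  # surface the failure to the agent
        else:
            raise RuntimeError(f"step failed verification: {reason}")
    return results

# Toy usage: the step only "succeeds" once feedback has been attached.
steps = [{"task": "extract totals"}]
run = lambda s: "ok" if "feedback" in s else "bad"
verify = lambda s, out: (out == "ok", "output did not parse")
print(run_with_verification(steps, run, verify))  # ['ok']
```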
Observability emerged as a cornerstone of trustworthy AI systems. LinkedIn demonstrated this by investing significantly in OpenTelemetry-based monitoring infrastructure. This investment allows their teams to track agent behaviour in detail, replay historical data, and rerun agentic workflows when issues arise. Wang emphasised that this observability layer isn't merely for debugging: it's fundamental to building trust in AI systems by creating transparency and accountability. When agents make decisions, their actions can be traced, understood, and corrected, which is essential for maintaining stakeholder confidence in high-stakes enterprise environments.
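A minimal sketch of span-per-step tracing using the OpenTelemetry Python API (the `opentelemetry-api` package); LinkedIn's actual infrastructure is far more extensive, and the span and attribute names here are our own:

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent-runtime")

def call_llm(prompt: str) -> str:
    # Placeholder for the real model call.
    return "..."

def plan_and_act(task: str) -> str:
    # Each agent step becomes a span, so a run can be traced, replayed,
    # and debugged after the fact.
    with tracer.start_as_current_span("agent.run") as run_span:
        run_span.set_attribute("agent.task", task)
        with tracer.start_as_current_span("agent.step.plan") as span:
            plan = call_llm(f"Plan the steps for: {task}")
            span.set_attribute("agent.plan", plan)
        with tracer.start_as_current_span("agent.step.act") as span:
            result = call_llm(f"Execute this plan: {plan}")
            span.set_attribute("agent.result", result)
        return result
```

Without an exporter configured these spans are no-ops, which makes the instrumentation cheap to add early and wire up to a backend later.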
A surprising theme across multiple talks was that UX matters more than model choice. Kiela emphasised that "Better LLMs are not the answer" when system design and user experience are suboptimal. From Contextual AI's experience deploying enterprise AI systems, he explained that while many organisations focus on having the most advanced language model, greater success often comes from improving how the system retrieves relevant information before passing it to the model, advocating for a "think systems, not just models" approach to AI development.
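To make the "systems, not models" point concrete, here is a toy retrieve-then-generate sketch; the lexical retriever is deliberately simple, and real systems would use embeddings and reranking, but the structure shows where the leverage sits:

```python
import re
from dataclasses import dataclass

@dataclass
class Passage:
    text: str

def tokens(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question: str, corpus: list[Passage], k: int = 3) -> list[Passage]:
    # Toy lexical retrieval: rank passages by word overlap with the question.
    q = tokens(question)
    scored = sorted(corpus, key=lambda p: -len(q & tokens(p.text)))
    return scored[:k]

def build_prompt(question: str, passages: list[Passage]) -> str:
    context = "\n\n".join(p.text for p in passages)
    return ("Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")

# The model call itself is the easy part; answer quality hinges on what
# lands in `context` before any LLM is involved.
corpus = [Passage("Q3 revenue rose 12% year on year."),
          Passage("The office dog is called Biscuit.")]
question = "What happened to revenue?"
print(build_prompt(question, retrieve(question, corpus, k=1)))
```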
BrightWave, a company specialising in AI systems for financial analysis, warned of the "latency trap", the false assumption that longer processing times yield better results. Mike Conover, BrightWave's co-founder and CEO, explained their solution: giving users visibility into the process and control over the depth of analysis. As he put it, "Even if you wait longer, you are not guaranteed you're going to get a better answer." His advice for their financial analysis tools: "Give the user the ability to drill down on any specific part of the AI-generated financial report rather than waiting for a perfect comprehensive analysis."
The industry is rapidly moving beyond chat experiences toward multimodal interfaces. Google's Deep Research demonstrates this shift with its combination of research plans, real-time website browsing visibility, and comprehensive sourcing—transforming the user experience from passive waiting to active engagement. Grace Isford, a Partner at Lux Capital (a venture capital firm investing in emerging technologies), emphasised the need for "giving AI eyes, ears, voice, nose, and touch" to move beyond text-only interactions.
The narrative is evolving from AI replacing humans to AI augmenting human capabilities. OpenAI's presentation described "co-innovators" as the next paradigm: agents that pair autonomous capability with human creativity. This vision brings together "agents + creativity (human-AI collaboration)", moving beyond pure automation.
Domain expertise paired with AI consistently produces superior outcomes. Jane Street's success with OCaml code generation, despite limited training data, exemplifies how human knowledge can guide AI to excel in specialised domains. To overcome the lack of public OCaml code examples, they developed an innovative data collection approach: their system automatically captured snapshots of developers' code editors every 20 seconds, along with information about whether the code successfully compiled. This created a rich dataset of both working code and errors with their fixes, enabling their AI to learn from real programming patterns and workflows.
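A rough sketch of what such a collector could look like, assuming a local workspace and OCaml's standard `dune` build tool; this is our illustration of the idea, not Jane Street's actual tooling:

```python
import shutil
import subprocess
import time
from pathlib import Path

SNAPSHOT_INTERVAL_S = 20          # the cadence described in the talk
WORKSPACE = Path("workspace")     # the directory being edited (hypothetical)
ARCHIVE = Path("snapshots")
BUILD_CMD = ["dune", "build"]     # dune is OCaml's standard build tool

def snapshot_once(i: int) -> None:
    dest = ARCHIVE / f"snap_{i:06d}"
    shutil.copytree(WORKSPACE, dest)
    # Pair each snapshot with its compile status, so the dataset contains
    # both working code and errors alongside their eventual fixes.
    result = subprocess.run(BUILD_CMD, cwd=dest, capture_output=True)
    (dest / "COMPILES").write_text(str(result.returncode == 0))

def collect(n_snapshots: int) -> None:
    ARCHIVE.mkdir(exist_ok=True)
    for i in range(n_snapshots):
        snapshot_once(i)
        time.sleep(SNAPSHOT_INTERVAL_S)
```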
Gen Z users, who comprise 70% of GenAI users according to Google's research, are driving this collaborative approach. They "want to co-create with AI" rather than have AI do everything, suggesting a generational shift in how AI is perceived and used. Google's team emphasised that "agents should be transparent about their limitations" to build trust with these users who expect collaboration rather than replacement.
Voice interfaces present both opportunities and challenges. SuperDial's small team demonstrated practical voice agent implementation by using multimodal approaches (combining text and audio) rather than waiting for perfect voice-to-voice models. Their success hinged on addressing voice-specific concerns like latency, pronunciation rules, and having robust fallback options. Their practical advice: "Don't build from scratch. Leverage existing tech and track latency everywhere."
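A minimal sketch of that text-in-the-middle pipeline with latency tracked at every stage; `transcribe`, `respond`, and `synthesise` are hypothetical stand-ins for real speech-to-text, LLM, and text-to-speech calls:

```python
import time

def timed(name, fn, *args, timings):
    # Track latency everywhere: record how long each stage takes.
    start = time.perf_counter()
    out = fn(*args)
    timings[name] = time.perf_counter() - start
    return out

# Hypothetical stand-ins for speech-to-text, LLM, and text-to-speech.
def transcribe(audio: bytes) -> str: return "caller said something"
def respond(text: str) -> str: return "agent reply"
def synthesise(text: str) -> bytes: return b"audio"

def handle_turn(audio: bytes) -> bytes:
    timings: dict[str, float] = {}
    text = timed("stt", transcribe, audio, timings=timings)
    reply = timed("llm", respond, text, timings=timings)
    speech = timed("tts", synthesise, reply, timings=timings)
    print(timings)  # e.g. {'stt': ..., 'llm': ..., 'tts': ...}
    return speech

handle_turn(b"raw caller audio")
```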
Reinforcement Learning (RL) is transforming agent capabilities in unexpected ways. Will Brown, an ML Researcher at Morgan Stanley, introduced "rubric engineering"—a novel approach where developers create structured evaluation frameworks (rubrics) that can be used to train and improve AI agents through reinforcement learning. Unlike traditional prompt engineering that focuses on crafting perfect instructions, rubric engineering defines clear success criteria that agents can be optimised against over time. DeepSeek's results demonstrated how this RL-based approach enables more complex reasoning, with "long chains of thought" emerging naturally through this process.
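A toy example of a rubric expressed as a reward function, in the spirit of the approach Brown described: format criteria and a correctness criterion each contribute to the score an RL algorithm would optimise against. The tag conventions and weights here are illustrative, not a real training setup:

```python
import re

def rubric_reward(response: str, expected_answer: str) -> float:
    reward = 0.0
    # Format criterion: the model shows its reasoning in <think> tags.
    if re.search(r"<think>.*?</think>", response, re.DOTALL):
        reward += 0.25
    # Format criterion: exactly one final answer inside <answer> tags.
    answers = re.findall(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if len(answers) == 1:
        reward += 0.25
    # Correctness criterion: the extracted answer matches the target.
    if answers and answers[0].strip() == expected_answer.strip():
        reward += 1.0
    return reward

print(rubric_reward("<think>2+2=4</think><answer>4</answer>", "4"))  # 1.5
```

Because each criterion is machine-checkable, the rubric scales to millions of rollouts in a way that human grading cannot.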
Personal agents raise important questions about privacy and control. Meta's PyTorch team advocated for keeping these agents local rather than cloud-based, especially as they gain more powerful capabilities like email management. While technical hurdles remain with open-source voice models, they emphasised that local AI is increasingly viable as open models rapidly improve. The key advantage of local agents is their "powerful action space" combined with enhanced "privacy, security, and control."
The 2025 AI Engineer Summit highlighted the significant transformation underway in AI development. The consistent message across presentations was clear: "Start simple, focus on reliability over capability, and design for production from day one."
Many speakers noted that while model capabilities are improving rapidly, the most successful AI implementations aren't just about using cutting-edge technology. Throughout the summit, real-world case studies demonstrated that combining domain expertise with AI leads to superior outcomes. This highlights a clear shift in thinking: the future belongs to "co-innovators" rather than mere automation tools, moving us from replacement to collaboration.
The path forward requires balancing innovation with practicality. For enterprise-grade AI systems, precision, comprehensiveness, and speed remain non-negotiable. Equally important are robust evaluation frameworks: as emphasised throughout the summit, "Evals are your company's competitive advantage" in a landscape where everyone has access to the same models.
At Kiseki Labs, we're excited to be part of this journey. Whether you're just starting your AI transformation or looking to scale existing implementations, we're here to help you navigate this rapidly evolving landscape. Book a free AI consultation with us to explore how we can help you build your AI-powered future.