
“Building an AI agent is easy. Evaluating it, that’s another beast.”
François Zaninotto, Forum PHP 2025, “The secret recipe of not-too-stupid AI agents”

AI generated image illustrating why metrics matter

I work at ekino, and lately a big chunk of my team’s conversations have been about AI, not just how to build with it, but how to know if what we build actually works. When I attended Forum PHP 2025 last October, François Zaninotto from Marmelab said something that stuck with me: 90% of building an agent is measuring its outputs, not coding it. So I went down the rabbit hole, and this is what I found.

Classic code is deterministic: you feed input, you check output. But with LLMs (Large Language Models), outputs are probabilistic and unpredictable. The same prompt can yield ten valid answers, or ten wrong ones.

If you want to push an AI from “cool toy” to “reliable feature”, you need robust, repeatable, meaningful metrics. Because as Zaninotto put it: 90% of building an agent is measuring its outputs, not coding it.

In this article, we’ll talk about what that means for the next AI-powered feature we throw into production.

What we used to rely on: classical NLP metrics

Before LLMs took over, NLP (Natural Language Processing) tasks like translation, summarization or simple Q&A used well-known metrics. These still exist, but quickly show their limits.

BLEU: word-by-word comparison

BLEU (Bilingual Evaluation Understudy) checks how many n-grams (single words, pairs, triplets…) of the generated text match a reference text.
It works well when you expect a specific answer, but not as well when many acceptable answers exist.

Example: “Paris is the capital of France.” vs. “France’s capital city is Paris.”
Near-zero BLEU, even though both are correct.
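To make this concrete, here’s a from-scratch sketch of BLEU’s core idea (modified n-gram precision plus a brevity penalty) — a simplified illustration, not the full algorithm from the original paper:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    # how many candidate n-grams also appear in the reference (clipped counts)
    ref_counts = Counter(ngrams(reference, n))
    cand_counts = Counter(ngrams(candidate, n))
    overlap = sum(min(count, ref_counts[g]) for g, count in cand_counts.items())
    return overlap / max(sum(cand_counts.values()), 1)

def bleu(reference, candidate, max_n=2):
    # geometric mean of n-gram precisions; collapses to 0 if any precision is 0
    precisions = [modified_precision(reference, candidate, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    brevity_penalty = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "paris is the capital of france".split()
cand = "france 's capital city is paris".split()
print(bleu(ref, cand))  # 0.0 — no bigram overlap, despite both being correct
```

Even with five of six words in common, the two sentences share no bigrams, so the score collapses to zero — exactly the weakness described above.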

ROUGE: recall-oriented for summaries

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is used mostly for summarization tasks. It measures how much of the reference is “covered” by the summary (common n-grams, longest common subsequence). More flexible than BLEU, but still very rigid.

METEOR: smarter matching

METEOR (Metric for Evaluation of Translation with Explicit ORdering) allows synonyms, morphological variants, small reorderings. More forgiving than BLEU/ROUGE, but still rooted in literal similarity.

If you want to read more about these metrics, I recommend Avinash’s article, where he explains them in detail.

The main issue is these metrics need a “ground truth”, a reference text. They don’t cope well with open-ended generation, creative responses, or varied valid outputs.

For most modern AI-agent tasks, they are necessary but not sufficient!

What changed with LLMs: new dimensions of evaluation

LLMs killed the “correct vs incorrect” mindset. What matters now is whether the output is actually useful, safe, efficient, and stable.

Here are the main dimensions of evaluation you need when working with generative AI:

  • Text quality: clarity, fluency, readable output.
  • Factual correctness: avoids hallucinations and invented facts.
  • Structure: valid JSON, correct schema, required fields.
  • Reasoning: coherent logic, no contradictions.
  • Safety & bias: prevents harmful, toxic, or biased outputs.
  • Performance & cost: latency, token usage, stability under load.

These dimensions map to metrics, not all quantifiable in the same way — some requiring heuristics, others external validation, others a hybrid human/automated judgement.

Modern AI evaluation: how people actually do it

Here are the three main approaches you’ll find in production or serious projects.

AI generated image illustrating how people evaluate AI

1. Judge model, an AI that evaluates other AI

You feed a “judge” LLM: the prompt, the candidate answer, maybe a reference, and a checklist of criteria. It returns a structured evaluation: a score, a pass/fail, or a graded rubric (fluency, correctness, tone, etc.).

This works well for:

  • factual tasks (is it true or has the LLM hallucinated?)
  • style / clarity evaluation
  • multi-criteria grading

Judging with an LLM can give surprisingly good results, often better than simple similarity metrics when dealing with open-ended answers. Some LLM providers, such as Google and Amazon Bedrock, even publish their own metrics for evaluating judge models, so don’t hesitate to take a look at your favorite provider’s documentation if you’re searching for some.
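The judge pattern fits in a few lines. Here’s a minimal sketch, assuming a hypothetical `call_llm` helper (any function wrapping your provider’s API that takes a prompt string and returns the model’s text):

```python
import json

# Ask the judge for strict JSON so the result is machine-readable
JUDGE_PROMPT = """You are an evaluation judge. Rate the answer below.
Question: {question}
Answer: {answer}
Return strict JSON: {{"correctness": 1-5, "clarity": 1-5, "verdict": "pass" or "fail"}}"""

def judge(question, answer, call_llm):
    # call_llm: str -> str, wrapping whichever LLM provider you use (hypothetical)
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return json.loads(raw)

# Example with a stubbed judge standing in for a real API call
fake_llm = lambda prompt: '{"correctness": 5, "clarity": 4, "verdict": "pass"}'
result = judge("What is the capital of France?", "Paris.", fake_llm)
print(result["verdict"])  # pass
```

In production you’d also validate the judge’s JSON (see the structural metrics below in spirit) and retry on malformed output, since the judge is itself an LLM.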

2. Embedding similarity (semantic distance)

Encode both reference and candidate (or candidate and context) into embeddings, compare via cosine similarity or distance. Useful for:

  • relevance checks
  • semantic consistency
  • detecting hallucinations or incoherence

Less strict than BLEU, embedding similarity adapts to paraphrases. It doesn’t guarantee factual truth, but it’s a good “sense check”. If the topic interests you, I recommend reading this article about how embeddings work.
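The comparison itself is just cosine similarity. Here’s a minimal sketch; the tiny vectors below are made up for illustration, where a real setup would get them from an embedding model:

```python
import math

def cosine_similarity(a, b):
    # cosine of the angle between two vectors: 1.0 = same direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# toy 4-dimensional "embeddings" (real ones have hundreds of dimensions)
reference = [0.9, 0.1, 0.3, 0.0]
paraphrase = [0.8, 0.2, 0.4, 0.1]
off_topic = [0.0, 0.9, 0.0, 0.8]

print(cosine_similarity(reference, paraphrase))  # close to 1.0
print(cosine_similarity(reference, off_topic))   # much lower
```

A paraphrase of the reference lands close to it in embedding space even with no words in common — precisely where BLEU fails.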

3. Structural & validation metrics

For outputs consumed by machines, the first line of evaluation is structural integrity: valid JSON, schema conformity, required fields, and correct types. These metrics don’t assess usefulness or truth — they ensure the output can be parsed, routed, and executed safely by downstream systems.

This is especially critical for tool calling, JSON-RPC, or MCP servers, where a single malformed payload can break the entire integration.
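A structural check can be as simple as “does it parse, and does it have the right shape?”. Here’s a minimal sketch for a hypothetical tool-call payload (the `action`/`arguments` schema is an assumption for illustration):

```python
import json

# hypothetical schema for a tool-call payload: field name -> expected type
REQUIRED_FIELDS = {"action": str, "arguments": dict}

def validate_tool_call(raw: str):
    """Return (ok, reason): parses a payload and checks required fields and types."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as e:
        return False, f"invalid JSON: {e}"
    if not isinstance(payload, dict):
        return False, "payload must be a JSON object"
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in payload:
            return False, f"missing field: {field}"
        if not isinstance(payload[field], expected_type):
            return False, f"wrong type for field: {field}"
    return True, "ok"

print(validate_tool_call('{"action": "search", "arguments": {"q": "paris"}}'))
print(validate_tool_call('not even json'))
```

For anything beyond a toy, a JSON Schema validator does this declaratively, but the principle is the same: reject malformed payloads before they reach downstream systems.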

Tools & Libraries you can actually use

A lot of evaluation tooling has matured since LLMs went mainstream. Here are the main ones you can pick depending on your stack.

DeepEval (Python)

Probably the most widely used evaluation framework for LLMs today.

Some features include:

  • Judge-model evaluation
  • Embedding-based metrics
  • Safety / toxicity / bias detection
  • Configurable scoring
  • Integrates with pipelines or CI/CD

Example (Python):

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Wrap the prompt and the model's answer in a test case
test_case = LLMTestCase(
    input="Explain quantum computing simply",
    actual_output=generated_text,
)

metric = AnswerRelevancyMetric()
metric.measure(test_case)

print(metric.score)

Great if you run a backend in Python (microservice, worker, evaluation pipeline).

Ragas, for RAG / retrieval-augmented generation

If your agent returns answers built from retrieved documents, Ragas adds metrics like:

  • faithfulness (did the answer stick to source docs)
  • hallucination rate
  • context relevancy
  • precision / recall of cited info

It’s a go-to for chatbots, knowledge-base assistants, etc.
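To make “faithfulness” concrete, here’s a toy heuristic — not Ragas itself, which uses an LLM to check claim-by-claim support — that scores what fraction of the answer’s words appear in the retrieved context:

```python
def toy_faithfulness(answer: str, context: str) -> float:
    """Toy heuristic: fraction of answer words that also appear in the source context.
    Real frameworks like Ragas verify individual claims with an LLM instead."""
    answer_words = set(answer.lower().split())
    context_words = set(context.lower().split())
    if not answer_words:
        return 0.0
    return len(answer_words & context_words) / len(answer_words)

print(toy_faithfulness("paris is big", "paris is a big city"))     # fully supported
print(toy_faithfulness("paris invented jazz", "paris is a city"))  # partly unsupported
```

Word overlap is crude (a paraphrased but faithful answer scores low), which is exactly why RAG-specific tooling leans on LLM judges for this dimension.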

LangSmith (via LangChain ecosystem)

Useful when you orchestrate many prompts and models.

What it gives you:

  • detailed run logging
  • prompt & model versioning
  • analytics / dashboards
  • ability to compare models & prompts over time

Good when you want to track “what changed between v1 & v2” or “which prompt performs best for this use-case”.

For PHP: yes, you can stay in PHP land!

Even though many tools target Python, there are some options for PHP:

  • php-llm-evaluation implements BLEU / ROUGE / METEOR + custom string-based comparisons. Useful for simple tasks, or generating baseline metrics from PHP.

And if your backend is PHP (Symfony, Laravel, etc.), you can build a custom evaluation pipeline:

  • store runs in DB (prompt, response, metadata),
  • call a judge LLM via API,
  • compute metrics (structural, cost, length, etc.),
  • log everything for later analysis.

Given PHP’s presence in web backends, this path keeps you in your standard stack, no need to spin up a Python worker if you don’t want to.

How to integrate metrics into your AI project (workflow + best practices)

AI generated image illustrating how to integrate metrics into a project

If you build an AI-powered feature (chatbot, assistant, content generator…), here’s an evaluation-oriented workflow you should adopt from day one:

  1. Define your success criteria: what does “good” mean for your feature? (factuality, style, speed, cost…)
  2. Write evaluation specs: list your metrics (text quality, structure, safety, cost, etc.).
  3. Automate evaluation: use judge models, embedding checks, structural validation, cost tracking.
  4. Log everything: prompt, model, config, output, metrics, cost, timestamp.
  5. Version & compare: prompts/parameters evolve, keep results to compare over time.
  6. Integrate in CI/deployment pipeline: treat metric regressions like test failures.
  7. Iterate: adjust prompts, switch models, tune parameters, re-evaluate.

In short: treat LLM output like you would treat software: test it, version it, monitor it.

Building an AI agent is only half the job.
Making it reliable, safe, efficient, maintainable, that’s where metrics step in.

Ignoring metrics comes at a real cost. Without proper evaluation and tracking, you risk shipping unreliable or hallucinating agents, dealing with unstable behavior every time the model updates, and paying for hidden token usage or latency overhead.

You also lose visibility on regressions caused by prompt changes, and you may unknowingly expose users to biased or toxic outputs. For instance, in May 2025 xAI’s chatbot Grok unexpectedly began generating unrelated “white genocide” themed responses after a prompt update; and in a 2025 lawsuit, the family of a 16-year-old alleges that prolonged interactions with ChatGPT contributed to his suicide because the chatbot reinforced harmful ideation. So metrics aren’t optional: for any production-level AI feature, they’re a requirement.

As François Zaninotto said: the secret sauce isn’t just in clever prompts or fancy models, it’s in how you measure and monitor what your agent does.

If you treat your AI outputs like logs or test results, you give yourself a shot at building something robust, not just a “cool experiment”.

Before that talk, metrics were kind of an afterthought in how I thought about AI features. Now they feel like they are required. If you’re shipping anything with an LLM in it, even something small, I’d really encourage you to think about evaluation before you think about prompts. It’ll save you a lot of pain later!


Measuring Intelligence: Why AI Metrics Matter was originally published in ekino-france on Medium, where people are continuing the conversation by highlighting and responding to this story.