A well-built psychometric assessment tells you who the person is. It doesn’t tell you how good they are at what the role demands.
That sentence is uncomfortable for someone who works in People Science. We spend our careers demonstrating the predictive validity of assessments, defending OCEAN against skeptics, calibrating models. But there’s a point where the assessment, on its own, hits a ceiling. And that ceiling is where multi-evaluator competency assessment comes in.
This post is about where that ceiling is, why it exists, and how the two layers combine in a scoring system that respects the evidence.
The conceptual difference: trait vs observed behavior
A psychometric assessment measures dispositional traits: stable tendencies to behave in certain ways. Someone with high Conscientiousness will tend to be organized, planful, reliable.
A competency assessment measures observed behavior in a specific domain: how well this person does X concrete thing, judged by people who have seen them do it.
Both are useful, but they measure different things. Confusing them is one of the most common mistakes in HR.
Schmidt and Hunter (1998), in their meta-analysis of 85 years of selection research, found that cognitive ability tests (GMA) have a predictive validity of r=.51 against job performance, and personality tests — specifically Conscientiousness — around r=.31. But structured interviews reach r=.51, and work sample tests reach r=.54. Translation: assessments that get closer to the concrete behavior of the role predict at least as well as tests, and sometimes better.
The operational conclusion: if you can combine both, do so. If you have to pick, it depends on the role.
The two modes of the competencies module
The platform handles competency assessment in two modes. The split isn’t arbitrary — it reflects the distinction above.
Soft mode: inferred competencies
Some competencies — communication, adaptability, customer orientation, conflict management — have a stable relationship with OCEAN traits and configured values. Not perfect, but robust enough to infer a score.
The engine maps, through a calibrated table, which combination of OCEAN traits + values adherence predicts each soft competency. The result is a 0-100 score computed automatically from the candidate’s assessment, with no human evaluators required.
This is useful for:
- Generalist roles where the assessment is predictive enough
- Early funnel stages where you can’t afford the cost of human evaluators
- Intrinsically dispositional competencies (curiosity, resilience, empathy)
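To make the soft-mode mapping concrete, here is a minimal sketch of how a lookup table could turn assessment results into a 0-100 competency score. The weights, the trait names (including `emotional_stability` as a reversed Neuroticism score) and the function name are illustrative assumptions, not the platform's calibrated table or actual implementation.

```python
# Hypothetical mapping table: these weights are illustrative assumptions,
# not the platform's calibrated values. All inputs are on a 0-100 scale.
SOFT_COMPETENCY_WEIGHTS = {
    "adaptability": {"openness": 50, "emotional_stability": 30, "values_adherence": 20},
    "customer_orientation": {"agreeableness": 50, "extraversion": 20, "values_adherence": 30},
}


def infer_soft_score(profile: dict[str, float], competency: str) -> float:
    """Return a 0-100 score as the weighted average of the mapped inputs."""
    weights = SOFT_COMPETENCY_WEIGHTS[competency]
    weighted_sum = sum(profile[name] * weight for name, weight in weights.items())
    score = weighted_sum / sum(weights.values())
    # Clamp defensively in case an input falls outside the expected range.
    return round(max(0.0, min(100.0, score)), 1)


# Hypothetical candidate profile produced by the assessment.
candidate = {
    "openness": 80,
    "emotional_stability": 60,
    "agreeableness": 70,
    "extraversion": 50,
    "values_adherence": 70,
}
adaptability = infer_soft_score(candidate, "adaptability")
customer_orientation = infer_soft_score(candidate, "customer_orientation")
```

A real calibrated table would come from data relating assessment results to observed competency ratings. The point of the sketch is only that no human evaluator is involved in this mode.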
Technical mode: human-evaluated competencies
Other competencies — advanced SQL, enterprise contract negotiation, distributed-system architecture — can’t be inferred. You have to see them. For these, the module lets you invite human evaluators who assign a score and leave a qualitative note. The system computes the average.
This is useful for:
- Technical roles where the specific skill is critical
- Senior roles where concrete experience weighs more than disposition
- Domain-specific competencies where the assessment has no signal
How they combine: required_level by role
This is the piece that ties everything together. For each role, the company defines a set of required competencies with their respective required_level. For example:
| Competency | Mode | Required Level |
|---|---|---|
| Written communication | Soft | 70 |
| Teamwork | Soft | 65 |
| Advanced SQL | Technical | 80 |
| Data pipeline design | Technical | 75 |
| Adaptability | Soft | 60 |
The candidate gets a score for each competency. Soft scores come automatically from the assessment. Technical scores come from the evaluator average. The gap between score and required_level is the relevant metric: if the candidate gets 85 on advanced SQL and the required level is 80, there’s a positive margin. If they get 65, there’s a 15-point gap that gets reported to the hiring manager.
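The gap arithmetic can be sketched in a few lines. The required levels below are the ones from the table. The candidate scores and the function names are hypothetical, chosen so that Advanced SQL reproduces the 15-point gap in the example.

```python
# Required levels taken from the example role above.
REQUIRED_LEVELS = {
    "Written communication": 70,
    "Teamwork": 65,
    "Advanced SQL": 80,
    "Data pipeline design": 75,
    "Adaptability": 60,
}


def competency_gaps(scores: dict[str, float], required: dict[str, float]) -> dict[str, float]:
    """Score minus required_level per competency; negative means a shortfall."""
    return {name: scores[name] - level for name, level in required.items()}


def shortfalls(gaps: dict[str, float]) -> dict[str, float]:
    """Only the competencies below required_level, as positive point gaps."""
    return {name: -gap for name, gap in gaps.items() if gap < 0}


# Hypothetical candidate scores (soft from the assessment, technical from evaluators).
candidate_scores = {
    "Written communication": 74,
    "Teamwork": 65,
    "Advanced SQL": 65,
    "Data pipeline design": 78,
    "Adaptability": 55,
}
gaps = competency_gaps(candidate_scores, REQUIRED_LEVELS)
to_report = shortfalls(gaps)  # what the hiring manager would see flagged
```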
The candidate’s final fit is then composed of three layers:
- OCEAN fit against the role’s ideal profile
- Values fit against the company’s configured values
- Competency fit, weighted average of soft + technical, against the required_levels
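As a rough illustration of how the three layers might compose, the sketch below makes two assumptions that the post does not specify: competency fit is computed as capped attainment (score divided by required_level, with a ceiling of 1), and the layer weights are 30/20/50. Both are placeholders, and the function names are hypothetical.

```python
def competency_fit(
    scores: dict[str, float],
    required: dict[str, float],
    weights: dict[str, float],
) -> float:
    """Weighted average of attainment (score / required_level, capped at 1), on 0-100."""
    total_weight = sum(weights[name] for name in required)
    attained = sum(
        weights[name] * min(scores[name] / level, 1.0)
        for name, level in required.items()
    )
    return 100 * attained / total_weight


def final_fit(
    ocean_fit: float,
    values_fit: float,
    comp_fit: float,
    layer_weights: tuple[float, float, float] = (30, 20, 50),  # assumed, not the platform's
) -> float:
    """Weighted average of the three layers, each already on a 0-100 scale."""
    w_ocean, w_values, w_comp = layer_weights
    total = w_ocean + w_values + w_comp
    return (w_ocean * ocean_fit + w_values * values_fit + w_comp * comp_fit) / total


required = {"Advanced SQL": 80, "Written communication": 70}
scores = {"Advanced SQL": 64, "Written communication": 77}
weights = {"Advanced SQL": 1, "Written communication": 1}

comp = competency_fit(scores, required, weights)
overall = final_fit(ocean_fit=70, values_fit=80, comp_fit=comp)
```

Capping attainment at 1 is a design choice in this sketch: a surplus on one competency does not compensate for a gap on another.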
If you want to see the module in detail: /features/competency-evaluations.
Why multi-evaluator is non-negotiable for technical
When a single person evaluates a technical competency, you’re measuring two things: the candidate’s skill and the evaluator’s bias. And you can’t separate them.
The literature on how to combine judgments points the same way. Kuncel, Klieger, Connelly and Ones (2013), in a meta-analysis published in the Journal of Applied Psychology, showed that mechanical combination of assessment information (a fixed rule such as an average or an algorithm) predicts performance better than holistic judgment, where a person weighs the same information in their head. The effect held across hiring and admissions settings. Averaging independent evaluator scores is exactly that kind of mechanical rule.
The reason is basic statistics: individual noise cancels out when you average multiple independent measurements. Evaluator A’s bias is partially offset by evaluator B’s bias, provided the biases aren’t systematically correlated (which is why panel diversity matters).
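The averaging argument can be made explicit with the standard formula for the variance of a mean of equally correlated measurements. The sketch assumes every evaluator has the same error spread and the same pairwise error correlation, which is a simplification. The function name is ours.

```python
import math


def panel_error_sd(rater_sd: float, n_raters: int, error_correlation: float = 0.0) -> float:
    """SD of the error in a panel's average score.

    Assumes each rater's error has standard deviation `rater_sd` and every
    pair of raters shares the same error correlation (0 = fully independent).
    Var(mean) = sd^2 * (1 + (n - 1) * rho) / n
    """
    if n_raters < 1:
        raise ValueError("n_raters must be at least 1")
    if not 0.0 <= error_correlation <= 1.0:
        raise ValueError("error_correlation must be between 0 and 1")
    variance = rater_sd**2 * (1 + (n_raters - 1) * error_correlation) / n_raters
    return math.sqrt(variance)


single = panel_error_sd(10, 1)                       # one evaluator: full noise
independent = panel_error_sd(10, 4)                  # four independent evaluators
correlated = panel_error_sd(10, 4, 0.5)              # four evaluators sharing half their bias
identical = panel_error_sd(10, 4, 1.0)               # four evaluators with the same bias
```

With independent errors, four evaluators halve the noise of one. With fully shared bias, adding evaluators changes nothing, which is the statistical case for panel diversity.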
Highhouse (2008) pushed this further in “Stubborn Reliance on Intuition and Subjectivity in Employee Selection,” documenting how managers persist in trusting individual judgment despite consistent evidence that structured, multi-evaluator processes are superior.
The module implements this by:
- Letting each technical evaluation invite multiple evaluators
- Having each evaluator score independently, without seeing the others’ scores
- Computing the average and reporting the dispersion (high variance between evaluators is a signal to investigate)
- Pairing every score with a qualitative note, so the average doesn’t hide the reasoning
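A minimal sketch of that aggregation step, assuming a simple record per evaluator and a hypothetical dispersion threshold of 10 points (the post does not state what counts as high variance):

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class Evaluation:
    evaluator: str
    score: float  # 0-100, given without seeing other evaluators' scores
    note: str     # qualitative reasoning behind the score


def summarize_panel(evaluations: list[Evaluation], dispersion_threshold: float = 10.0) -> dict:
    """Average the panel's scores and flag high disagreement for review."""
    scores = [e.score for e in evaluations]
    # Sample standard deviation needs at least two scores.
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return {
        "mean": mean(scores),
        "stdev": spread,
        "needs_review": spread > dispersion_threshold,  # threshold is an assumption
        "notes": {e.evaluator: e.note for e in evaluations},
    }


aligned = summarize_panel([
    Evaluation("A", 82, "Solid window functions, clean joins."),
    Evaluation("B", 78, "Good, slower on query optimization."),
    Evaluation("C", 80, "Strong fundamentals."),
])
split = summarize_panel([
    Evaluation("A", 90, "Excellent on every exercise."),
    Evaluation("B", 60, "Struggled with indexing questions."),
    Evaluation("C", 75, "Mixed: strong modeling, weak tuning."),
])
```

The second panel has a reasonable average of 75 but a spread of 15 points, so it would be flagged. The notes are what let someone work out why the evaluators disagree.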
When to use soft, when technical, when both
This is the most frequent operational question. Short answer: it depends on how much the specific technical skill weighs in role success.
| Type of role | Soft | Technical | Why |
|---|---|---|---|
| Junior customer service | Sufficient | Optional | Assessment + values cover most of the variance |
| Generalist sales | Sufficient | Optional | Dispositions predict commercial performance well |
| Mid-level developer | Recommended | Critical | Concrete technical skill has to be seen, not inferred |
| Senior data engineer | Recommended | Critical | Same idea, with more technical weight |
| Manager with a team | Critical | Recommended | Soft skills matter a lot, but observed management competencies add value |
| Director / VP | Critical | Critical | Mandatory combination — the cost of a bad hire is very high |
The general rule we give clients: the more senior the role and the more technically specific the output, the more weight the multi-evaluator technical layer should carry. The more generalist and dispositional, the more the soft layer is enough.
What the combination enables: per-application insights
Beyond the final score, combining the two layers produces insights that neither one yields on its own:
- Specific gap detection. “This candidate has good OCEAN fit and aligned values, but their advanced SQL is 12 points below required. Recommendation: hire with an upskilling plan, or pass.”
- Over-qualification detection. “Scored 95 on a competency with a required level of 70. Will they get bored in the role?”
- Panel calibration. “Evaluator X consistently scores 15 points below the rest of the panel. Their calibration needs review.”
- Post-hire development plan. “The two lowest-scoring competencies become the focus of onboarding and the first 90 days.”
None of these insights come out of an assessment alone. And none come out of technical interviews alone without an assessment giving the dispositional baseline. The combination is what enables them.
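The panel calibration insight above, for example, can be approximated by comparing each evaluator's score with the average of the other panelists on the same candidate. This is a sketch with made-up data and a hypothetical 10-point flagging threshold, not a description of how the module computes it.

```python
from collections import defaultdict
from statistics import mean


def evaluator_offsets(panel_scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Average difference between each evaluator and the rest of the panel.

    `panel_scores` maps candidate -> {evaluator: score}. A negative offset
    means the evaluator tends to score below their co-panelists.
    """
    differences = defaultdict(list)
    for scores in panel_scores.values():
        if len(scores) < 2:
            continue  # no one to compare against
        for evaluator, score in scores.items():
            others = [s for e, s in scores.items() if e != evaluator]
            differences[evaluator].append(score - mean(others))
    return {evaluator: mean(diffs) for evaluator, diffs in differences.items()}


# Hypothetical scores for one technical competency across three candidates.
history = {
    "candidate_1": {"A": 80, "B": 82, "X": 66},
    "candidate_2": {"A": 70, "B": 72, "X": 56},
    "candidate_3": {"A": 90, "B": 88, "X": 74},
}
offsets = evaluator_offsets(history)
THRESHOLD = 10  # assumed cut-off for "needs calibration review"
to_review = [e for e, offset in offsets.items() if abs(offset) >= THRESHOLD]
```

A consistent offset is not proof that the evaluator is wrong. They may be the only one applying the rubric strictly, which is why the output is a prompt to review calibration rather than an automatic correction.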
The argument against “expert intuition”
There’s a recurring resistance to structured multi-evaluator processes, and it’s important to name it: “I’ve got 20 years of experience, I know in 10 minutes whether the person is right.” That’s probably the most expensive sentence in the industry.
The evidence is consistent: confidence in expert intuition correlates weakly with the actual accuracy of hiring decisions. Highhouse called it “stubborn reliance.” Multi-evaluator structured assessment doesn’t replace experience — it anchors it in a process where individual bias gets diluted and the reasoning becomes explicit and auditable.
If your organization has to defend a no-hire decision six months later, having three evaluations with qualitative notes is very different from having “the director felt they didn’t fit.”
How it fits with the rest
Competencies are one of the four layers of Talen.to’s scoring engine: OCEAN, Values, Competencies and Archetypes. The hub post has the full map. And if you want to understand how archetypes calibrated with real data connect to all of this, here’s the archetypes deep dive.
Implement this with us
If you’re hiring senior or technical roles with assessment only, or with interviews only, you’re leaving meaningful predictive variance on the table. We help you define the required_level per role, build evaluator panels, and connect the soft layer with the technical layer so the final fit score reflects both.
Book a 15-minute demo and I’ll show you the module running on a real role.
Questions? Email me at clara@talen.to.