Definition. AI in HR software refers to capabilities driven by machine learning models (predictive ML), large language models (generative AI) or autonomous reasoning systems (agentic AI), embedded within HR applications such as HRIS, talent acquisition, learning, performance or workforce analytics platforms. The category covers everything from a recommendation engine flagging at-risk employees, to a chatbot answering policy questions, to an agent autonomously screening candidates.
TL;DR
- Start with the problem, not the AI.
- Apply standard triage - hygiene, market standard, differentiating.
- Test with data, not demos.
- Govern AI on your side; vendors cannot give you governance.
- Contract every AI commitment before signature, not after.
What you’ll learn
| Topic | Key takeaway |
| --- | --- |
| Start with the problem | If you’re starting from ‘we need AI’, you’re already evaluating wrong. |
| Triaging AI capabilities | Most AI is market standard, not differentiating. Classify before vendors influence you. |
| Value-driven decision making | Without a value case, AI is unaffordable at scale. |
| Data tests | The single most useful evaluation instrument for AI. Design to neutralise vendor pushback. |
| AI governance | A buyer-side responsibility. Vendors can’t give you governance. |
| Capture commitments contractually | If it isn’t in the contract, it doesn’t exist. |
| Platform lock-in | AI deepens lock-in. Contract for the exit while you have leverage. |
| After go-live | Evaluation doesn’t stop at signature. Measure post-launch. |
On this page
- Start with the problem
- What AI in HR software means
- The vendor sales paradox
- Triaging AI capabilities
- The six capability areas
- Value-driven decision making
- Data tests
- POCs and pilots
- AI governance
- Employee buy-in and change
- Regulatory snapshot
- Capture commitments contractually
- Platform lock-in
- How this fits the method
- After go-live: measurement
- FAQ
Start with the problem, not the AI
The most common mistake I see buyers make in 2026 is the same mistake they were making in 2018, just with louder marketing: letting vendor capabilities define the solution. AI amplifies the problem because the technology is novel, the demos are compelling and the temptation to anchor a procurement around ‘we need AI’ is strong.
In my book I describe the three root causes of HR technology failure: the technology solves the wrong problem, the wrong vendor is chosen and the implementation is botched. An UNLEASH study found that 42% of HR tech implementations had failed or underperformed two years after installation¹, and PwC reports that 36% of buyers are likely to switch vendors at contract renewal². AI doesn’t change those patterns. If anything, it accelerates them. A vendor with an impressive AI demo and a thin product can take a buyer further off course, faster, than a vendor with no AI at all.
The right starting point is design thinking: empathising with users, defining the real problem, ideating solutions and only then asking whether AI has a useful role to play. Phase A of the selection process (‘Know What You Want’) is the first defence against AI-led procurement.
Once you’ve decided you need AI, that decision sits inside a problem you can measure success against. You no longer evaluate ‘AI capability’ in the abstract. You evaluate whether each vendor’s AI solves your problem better than the alternatives, including doing nothing.
What “AI in HR software” actually means
The term is used loosely. For evaluation purposes, it helps to distinguish three categories:
Predictive machine learning. Models trained on historical data to score, classify, or predict. Examples: attrition risk scoring, candidate match scoring, anomaly detection in payroll. Mature technology, well-understood evaluation methods.
Generative AI. Large language models producing text, summaries, or structured outputs. Examples: chatbot interfaces, automated job description generation, performance review summaries, conversational policy lookup. Rapidly evolving; evaluation methods less standardised.
Agentic AI. Systems that combine generative models with tools and the ability to take actions autonomously. Examples: an agent that screens candidates, schedules interviews and drafts rejection emails. The newest category; vendor offerings range from real to vapourware.
Autonomous action is qualitatively different from generative assistance. Agentic systems introduce challenges that don’t arise with summarisation or recommendation: orchestration complexity, auditability of chained decisions, prompt drift over time, tool-permission and security boundaries and failure containment when the agent acts on incorrect reasoning. In HR specifically, autonomous action creates exposure to employment law risk, discrimination claims, opaque accountability and reputational damage. I’d treat agentic AI claims with substantially more scrutiny than other AI categories, particularly any agent that makes or executes decisions affecting employment, pay or progression.
Each category has different risk profiles, different evaluation techniques and different regulatory implications. Most ‘AI’ labels in HR vendor marketing today cover predictive ML or generative AI; agentic AI claims warrant the most scrutiny.
You’ll also encounter two architectural patterns: embedded AI (the vendor’s own model, trained on their data, integrated natively into the product) and bolt-on AI (a third-party model, typically from OpenAI, Anthropic or Google, accessed via API). Neither is better in the abstract, but the questions you ask vary by pattern. A bolt-on solution means you’re also evaluating the vendor’s API supplier, indirectly.
The vendor sales paradox in the age of AI
In the preface of my book I quote an industry veteran: ‘If you understand nothing else when selecting software, understand that software vendors are incented to say "yes". Very few will flat-out lie to you; however, if there is any possibility of a "yes", you will get a "yes".’ That observation has aged well. Vendor sales teams are selected for their skill at reading buyers, not for domain expertise. Apparently, Salesforce discovered that their best salespeople come from the car sales world. I’ve sat through hundreds of vendor demos and witnessed many highly polished presentations. In the AI era, this dynamic intensifies.
‘AI-washing’ is the practice of presenting non-AI capabilities as AI, or basic AI as proprietary advanced AI. It takes several common forms:
- Rules-based automation rebadged as AI (‘AI-driven workflow’)
- Generic LLM wrappers presented as proprietary models
- Roadmap features demonstrated as if they were shipped product
- ‘AI-ready’ platforms with no AI features actually deployed
- Demos that rely on cherry-picked data the buyer can’t reproduce
The asymmetry is greater than with traditional features. With a workflow engine, you can usually tell from a demo whether the capability exists. With AI, the technology obscures rather than reveals: a confident answer from a chatbot doesn’t tell you whether the underlying model is good, accurate or even consistent.
The defences are the same as for any other capability, but applied more rigorously: insist on shipped product, demonstrated against your data, with commitments written into the contract.
Triaging AI capabilities: hygiene, market standard or differentiating?
Buyers routinely over-score AI as a category. They treat it as inherently differentiating when, in most cases, it isn’t. In my book I argue strongly for triaging requirements into three groups, each treated very differently during evaluation. The same discipline applies to AI capabilities.
‘Hygiene’ requirements are pass/fail. For AI specifically, hygiene includes: data residency and sovereignty, bias governance and audit, model risk classification, GDPR compliance for automated decision-making (Article 22), security of training and inference data and EU AI Act conformity for high-risk HR systems. Fail any of these and the vendor is eliminated regardless of other factors.
‘Market standard’ AI capabilities are what leading vendors routinely offer: intelligent search, summarisation, basic conversational interfaces, predictive scoring on common HR signals. Many first-generation AI features are commoditising at feature level, but execution quality, integration depth and operational usefulness still vary materially between vendors. Two vendors may both claim ‘AI summarisation’ while one is transformative and the other is barely usable. Confirm market-standard features exist, then assess execution quality without treating them as differentiators.
‘Differentiating’ AI capabilities are where scoring matters. These are the AI features that, for your specific problem and context, deliver materially different value across vendors. Examples vary by use case but might include: domain-specific model accuracy on your data, agentic workflow that genuinely removes a process step, integration of AI with your data lake or a unique training approach.
The discipline is in correctly classifying. Most vendor AI marketing positions market-standard features as differentiating. Do the classification yourself, before vendors get to influence you.
Evaluating AI across the six capability areas
In my book I describe six capability areas every HR tech evaluation should cover. AI sits inside this structure, not alongside it.
Functional. What problem does the AI solve, and does it solve it in your context? Functional evaluation for AI starts with use case fit. A model that achieves 92% accuracy in the vendor’s benchmark may collapse to 60% on your data. Ask: what is the AI’s job, and how will we know it’s done it?
User experience. How does AI surface in the flow of work? Is it explainable to the user? What is the human override path? An AI feature that doesn’t show users why it made a recommendation won’t be trusted, and untrusted features don’t get used.
Technical. What model is it: proprietary, fine-tuned or third-party? Where does inference happen? What customer data is used for training versus inference? What integration patterns are supported? For bolt-on AI, the technical evaluation extends to the API supplier’s terms.
Service delivery. How often is the model updated? Are you notified before changes? What happens when the AI gets it wrong, and what is the support escalation path? Model regressions are a real risk: the AI that worked for you in month one may behave differently in month nine.
Commercial. AI introduces pricing models that didn’t exist five years ago: token-based, query-based, agent-based, capacity-based. Understand the unit economics. A pilot priced at ‘free for the first 10,000 queries’ becomes a different conversation at scale.
Implementation. What does ‘go-live with AI’ actually mean? Many AI features require customer data to be useful. Plan for the data preparation, governance setup and bias monitoring required before the AI delivers value. Implementation readiness work should begin during Phase C, not after contract signature.
Value-driven decision making
Evaluation only matters if you can value what you’re evaluating. The right vendor is the one that drives the best return, not the one with the highest score on a generic capability matrix and not the one with the most impressive AI. Without a defensible value case, scoring becomes subjective and high-cost AI features get over-rewarded simply because they’re visible.
A value driver tree is the structuring instrument I recommend, mapping strategic objectives, benefits, value drivers, metrics and solution capabilities. Built properly in Phase A, it tells you which AI capabilities matter to your value case, what targets they need to hit and which vendors deliver them.
AI consumption pricing sharpens this discipline considerably. Token-based, query-based or agent-based pricing introduces variable costs that scale with adoption: precisely the scenario your value case has to anticipate. A vendor priced at ‘free for the first 10,000 queries’ looks rather different at 100,000. If your value driver tree doesn’t connect AI usage to business benefit, you can’t reason about whether consumption costs are justified. The flip side is that consumption metrics also make benefit attribution easier: if you can count queries, you can count value per query. Both halves of the ROI calculation become more measurable.
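To make the unit economics concrete, here is a minimal sketch of the calculation. Every number in it is an assumption for illustration (the free tier, the per-query rate, the adoption volume, the minutes saved), not any vendor’s actual pricing; the point is the structure of the calculation, not the figures.

```python
# Illustrative unit economics for query-priced AI. All numbers are
# assumptions for the sake of the arithmetic, not real vendor rates.

free_tier_queries = 10_000          # pilot allowance per month
price_per_query = 0.04              # assumed rate beyond the free tier
monthly_queries_at_scale = 100_000  # assumed adoption across the workforce

billable = max(0, monthly_queries_at_scale - free_tier_queries)
monthly_cost = billable * price_per_query  # 90,000 * 0.04 = 3,600

# The benefit side of the same calculation: if each query replaces a few
# minutes of HR or employee time, value per query is countable too.
minutes_saved_per_query = 2         # assumed
cost_per_minute = 0.80              # assumed fully loaded labour cost
monthly_value = monthly_queries_at_scale * minutes_saved_per_query * cost_per_minute

print(f"monthly cost at scale:  {monthly_cost:,.0f}")   # 3,600
print(f"monthly value estimate: {monthly_value:,.0f}")  # 160,000
```

Run the same arithmetic against your own adoption forecast before agreeing a consumption-priced contract; the value driver tree supplies the minutes-saved and labour-cost assumptions.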
The decisive technique: data tests
AI brings a problem that traditional software doesn’t: even its creators often can’t fully explain how it works. Anthropic’s own research on large language models (Tracing Thoughts in Language Models) acknowledges that the internal mechanisms remain stubbornly opaque even to the people who built them. For HR, where you might one day need to explain to a rejected candidate or a passed-over employee why the AI said no, that opacity matters. ‘The AI said no. We’re not entirely sure why, but we trust it’ isn’t a conversation that goes well.
Data tests are the most useful way around the black box: give each vendor the same known dataset, have them process it with their AI and compare the outputs against criteria defined in advance. Better than vendor demos. More revealing than RFP responses. More practical than POCs in early evaluation.
Designing a good data test:
- Use anonymised employee data from your own organisation where possible. Synthetic data is acceptable when real data can’t be shared, but it is a less faithful test.
- Use the same dataset across all vendors. Without identical inputs you can’t compare outputs; vendor-supplied data favours the vendor.
- Include edge cases deliberately. Unusual job titles, payroll exceptions, ambiguous policy questions, non-linear careers. Edge cases reveal model brittleness.
- Probe for bias. Include cases designed to surface bias across protected characteristics, where you can do so lawfully.
- Evaluate accuracy and explanation. Score not only whether the AI got it right, but whether it can tell you why.
- Run identical tests repeatedly. Generative AI can produce different outputs from identical inputs. For candidate scoring or policy interpretation, output instability is itself a governance issue.
Used well, data tests cut weeks out of evaluation and remove the influence of vendor demo polish.
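For teams that want to operationalise this, the sketch below shows the shape of a simple test harness: identical inputs for every vendor, repeated runs to surface output instability, and scoring against criteria agreed in advance. It is illustrative only; the vendor call is whatever client you wire to each vendor’s sandbox, and the scoring rule stands in for your own pre-agreed criteria.

```python
# Sketch of a data-test harness: same dataset for every vendor, repeated
# runs to surface output instability, scoring against pre-agreed criteria.
# Illustrative only: you supply call_vendor (a client for each vendor's
# sandbox) and your own scoring rule.

import statistics
from collections import defaultdict
from typing import Callable

def exact_match_score(output: str, expected: str) -> float:
    """Example scoring rule: 1.0 for an exact (case-insensitive) match."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def run_data_test(vendors: list[str],
                  cases: list[dict],
                  call_vendor: Callable[[str, dict], str],
                  runs: int = 5) -> dict:
    results = defaultdict(list)
    for vendor in vendors:
        for case in cases:  # identical inputs across all vendors
            outputs = [call_vendor(vendor, case) for _ in range(runs)]
            scores = [exact_match_score(o, case["expected"]) for o in outputs]
            results[vendor].append({
                "case": case["id"],
                "accuracy": statistics.mean(scores),
                # identical inputs producing different outputs is itself a finding
                "distinct_outputs": len(set(outputs)),
            })
    return dict(results)
```

Even this toy version enforces the two disciplines that matter most: every vendor sees the same cases, and every case runs more than once.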
Designing around vendor pushback
Vendors push back on data tests, and not always unreasonably. They’re protecting IP exposure, sales cycle timing and demo control. The trick is to design the test in a way that neutralises their concerns: narrow the scope (two scenarios, four hours of their time), use their own sandbox or pre-prod environment with synthetic or anonymised data, set the test as an RFP entry condition rather than a late ask and pre-clear data handling via NDA. Phase the depth as well: scripted demos for all vendors, data tests for the shortlist, POCs for the preferred vendor only. Each stage costs vendors more, so only the serious survive. Position the test as standard enterprise AI procurement practice, not a bespoke favour.
If a vendor refuses every form of data validation, that’s your test result. Treat it as a hygiene failure, not an inconvenience. Customer references who ran their own validation are a workable substitute when direct testing fails: less rigorous, but better than the demo alone.
Supporting evidence for validation, audit and testing
Buyer-side AI data tests are rarely public, because enterprise procurement is confidential by nature. But the case for systematic validation, audit and testing of AI in HR is well-supported by adjacent evidence across four categories:
- Regulatory requirement. NYC Local Law 144 mandates pre-deployment bias audits and candidate or employee notices for automated employment decision tools used in NYC hiring.
- Enforcement precedent. The EEOC settlement with iTutorGroup (2023) confirmed AI-driven hiring discrimination is actionable in the US. Mobley v. Workday, the first major private lawsuit alleging algorithmic bias in HR vendor software, signals that vendor liability for embedded AI is now in play.
- Internal case study. Amazon’s 2018 decision to scrap its recruiting AI followed internal bias testing: a documented example of validation revealing a fundamental problem.
- Procurement and risk guidance. The NIST AI Risk Management Framework, the GSA AI Buying Guide and the WEF AI Procurement Guidelines all endorse systematic testing in AI procurement.
None of these is a clean precedent for buyer-side data testing during selection. They support the underlying principle rather than the specific practice. The absence of named enterprise case studies is itself a finding: most buyers who run rigorous tests don’t publicise them.
POCs and pilots for immature AI features
For genuinely novel AI capabilities, agentic workflows in particular, data tests may not be enough. The next step is a proof of concept (POC) or pilot.
A POC is a cut-down version of the solution, with limited configuration and test data, running outside production. It lets buyers experience the AI hands-on with their own people and processes, without committing to deployment.
A pilot is a cut-down version of the production system, with real users, live data and some integrations. Pilots are typically run with one vendor only and follow vendor selection.
Both are time and resource intensive, and both carry a specific trap: POCs that drift into production without proper due diligence on hygiene requirements. I’ve seen this happen more than once. If you run a POC, run it deliberately, with success criteria, a clear end date and a decision rule that returns you to the formal selection process at the end.
AI governance: a buyer-side responsibility
Responsible AI isn’t something the vendor delivers. It’s something the buyer governs. Organisations using AI in HR should develop governance, policies and guardrails specific to HR applications, ideally before vendor selection.
At minimum, your AI governance should cover:
- Model risk classification. Which AI use cases are high-risk (recruiting decisions, performance management) versus lower-risk (intelligent search, summarisation)? Different risk tiers warrant different controls.
- Human-in-the-loop policy. For which AI outputs is human review required before action? Who is the human, and what is their training?
- Bias monitoring. How is bias measured in production, how often and who is accountable when it is found?
- Escalation paths. When the AI gets it wrong, where does the case go and how is it resolved?
- Employee opt-out and transparency. How are employees informed that AI is being used in decisions that affect them? Where applicable, how do they opt out?
- Model change management. How do you handle vendor-side model updates that may change AI behaviour mid-contract?
This list assumes an organisation with the capability to design and operate these controls. In practice, most HR functions, procurement teams and legal departments are still building their AI maturity. Acknowledging this honestly is part of buyer-side governance. Many organisations will need external support, whether through internal AI committees, external counsel or specialist advisors, to establish proportionate controls. The governance challenge is organisational as much as technical.
Governance work belongs in Phase A and Phase E of the SelectionWise method. Define it before procurement, and have it operational before the AI goes live. Vendors will help with conformance documentation but they can’t give you governance. That’s yours.
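As a concrete example of the bias-monitoring control above, many teams start with the selection-rate comparison behind the US ‘four-fifths rule’. The sketch below assumes you lawfully hold group labels and log each AI recommendation; the groups, outcomes and threshold are illustrative, and a low ratio is a trigger for investigation, not proof of discrimination on its own.

```python
# Minimal sketch of one production bias check: the adverse impact ratio
# behind the US "four-fifths rule". Assumes you lawfully hold group labels
# and log each AI recommendation; groups and outcomes are made up.

def selection_rate(decisions: list[bool]) -> float:
    """Share of positive outcomes in a list of booleans."""
    return sum(decisions) / len(decisions) if decisions else 0.0

def adverse_impact_ratio(decisions_by_group: dict[str, list[bool]]) -> float:
    """Lowest group selection rate divided by the highest.

    A ratio below 0.8 is the conventional trigger for investigation,
    not proof of discrimination on its own.
    """
    rates = {g: selection_rate(d) for g, d in decisions_by_group.items()}
    highest = max(rates.values())
    return min(rates.values()) / highest if highest else 0.0

# Example with made-up screening outcomes (True = advanced by the AI):
outcomes = {
    "group_a": [True] * 40 + [False] * 60,   # 40% selection rate
    "group_b": [True] * 28 + [False] * 72,   # 28% selection rate
}
print(round(adverse_impact_ratio(outcomes), 2))  # 0.7 -> below 0.8, investigate
```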
Employee buy-in, adoption and change management
AI adoption in HR is as much a change management challenge as a technology one. Many AI failures in HR won’t be technical failures. They’ll be buy-in failures, adoption failures, cultural failures or industrial relations failures. The risk concentrates in the use cases with the highest stakes for individual employees: performance management, internal mobility, workforce planning and recruitment scoring.
Stakeholder management here is broader than the buying team. Ask: can the AI explain its outputs in language an affected employee would understand and accept? What happens when an employee challenges an AI recommendation? Where there are unions or works councils, have they been engaged on the proposed use cases? In some jurisdictions, that engagement is a legal precondition, not good practice.
Perceived fairness matters as much as measured fairness. An AI tool that’s technically unbiased but feels opaque to employees will erode trust and harm adoption. Build employee transparency into the selection criteria, not as a compliance afterthought.
Regulatory snapshot
AI regulation affecting HR is one of the fastest-moving regulatory areas in technology. The named laws below are correct at the time of writing, but the picture changes quickly. Treat this as a snapshot, not a current legal position.
- EU AI Act. Many HR uses of AI, including recruiting, performance management and workforce allocation, are likely to be classified as high-risk under Annex III. Classification depends on use case: not all HR AI sits in the high-risk category, and some obligations fall on deployers rather than vendors. High-risk systems carry conformity, transparency and human oversight obligations. Implementation is phased through 2026 and detailed guidance continues to emerge.
- UK position. The UK’s AI-specific legislation is still taking shape. Existing UK GDPR and Equality Act obligations, ICO guidance on automated decision-making and EHRC guidance on AI and discrimination form the practical floor. UK organisations selling into the EU still face EU AI Act obligations.
- NYC Local Law 144. Requires bias audits and candidate notification for automated employment decision tools used in New York City hiring. Has set a precedent for similar laws in other US jurisdictions.
- GDPR Article 22. Restricts solely automated decision-making, including profiling, that produces legal or similarly significant effects. HR decisions frequently fall within scope. Buyers should plan for human oversight as a default.
- US state laws. Illinois (AI Video Interview Act, in effect since 2020), Colorado (the Colorado AI Act and proposed replacement legislation are moving quickly, with implementation dates and obligations still shifting), California (various) and others. New laws appear regularly. Legal sign-off on cross-border AI use is essential.
The detail changes faster than this page can be kept current. Confirm the position with specialist counsel before acting on any specific obligation.
Capture AI commitments contractually
I’ve written elsewhere about how much of what gets demonstrated and promised during a sales process is contractually invalid: presales information is normally deemed inadmissible, and vendors typically resist incorporating RFP responses as binding. For AI capabilities specifically, this is a particularly expensive gap.
AI-specific commitments to capture in the contract:
- Shipped capability definition. What AI features are shipped today, scored in evaluation and included in the price.
- Model versioning rights. Notification before material model changes, with right to test before activation.
- Performance commitments. Where the AI was scored on accuracy or bias metrics in evaluation, capture target performance and remedies for material regression.
- Data usage rights. Whether your data may be used for vendor model training (default should be no, with opt-in).
- Opt-out provisions. Right to disable AI features and revert to non-AI processing without penalty.
- AI roadmap commitments. Where roadmap features factored into the selection decision, get them written into the contract with delivery dates and remedies.
Don’t accept ‘we’ll send you a notice’ as a substitute for contractual commitments. AI moves quickly. Contracts last five years.
AI and platform lock-in
Modern HR platforms are no longer single applications. They’ve become data layer, workflow layer, AI layer and orchestration layer combined. AI deepens lock-in faster because adoption embeds the platform into daily operational behaviour in ways that previous SaaS lock-in didn’t.
Switching costs increase as the following accumulate inside a platform: configured prompts and prompt libraries, custom automations, AI workflow chains, embedded copilots, agent permissions, training feedback loops and the muscle memory of users who’ve learned the AI’s quirks. Replacing the platform replaces all of it.
In contract negotiation, push for portability commitments specific to AI:
- Prompt and configuration ownership. Custom prompts, prompt libraries and AI workflow configurations are your IP, exportable at contract end in a usable format.
- Data exportability. Including training feedback data, AI interaction logs and audit trails - not just the underlying HR records.
- Hyperscaler dependency disclosure. Where the vendor’s AI relies on a third-party model (OpenAI, Anthropic, Google, AWS Bedrock), understand the contractual exposure if those relationships change.
- Transition assistance. AI-specific transition assistance at contract end, not generic SaaS exit.
Buyers who treat AI as a product feature rather than an embedded layer will be surprised by switching costs in five years. The discipline now is to contract for the exit while you still have leverage.
How this fits the SelectionWise method
AI evaluation runs across the full SelectionWise lifecycle. The toolkit provides the templates, checklists and AI accelerators to operationalise it at each phase.
A quick note on AI on both sides of the table. AI isn’t just what you’re buying; it can also be a tool that helps you buy well. It can generate value driver trees, draft RFP documents from your requirements, analyse vendor responses and summarise reference calls. AI evaluation and AI-assisted evaluation are two sides of the same selection.
- Phase A - Know What You Want. Define the problem first; decide whether AI is part of the solution second. A clear solution definition and a value driver tree anchor the AI question in business value.
- Phase B - Selection Preparation. Apply triage to AI capabilities. Build the AI-specific evaluation list. A structured requirements triage and an evaluation list make this concrete.
- Phase C - Vendor Selection. Scripted demos, data tests, hygiene assessment, reference checks. Data tests in particular carry the most weight for AI features.
- Phase D - Implementation Partner Selection. Your SI needs AI implementation experience, not just product experience. Ask differently.
- Phase E - Readiness & Contracting. AI governance operational. Contractual commitments captured. Implementation readiness assessed. Contracts signed only when ready to implement.
After go-live: operational measurement
AI evaluation doesn’t end at contract signature. The framework needs operational measurement to confirm that the AI is delivering the value case and behaving as expected in production. Set these up before go-live, not after.
- Adoption rates. Are users actually engaging with the AI features, by user group, location and use case?
- Override and acceptance rates. How often do users follow the AI’s recommendation? Patterns by population segment matter as much as the headline number.
- False positive and false negative tracking. Particularly in scoring and screening use cases. Measure both, not just accuracy in aggregate.
- Output variance. For generative features, sample outputs over time to detect drift or regression following vendor model updates.
- Escalation frequency. How often does the AI escalate to a human, and what is the human response pattern? Spikes indicate model issues; flat lines indicate the AI is being trusted blindly.
- Realised value against the value case. Quarterly review against the value driver tree. AI that does not move the metrics it was selected for is a sunk cost in disguise.
Build the measurement plan during Phase B. The metrics you need post go-live are the metrics you should be using to evaluate vendors during selection.
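A small example of the output-variance check above: keep a fixed probe set, capture baseline outputs at go-live, replay the probes on a schedule and flag answers that have drifted. The similarity measure below is a deliberately crude word-overlap ratio; in practice you would choose a metric suited to the feature, and the threshold is an assumption to tune.

```python
# Sketch of post-go-live drift detection for a generative feature: replay a
# fixed probe set on a schedule and compare against the baseline outputs
# captured at go-live. The similarity measure is deliberately crude.

def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of word sets: 1.0 = same vocabulary, 0.0 = disjoint."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

def drift_report(baseline: dict[str, str],
                 current: dict[str, str],
                 threshold: float = 0.7) -> list[tuple[str, float]]:
    """Flag probes whose current output has drifted from the baseline."""
    flagged = []
    for probe_id, base_output in baseline.items():
        similarity = word_overlap(base_output, current[probe_id])
        if similarity < threshold:
            flagged.append((probe_id, round(similarity, 2)))
    return flagged  # anything flagged after a vendor model update: investigate
```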