Large Language Model Psychometrics: A Systematic Review of Evaluation, Validation, and Enhancement

1State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University
2School of Psychological and Cognitive Sciences, Peking University
3Key Laboratory of Machine Perception (Ministry of Education), Peking University
4PKU‑Wuhan Institute for Artificial Intelligence

Figure 1: Overview of our survey on LLM Psychometrics.

Abstract

The rapid advancement of large language models (LLMs) has outpaced traditional evaluation methodologies and presents novel challenges, such as measuring human-like psychological constructs, moving beyond static and task-specific benchmarks, and establishing human-centered evaluation. These challenges intersect with Psychometrics, the science of quantifying the intangible aspects of human psychology, such as personality, values, and intelligence. This survey introduces and synthesizes the emerging interdisciplinary field of LLM Psychometrics, which leverages psychometric instruments, theories, and principles to evaluate, understand, and enhance LLMs. We systematically explore the role of Psychometrics in shaping benchmarking principles, broadening evaluation scopes, refining methodologies, validating results, and advancing LLM capabilities. This paper integrates diverse perspectives to provide a structured framework for researchers across disciplines, enabling a more comprehensive understanding of this nascent field. Ultimately, we aim to provide actionable insights for developing future evaluation paradigms that align with human-level AI and promote the advancement of human-centered AI systems for societal benefit. A curated repository of LLM psychometric resources is available at https://github.com/valuebyte-ai/Awesome-LLM-Psychometrics.

Comparison: Psychometrics vs AI Benchmarks

Table 1: Systematic comparison between psychometric evaluation and conventional AI benchmarking approaches.
Core goal
  • Psychometrics: To measure psychological constructs, to provide evidence that a test measures what it intends to measure (validity evidence), and to understand the construct being measured.
  • AI benchmark: To test and compare the task performance of different LLMs, focusing on ranking models and selecting the one best suited to a specific task.

Philosophy of measurement
  • Psychometrics: Construct-oriented. Tends toward a causal approach to measurement, in which the measured trait is believed to cause the measurement outcomes.
  • AI benchmark: Task-oriented. Leans toward representativism, assuming items exhaust or represent all aspects of the underlying ability.

Target construct
  • Psychometrics: Personality and ability.
  • AI benchmark: Mostly task-specific abilities.

Construct definition
  • Psychometrics: Emphasizes clear and detailed definitions of the construct being measured; agreement on the construct definition is a byproduct of test development.
  • AI benchmark: Often defines constructs implicitly through ad hoc task selection; construct definitions can be vague.

Development process
  • Psychometrics: Systematic and rigorous, often following methods such as Evidence-Centered Design (ECD); can be labor-intensive.
  • AI benchmark: Compiles a set of relevant questions or tasks, then uses expert annotation or crowdsourcing to label ground-truth answers; less labor-intensive per item.

Number of items
  • Psychometrics: Can vary, and is not necessarily large; the focus is on item quality and relevance to the construct.
  • AI benchmark: Typically an extensive number of questions to cover various aspects of ability; reliability increases with test length.

Sample size
  • Psychometrics: Typically requires a larger sample of test takers for robust statistical modeling.
  • AI benchmark: Can be applied to evaluate the performance of a single LLM on the benchmark.

Statistical modeling
  • Psychometrics: Employs a variety of advanced statistical models, such as Item Response Theory and Factor Analysis, to analyze data, estimate latent abilities, and assess model fit.
  • AI benchmark: Often relies on simple aggregation methods, such as average accuracy across benchmark tasks.

Result analysis
  • Psychometrics: Ensures the reliability, validity, predictive power, and explanatory power of the test through result analysis and statistical modeling.
  • AI benchmark: Reliability is likely to be high due to the large number of items, but validity, predictive power, and explanatory power beyond the target task are not a primary concern.
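
A minimal, self-contained sketch (not taken from the survey) makes the statistical-modeling contrast concrete: it compares benchmark-style average accuracy with a maximum-likelihood ability estimate under a two-parameter logistic (2PL) IRT model. The item parameters and response pattern below are invented for illustration.

import numpy as np

def irt_2pl(theta, a, b):
    """Probability of a correct response under the 2PL IRT model."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical item parameters: discrimination (a) and difficulty (b) per item.
a = np.array([1.2, 0.8, 1.5, 1.0, 2.0])
b = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])

# One model's scored responses on the five items (1 = correct, 0 = incorrect).
responses = np.array([1, 1, 1, 0, 0])

# Benchmark-style aggregation: unweighted average accuracy.
avg_accuracy = responses.mean()

# IRT-style aggregation: maximum-likelihood ability estimate over a theta grid.
grid = np.linspace(-4, 4, 801)
p = irt_2pl(grid[:, None], a, b)  # shape: (grid points, items)
log_lik = (responses * np.log(p) + (1 - responses) * np.log(1 - p)).sum(axis=1)
theta_hat = grid[np.argmax(log_lik)]

print(f"average accuracy = {avg_accuracy:.2f}, IRT ability estimate = {theta_hat:.2f}")

Unlike the plain average, the IRT estimate weights items by their discrimination and difficulty, so two models with the same accuracy can receive different ability estimates depending on which items they answer correctly.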

Measuring Psychological Constructs


Figure 2: Examples of psychometric tests for LLMs, showing both personality (left) and cognitive (right) evaluations.

Psychological Constructs in LLM Research

LLM psychometrics evaluates LLMs along two families of psychological constructs: personality and cognition. Personality constructs include (1) personality traits based on theories such as the Big Five, HEXACO, MBTI, or the Dark Triad; (2) values based on frameworks such as Schwartz's theory, the WVS, VSM, and GLOBE; (3) morality based on MFT, the DIT, and ETHICS; and (4) attitudes and opinions drawn from political surveys and instruments such as the ANES, ATP, GLES, and PCT. Cognitive constructs include (1) heuristics and biases, measured by tasks such as the Cognitive Reflection Test; (2) social interaction abilities, including Theory of Mind and emotional and social intelligence; (3) psychology of language, covering comprehension, generation, and acquisition; and (4) learning and cognitive capabilities.
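
As a concrete illustration of how structured personality inventories are scored, the sketch below aggregates Likert responses into trait scores, handling reverse-keyed items explicitly. The item texts, keys, and responses are invented; real inventories such as the BFI or the IPIP item pool define their own items and scoring keys.

LIKERT_MAX = 5  # responses on a 1-5 agreement scale

# Each hypothetical item maps to a trait and a key (+1 normal, -1 reverse-scored).
items = [
    {"trait": "extraversion",      "key": +1, "text": "I am the life of the party."},
    {"trait": "extraversion",      "key": -1, "text": "I tend to stay quiet around strangers."},
    {"trait": "conscientiousness", "key": +1, "text": "I pay attention to details."},
    {"trait": "conscientiousness", "key": -1, "text": "I often forget to put things back."},
]

def score(responses):
    """Average per-trait scores after reverse-coding negatively keyed items."""
    totals, counts = {}, {}
    for item, r in zip(items, responses):
        value = r if item["key"] == +1 else (LIKERT_MAX + 1 - r)
        totals[item["trait"]] = totals.get(item["trait"], 0) + value
        counts[item["trait"]] = counts.get(item["trait"], 0) + 1
    return {trait: totals[trait] / counts[trait] for trait in totals}

print(score([4, 2, 5, 1]))  # {'extraversion': 4.0, 'conscientiousness': 5.0}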

Evaluation Methodology


Figure 3: Overview of LLM psychometric evaluation methodology, including test formats, data sources, prompting strategies, model outputs, and scoring mechanisms.

Evaluation Methodologies for LLM Psychometrics

LLM psychometrics mirrors the classic psychometric testing pipeline in many respects while tailoring each stage to LLMs. Test formats range from tightly controlled structured items (forced-choice or Likert) to open-ended conversations and full agentic simulations. Data sources may come from established inventories, custom-curated adaptations, or synthetic items generated automatically to extend test coverage. Prompting strategies include perturbing the original question and injecting performance-enhancing or role-playing instructions. Finally, output and scoring modules translate the model's raw text into numerical metrics: logit-based analysis and direct scoring for closed-ended outputs, and rule-based, model-based, or human evaluation for open-ended outputs.
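
As an illustration of logit-based scoring for closed-ended items, the sketch below compares the next-token logits of the option labels for a single forced-choice item. It assumes the Hugging Face transformers library; the model name, item wording, and prompt format are placeholders rather than a prescribed protocol.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute the LLM under evaluation
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = (
    "Statement: I see myself as someone who is talkative.\n"
    "Options: (A) Agree (B) Disagree\n"
    "Answer: ("
)
options = ["A", "B"]

with torch.no_grad():
    inputs = tokenizer(prompt, return_tensors="pt")
    next_token_logits = model(**inputs).logits[0, -1]  # logits for the next token

# Restrict attention to the first token of each option label and normalize.
option_ids = [tokenizer.encode(o, add_special_tokens=False)[0] for o in options]
option_probs = torch.softmax(next_token_logits[option_ids], dim=0)
for option, prob in zip(options, option_probs):
    print(f"P({option}) = {prob.item():.3f}")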

Psychometric Validation


Figure 4: Overview of psychometric validation: reliability and consistency, validity, and standards and recommendations.

Validation Framework for LLM Psychometrics

Applying psychometric instruments to LLMs requires validating the resulting measurements. Reliability is assessed through test-retest reliability, parallel-forms reliability, and inter-rater reliability when subjective coding is involved. Validity evidence is gathered on multiple fronts: content validity (guarding against training-data contamination and item under-representation), construct validity (ensuring responses reflect the intended latent trait rather than confounds such as response sets or social desirability bias), and criterion or ecological validity via correspondence with external benchmarks. We also survey emerging standards and recommendations, such as non-disclosure of test materials, fairness across languages and cultures, and the suitability of tests for model capabilities.
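
A minimal illustration of one such reliability check: the snippet below computes a test-retest coefficient by correlating trait scores obtained from two administrations of the same inventory (e.g., repeated runs with different sampling seeds or paraphrased items). The scores are invented for illustration.

import numpy as np

# Per-model (or per-persona) trait scores from two separate administrations.
run1 = np.array([3.8, 2.1, 4.5, 3.0, 2.7])
run2 = np.array([3.6, 2.4, 4.4, 3.2, 2.5])

# Pearson correlation as a simple test-retest reliability coefficient.
r = np.corrcoef(run1, run2)[0, 1]
print(f"test-retest reliability r = {r:.2f}")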

LLM Enhancement Techniques

Psychometrics also serves as a powerful toolkit for model enhancement across three key domains:

  • Trait Manipulation: Controlling LLM traits through prompting, inference-time interventions, and fine-tuning (a persona-prompting sketch follows this list).
  • Safety and Alignment: Leveraging psychometrics to guide LLM value alignment and improve safety.
  • Cognitive Enhancement: Developing stronger or more human-like reasoning, empathy, and communication capabilities.
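
As a hypothetical illustration of trait manipulation through prompting, the sketch below wraps a test question in a persona-inducing system prompt. The prompt wording, trait labels, and message format are illustrative assumptions, not a method prescribed by the survey.

def persona_messages(trait: str, level: str, question: str) -> list[dict]:
    """Wrap a test question in a system prompt that induces a target trait level."""
    system = (
        f"Adopt the persona of a person whose {trait} is {level}. "
        "Answer every question in a way that is consistent with this persona."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

messages = persona_messages(
    "extraversion", "very high",
    "Do you enjoy being the center of attention? Answer Yes or No.",
)
print(messages)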

Future Directions

Our survey identifies several emerging trends, challenges, and future directions for LLM psychometrics research:

  • Psychometric Validation: Establish rigorous reliability and validity checks.
  • From Human Constructs to LLM Constructs: Tailor psychological constructs for LLMs.
  • Perceived vs. Aligned Traits: Distinguish between traits that humans perceive from LLM outputs and those aligned with human self-views.
  • Anthropomorphization Challenges: How far LLMs should be anthropomorphized in psychometric testing remains a subject of academic debate.
  • Expanding Dimensions in Model Deployment: Extend evaluations to multilingual, multi-turn, multimodal, agent, and multi-agent contexts where new validity issues emerge.
  • Item Response Theory: Adopt IRT models to improve LLM evaluation.
  • From Evaluation to Enhancement: Leverage psychometrics to enhance and align LLMs.

Citation

@article{ye2025large,
  title={Large Language Model Psychometrics: A Systematic Review of Evaluation, Validation, and Enhancement},
  author={Ye, Haoran and Jin, Jing and Xie, Yuhang and Zhang, Xin and Song, Guojie},
  journal={arXiv preprint arXiv:2505.08245},
  year={2025},
  note={Project website: \url{https://llm-psychometrics.com}, GitHub: \url{https://github.com/ValueByte-AI/Awesome-LLM-Psychometrics}}
}