Research

Model Evaluation

2 articles in archive

Advancing model performance and real-world evaluation in applied domains.

OpenAI Blog, 345d ago

We’re releasing a human-validated subset of SWE-bench that more reliably evaluates AI models’ ability to solve real-world software issues.

OpenAI Blog, 584d ago