How we evaluate models for GitHub Copilot


There are so many AI models to choose from these days, from the proprietary foundation models of OpenAI, Google, and Anthropic to the smaller, more open options from the likes of Meta and Mistral. It’s tempting to hop immediately to the latest models, but just because a model is newer doesn’t mean it will perform better for your use case.

We recently expanded the models available in GitHub Copilot by adding support for Anthropic’s Claude 3.5 Sonnet, Google’s Gemini 1.5 Pro, and OpenAI’s o1-preview and o1-mini. As we added these models to GitHub Copilot, we kept performance, quality, and safety top of mind. In this article, we’ll share some of the GitHub Copilot team’s experience evaluating AI models, with a focus on our offline evaluations: the tests we run before making any change to our production environment. Hopefully our experience will help guide your own evaluations.

What is AI model evaluation?

AI models are systems that combine code, algorithms, and training data to simulate human intelligence in some way. GitHub Copilot lets users choose from a number of AI models trained on human language, known as large language models, or LLMs. OpenAI’s models are some of the most well-known LLMs due to ChatGPT’s popularity, though others such as Claude, Gemini, and Meta’s Llama models are increasingly popular as well.

AI model evaluation is the assessment of performance, quality, and safety of these models. Evaluations can include both automated tests and manual use. There are a number of frameworks and benchmarking tools available, but we generally build our own evaluation tools.

Automated tests enable evaluation at scale: you can see how well a model performs on a large number of tasks, but you tend to need objective criteria to evaluate the output. Manual testing enables more subjective evaluations of output quality and accuracy, but it is far more time-intensive. Combining the two approaches gives you a sense of the subjective quality of responses without relying entirely on anecdotal evidence or spending an enormous amount of time manually evaluating thousands upon thousands of responses.
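As a rough illustration of how those two approaches can be combined (this is not our actual tooling), the sketch below runs a simple automated check over every response and sets aside a random sample for human review. The response format and the keyword check are hypothetical stand-ins for whatever objective criteria you have.

```python
import random

def passes_automated_check(response: str, expected_keywords: list[str]) -> bool:
    """Stand-in objective check: does the response mention every expected keyword?"""
    return all(keyword.lower() in response.lower() for keyword in expected_keywords)

def evaluate(responses: list[dict], manual_sample_size: int = 50) -> dict:
    # Automated checks run over every response, which gives breadth...
    automated = [
        passes_automated_check(r["output"], r["expected_keywords"]) for r in responses
    ]
    # ...while a small random sample goes to human reviewers for depth.
    manual_queue = random.sample(responses, min(manual_sample_size, len(responses)))
    return {
        "automated_pass_rate": sum(automated) / len(automated),
        "manual_review_queue": manual_queue,
    }
```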

Automating code quality tests

We run more than 4,000 offline tests, most of them as part of our automated CI pipeline. We also conduct live internal evaluations, similar to canary testing, where we switch a number of Hubbers to use a new model. We test all major changes to GitHub Copilot this way, not just potential new models.

We’ll focus here on our offline tests. For example, we evaluate potential models by their ability to analyze and modify codebases. We have a collection of around 100 containerized repositories that have passed a battery of CI tests. We modify those repositories so the tests fail, and then see whether the model can change the codebase so that those failing tests pass again.
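Here’s a minimal sketch of what such a harness can look like. It isn’t our production code: the `make test` target, the `break_repo` and `ask_model_for_fix` callables, and the patch format are hypothetical placeholders, but the flow (start from a green repo, inject a failure, ask the model for a fix, re-run the suite) follows the process described above.

```python
import subprocess
from pathlib import Path
from typing import Callable

def run_tests(repo_dir: Path) -> subprocess.CompletedProcess:
    """Run the repository's test suite (here, a hypothetical `make test` target)."""
    return subprocess.run(["make", "test"], cwd=repo_dir, capture_output=True, text=True)

def evaluate_fix_ability(
    repo_dir: Path,
    break_repo: Callable[[Path], None],
    ask_model_for_fix: Callable[[str], str],
) -> bool:
    """Break a known-green repo, ask the candidate model to repair it, and re-run CI."""
    assert run_tests(repo_dir).returncode == 0, "repo must start with passing tests"

    break_repo(repo_dir)                      # inject one failure scenario
    failing = run_tests(repo_dir)
    assert failing.returncode != 0, "the injected change should make the tests fail"

    # Hand the failing test output to the model and apply whatever patch it returns.
    patch = ask_model_for_fix(failing.stdout + failing.stderr)
    subprocess.run(["git", "apply", "-"], cwd=repo_dir, input=patch, text=True, check=True)

    # The model passes this scenario only if the original suite is green again.
    return run_tests(repo_dir).returncode == 0
```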

We generate as many different scenarios as we can in different programming languages and frameworks. We’re also constantly increasing the number of tests we run, including using multiple different versions of the languages we support. This work takes a lot of time, but it’s the best way we have to evaluate a model’s quality.

Using AI to test AI

GitHub Copilot does more than generate code. Copilot Chat can answer questions about code, suggest various approaches to problem solving, and more. We have a collection of more than 1,000 technical questions we use to evaluate the quality of a model’s chat capabilities. Some of these are simple true-or-false questions that we can easily evaluate automatically. For more complex questions, we use another LLM to check the answers provided by the model we’re evaluating, which lets us scale our efforts beyond what we could accomplish through manual testing alone.
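To make that concrete, here’s a simplified sketch (not our actual harness) of both grading paths: a direct check for true-or-false questions and a judge model for open-ended ones. The prompt wording and the `call_judge_model` callable are hypothetical.

```python
JUDGE_PROMPT = """You are grading an answer to a technical question.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with a single word: PASS if the candidate answer is correct and complete, FAIL otherwise."""

def grade_true_false(candidate_answer: str, expected: bool) -> bool:
    """Simple questions can be checked automatically without another model."""
    return candidate_answer.strip().lower().startswith("true") == expected

def grade_with_judge(call_judge_model, question: str, reference: str, candidate: str) -> bool:
    """For open-ended questions, ask a separate judge LLM whether the answer holds up."""
    verdict = call_judge_model(
        JUDGE_PROMPT.format(question=question, reference=reference, candidate=candidate)
    )
    return verdict.strip().upper().startswith("PASS")
```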

We use a model with known good performance for these purposes to ensure consistent evaluations across our work. We also routinely audit the outputs of this LLM in evaluation scenarios to make sure it’s working correctly. It’s a challenge to ensure that the LLM we use for evaluating answers stays aligned with human reviewers and performs consistently over many requests.
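One illustrative way to picture that auditing step (again, a sketch rather than our exact method) is to track how often the judge agrees with human reviewers on a labeled sample, and how often it returns the same verdict when it re-grades the same answer:

```python
def judge_human_agreement(judge_verdicts: list[bool], human_verdicts: list[bool]) -> float:
    """Fraction of sampled answers where the judge LLM and a human reviewer agree."""
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(human_verdicts)

def judge_consistency(repeated_verdicts: list[list[bool]]) -> float:
    """Fraction of answers that get the same verdict across repeated grading runs.

    `repeated_verdicts[run_index][answer_index]` holds the verdict for one answer in one run.
    """
    per_answer = list(zip(*repeated_verdicts))
    stable = sum(len(set(runs)) == 1 for runs in per_answer)
    return stable / len(per_answer)
```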

We also run these tests against our production models every day. If we see degradation, we audit the results to find out why the models aren’t performing as well as they used to. Sometimes we need to make changes, such as modifying some prompts, to get back to the quality level we expect.
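Conceptually, that daily check boils down to comparing today’s scores against a baseline and flagging anything that drops too far. The numbers and the 2% tolerance below are made up purely for illustration.

```python
def has_regressed(todays_pass_rate: float, baseline_pass_rate: float,
                  tolerance: float = 0.02) -> bool:
    """Flag a regression when quality drops more than `tolerance` below the baseline."""
    return todays_pass_rate < baseline_pass_rate - tolerance

# Hypothetical example: the baseline pass rate was 91%, but today's run scored 86%.
if has_regressed(todays_pass_rate=0.86, baseline_pass_rate=0.91):
    print("Quality regression detected: audit recent prompt or model changes.")
```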

Running the tests

One of the cool things about our setup is that we can test a new model without changing the product code. We have a proxy server built into our infrastructure that the code completion feature uses, and we can easily change which API the proxy server calls for responses without any change on the client side. This lets us rapidly iterate on new models.
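In spirit, the routing decision looks something like the sketch below: a server-side setting picks which upstream model endpoint handles the request, while clients keep calling the same proxy route. The endpoint URLs and the environment variable are hypothetical, not our actual configuration.

```python
import os
import urllib.request

# Hypothetical upstream endpoints; swapping between them is purely a proxy-side change.
MODEL_ENDPOINTS = {
    "current-model": "https://models.internal.example/v1/current/completions",
    "candidate-model": "https://models.internal.example/v1/candidate/completions",
}

def forward_completion_request(payload: bytes) -> bytes:
    """Forward a completion request to whichever model the proxy is configured to use."""
    target = MODEL_ENDPOINTS[os.environ.get("COPILOT_EVAL_MODEL", "current-model")]
    request = urllib.request.Request(
        target, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```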

All of these tests run on our own custom platform built primarily with GitHub Actions. Results are then piped in and out of systems like Apache Kafka and Microsoft Azure, and we leverage a variety of dashboards to explore the data.

Making the call to adopt or not

The big question is what to do with the data we’ve collected. Sometimes a decision is straightforward, such as when a model does poorly across the board. But what if a model shows a substantial bump in acceptance rates, but also increases latency?

There are sometimes inverse relationships between metrics. Higher latency might actually lead to a higher acceptance rate, because users see fewer suggestions in total, and each suggestion that does appear may be more likely to be accepted.
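A toy example with made-up numbers shows why the comparison isn’t straightforward: the slower candidate surfaces fewer suggestions, so its acceptance rate looks higher even though users accepted fewer suggestions overall.

```python
from dataclasses import dataclass

@dataclass
class EvalRun:
    suggestions_shown: int
    suggestions_accepted: int
    mean_latency_ms: float

    @property
    def acceptance_rate(self) -> float:
        return self.suggestions_accepted / self.suggestions_shown

# Hypothetical results for the current model and a slower candidate.
current = EvalRun(suggestions_shown=10_000, suggestions_accepted=2_500, mean_latency_ms=180)
candidate = EvalRun(suggestions_shown=7_000, suggestions_accepted=2_100, mean_latency_ms=320)

print(f"current:   {current.acceptance_rate:.0%} accepted at {current.mean_latency_ms:.0f} ms")
print(f"candidate: {candidate.acceptance_rate:.0%} accepted at {candidate.mean_latency_ms:.0f} ms")
# The candidate wins on acceptance rate (30% vs. 25%) but loses on latency and total accepts.
```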

GitHub’s goal is to create the best quality, most responsible AI coding assistant possible, and that goal guides the decisions we make about which models to support within the product. But we couldn’t make those decisions without the quality data our evaluation procedures provide. Hopefully we’ve given you a few ideas you can apply to your own use cases.

Build generative AI applications with GitHub Models

GitHub Models makes it simple to use, compare, experiment with, and build with AI models from OpenAI, Cohere, Microsoft, Mistral, and more.

Start building now >
