We gather data from code reviews completed within your organization and normalize it so that you can see meaningful trends and statistics. We already use many of the tools behind these calculations internally, as part of our pipeline for assessing and improving the quality of our PullRequest code reviewers.
Now we are making this data available to select teams for assessing and tracking teams' performance over time.
In aggregate, we filter outliers out of the data to keep it as accurate and meaningful as possible. For example, pull requests that sat open for months before finally being merged are generally excluded, as are pull requests that have 0 lines of change or are merged within a matter of seconds. Similarly, we exclude pull requests that contain mostly code that was already reviewed within another pull request. This is common in deployment flows where changes are reviewed in smaller pull requests and then merged into one larger pull request for deployment.
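The exclusion rules described above can be sketched as a simple predicate. This is a minimal illustration, not PullRequest's actual filter: the `PullRequestRecord` fields are hypothetical, and the thresholds (90 days, 60 seconds, 50% previously reviewed) are assumptions chosen only to make the example concrete.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass
class PullRequestRecord:
    """Hypothetical record; field names are illustrative only."""
    opened_at: datetime
    merged_at: datetime
    lines_changed: int
    lines_already_reviewed: int  # lines reviewed earlier in another pull request


# Assumed thresholds; the real cutoffs are not stated in this document.
MAX_OPEN = timedelta(days=90)
MIN_OPEN = timedelta(seconds=60)
MAX_REVIEWED_FRACTION = 0.5


def is_outlier(pr: PullRequestRecord) -> bool:
    """Return True if the pull request should be left out of aggregates."""
    if pr.lines_changed == 0:
        return True
    open_for = pr.merged_at - pr.opened_at
    if open_for > MAX_OPEN or open_for < MIN_OPEN:
        return True
    # "Mostly" already reviewed elsewhere, e.g. a deployment roll-up.
    return pr.lines_already_reviewed / pr.lines_changed > MAX_REVIEWED_FRACTION


def filter_outliers(prs: Iterable[PullRequestRecord]) -> List[PullRequestRecord]:
    """Keep only the pull requests that pass every exclusion rule."""
    return [pr for pr in prs if not is_outlier(pr)]
```

Keeping each rule as a separate check makes it easy to adjust one threshold without touching the others.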
Your organization's Benchmark scores are measured and ranked against anonymized datasets gathered from other teams that most closely resemble your own.
All development teams, regardless of size or industry vertical, strive to release features at a high velocity, write quality, maintainable code, and prevent issues such as bugs from getting deployed into production. However, depending on the stage and size of the company, the balance of priorities can shift in one direction or another. These differences have a significant, long-term influence on development lifecycle patterns.
For example, larger companies tend to focus on quality and reducing defects in production code, whereas smaller companies, such as early-stage startups, tend to focus first and foremost on shipping new features to fulfill core milestones and, in turn, invest less time in preemptively catching code-level issues. And the larger a team is, the more likely it is to have well-established code review processes.
We group your organization alongside similar organizations based on a number of factors. Your team's Benchmark scores are positioned within those of the larger, collective cohort. Years of research and analysis have gone into designing PullRequest's cohort-assignment systems, and enhancements are made to them on a daily basis. Though every development team is unique, we aim to make sure your team's scores are benchmarked against the broader industry as accurately as possible.
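One common way to position a score within a cohort is a percentile rank. The sketch below is only an illustration of that general idea; this document does not describe how PullRequest actually ranks scores, and the function name and the half-weight treatment of ties are assumptions.

```python
from bisect import bisect_left, bisect_right
from typing import Sequence


def percentile_rank(score: float, cohort_scores: Sequence[float]) -> float:
    """Percentage of cohort scores below `score`, counting ties as half."""
    if not cohort_scores:
        raise ValueError("cohort must contain at least one score")
    ordered = sorted(cohort_scores)
    below = bisect_left(ordered, score)
    equal = bisect_right(ordered, score) - below
    return 100 * (below + 0.5 * equal) / len(ordered)


# A score of 3.0 in a cohort of [1.0, 2.0, 3.0, 4.0]:
# two scores are below and one ties, so the rank is 100 * 2.5 / 4.
print(percentile_rank(3.0, [1.0, 2.0, 3.0, 4.0]))  # 62.5
```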
Anonymized datasets from hundreds of teams, thousands of pull requests, and millions of lines of code are assessed daily as part of these efforts. Code review processes are inherently human, and no two are the same. The PullRequest team has invested a great deal of time and research in deriving these metrics and fine-tuning them to be concise, actionable, and guarded against factors likely to pollute the data or inject noise.
Further, Benchmark scores are normalized such that the metrics are consistently relevant regardless of things like fluctuations in the volume of code committed and changes in team size or composition.
Below are the Benchmark scores provided by PullRequest, along with explanations of what they mean and the important insights you can draw from them.
An "issue" is defined as an inline comment left by a reviewer of the pull request that either resulted in a change to the pull request prior to merge or provided useful insight to the code author. This excludes things like automated messages from bots, as well as replies to existing comments. The data for this statistic is normalized to a count of issues caught per 1,000 lines of change in pull requests that ended up being merged. Similar to the Code Review Size metric below, this counts only meaningful lines of change, filtering out generated files so that it focuses on the lines actually being reviewed.
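The definition and normalization above can be sketched as follows. This is a minimal illustration under stated assumptions: the `InlineComment` fields are hypothetical, and deciding whether a comment "led to a change" or "provided useful insight" is assumed to have been done upstream.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class InlineComment:
    """Hypothetical comment record; field names are illustrative only."""
    is_bot: bool
    is_reply: bool
    led_to_change: bool
    was_useful_insight: bool


def count_issues(comments: Iterable[InlineComment]) -> int:
    """Count comments that qualify as issues under the definition above."""
    return sum(
        1
        for c in comments
        if not c.is_bot
        and not c.is_reply
        and (c.led_to_change or c.was_useful_insight)
    )


def issue_catch_rate(issues: int, meaningful_lines_changed: int) -> float:
    """Issues caught per 1,000 meaningful lines of change in merged pull requests."""
    if meaningful_lines_changed <= 0:
        raise ValueError("need at least one meaningful line of change")
    return issues * 1000 / meaningful_lines_changed


# 6 issues across 2,400 meaningful lines of change.
print(issue_catch_rate(6, 2400))  # 2.5
```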
This metric is meant to approximate the number of effective comments left on code reviews that result in an iterative improvement to the code or product.
This is a much more meaningful statistic than the raw number of comments posted per pull request or the number of pull requests created, though it is helpful to consider those alongside it. It encourages reviewers to review code thoroughly and post thoughtful, high-value feedback.
Perhaps contrary to what the name implies, a greater number of issues caught is desirable. It is an indicator of how much value you are deriving from your code review process: when issues and improvement opportunities are identified before the code is merged, fewer bugs reach production and the team spends less time addressing issues later in a costly, reactive pattern. The score incentivizes making sure changes are thoroughly assessed by experienced reviewers, and encourages your developers to become better reviewers.
The size of a code review is determined by the number of lines of change within each reviewed pull or merge request. This statistic indicates the relative cognitive load on reviewers, based on how complicated the proposed changes are, and whether changes to a project are introduced in appropriately sized iterations.
To portray this more accurately than a raw line count would, PullRequest filters out sections of code that result from generated files or other file types that are not commonly reviewed. Otherwise, the measurement would be largely skewed by things like package-lock.json files, which are generated by npm and can contain thousands of lines of change that are often trivial to review.
We also filter out test files modified or added within the review, as our extensive research has shown that adding tests alongside the functional implementation increases line counts but does not increase cognitive load. In fact, including tests helps reviewers understand the proposed changes more effectively than if they were left out.
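A filtered line count along these lines might look like the sketch below. The file patterns are assumptions for illustration only (apart from package-lock.json, which is named above); the actual rules PullRequest uses to detect generated and test files are not described here.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath
from typing import Mapping

# Assumed patterns; a real classifier would likely be far more thorough.
GENERATED_PATTERNS = ("package-lock.json", "*.min.js", "*.lock")
TEST_NAME_PATTERNS = ("test_*.py", "*_test.go", "*.test.js", "*.spec.ts")
TEST_DIRECTORIES = ("test", "tests", "__tests__")


def is_generated(path: str) -> bool:
    """True if the file name matches a generated-file pattern."""
    name = PurePosixPath(path).name
    return any(fnmatch(name, pattern) for pattern in GENERATED_PATTERNS)


def is_test(path: str) -> bool:
    """True if the file looks like a test by name or by directory."""
    parsed = PurePosixPath(path)
    if any(fnmatch(parsed.name, pattern) for pattern in TEST_NAME_PATTERNS):
        return True
    return any(part in TEST_DIRECTORIES for part in parsed.parts[:-1])


def review_size(lines_changed_by_file: Mapping[str, int]) -> int:
    """Sum lines of change, skipping generated files and test files."""
    return sum(
        lines
        for path, lines in lines_changed_by_file.items()
        if not is_generated(path) and not is_test(path)
    )


changes = {
    "src/app.js": 120,
    "package-lock.json": 4000,
    "src/app.test.js": 80,
}
print(review_size(changes))  # 120
```

Without the filter, the same pull request would register 4,200 lines of change, almost all of it from the lockfile.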
This metric is meant to indicate the average cognitive load on reviewers doing code reviews within your organization. Numerous studies have found that cognitive load increases exponentially with the number of lines of change in a code review. Essentially, more lines of reviewable code mean more variables and more possible interactions with other functionality that must be taken into account. Increased cognitive load has two key negative impacts on your code review process: increased latency in reviewer response, and decreased thoroughness, meaning fewer issues caught prior to merge (fatigue effects).
Code review requires a reviewer to shift context in order to understand the code changes. The larger the review, the longer this context shift takes, and the lower the reviewer's ability (and desire) to quickly review or follow up on changes. Similarly, the larger the changeset, the less likely a reviewer is to notice a small syntax or logic error among the thousands of other additions, deletions, and modifications. This results in more bugs sneaking past code review into production or QA lifecycles.
This is the average duration that a code review is in progress, from open to merge. Compared with other Benchmark metrics, it is relatively straightforward to calculate. PullRequest's advanced filtering removes extreme outliers with notably short or notably long lifecycles. Occasional abandoned pull requests, open for months and eventually merged, would otherwise pollute the dataset. Similarly, extremely short pull request lifecycles, which may not have required review or may have been reviewed outside of a proper pull request workflow, would misleadingly skew the dataset.
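The calculation can be sketched as a mean over open-to-merge durations that fall inside a sane window. The bounds used here (one minute and 90 days) are assumptions for illustration; the actual cutoffs are not stated in this document.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import Iterable, Optional, Tuple


def average_lifecycle_hours(
    lifecycles: Iterable[Tuple[datetime, datetime]],
    min_open: timedelta = timedelta(minutes=1),   # assumed lower bound
    max_open: timedelta = timedelta(days=90),     # assumed upper bound
) -> Optional[float]:
    """Mean open-to-merge time in hours, ignoring extreme lifecycles.

    Each item is an (opened_at, merged_at) pair. Returns None when no
    pull request survives the filter.
    """
    kept = [
        merged_at - opened_at
        for opened_at, merged_at in lifecycles
        if min_open <= merged_at - opened_at <= max_open
    ]
    if not kept:
        return None
    return mean(d.total_seconds() for d in kept) / 3600


start = datetime(2024, 1, 1)
sample = [
    (start, start + timedelta(hours=24)),
    (start, start + timedelta(hours=48)),
    (start, start + timedelta(seconds=10)),  # merged almost instantly
    (start, start + timedelta(days=200)),    # left open for months
]
print(average_lifecycle_hours(sample))  # 36.0
```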
This metric can be used to measure the efficiency of your code review feedback loop. In conjunction with a healthy Issue Catch Rate, the shorter that this duration is, the better. Both the author and the reviewer will require less time re-familiarizing themselves with the proposed changes. This will also mean that features and bug fixes are able to get from development to production faster without excess time spent in the code review process.
This is the average time that it takes for a code reviewer to leave their first round of feedback to the owner of the code changes after the pull request has been created. This metric focuses on the human-first response, excluding things like automated comments left by bots.
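For a single pull request, the latency described above is the gap between creation and the earliest qualifying comment. The sketch below uses a hypothetical `ReviewComment` record; excluding the pull request author's own comments is an assumption, on the reasoning that an author cannot give themselves a first round of review.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass
class ReviewComment:
    """Hypothetical comment record; field names are illustrative only."""
    author: str
    is_bot: bool
    created_at: datetime


def first_response_latency(
    opened_at: datetime,
    pr_author: str,
    comments: Iterable[ReviewComment],
) -> Optional[timedelta]:
    """Time from pull request creation to the first human reviewer comment.

    Returns None if no human reviewer has responded yet.
    """
    human_times = [
        c.created_at
        for c in comments
        if not c.is_bot and c.author != pr_author and c.created_at >= opened_at
    ]
    if not human_times:
        return None
    return min(human_times) - opened_at
```

Averaging this value over the pull requests in a period gives the team-level figure.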
This metric can be used in combination with the Lifecycle Duration metric to determine the overall "speed" of a code review process. While the Lifecycle Duration score measures the total time of review from start to end, the Reviewer Response Latency score measures the critical first round of feedback. Both are equally important in improving a development team's velocity. The Lifecycle Duration score reflects a number of other factors, such as whether many rounds of feedback and follow-up changes are required, which may in turn demand more time and increase the score. Reviewer Response Latency, by contrast, measures speed and agility within the review lifecycle.
This is the percentage of meaningful lines of code modified in code reviews that are in test files. As with other metrics that use line counts, this excludes lines from generated code and from files that are not usually reviewed by a human.
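As a worked example, the percentage can be computed from two line counts once generated and non-reviewed files have been removed. The function and parameter names below are hypothetical, and classifying files as test or non-test is assumed to happen beforehand.

```python
def percent_test_lines(test_lines: int, other_lines: int) -> float:
    """Share of meaningful changed lines that live in test files, as a percentage.

    Both counts are assumed to already exclude generated files and other
    files that are not usually reviewed by a human.
    """
    total = test_lines + other_lines
    if total <= 0:
        raise ValueError("need at least one meaningful line of change")
    return 100 * test_lines / total


# 150 changed lines in test files out of 600 meaningful lines overall.
print(percent_test_lines(150, 450))  # 25.0
```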
This metric can help estimate how well tested the proposed changes are. Other metrics, such as code coverage, can also provide good information about how thoroughly code is tested. Like code coverage scans, this metric does not guarantee that the tests effectively exercise all of the edge cases. Rather, it provides useful insight into the effort spent on testing. Even some of the best development teams tend to benefit from increasing this percentage. Not only does it help to increase the quality of your product, but it can also speed up your code review process. Our research has shown that reviewers complete a review much faster and cite more issues prior to merge if tests are added alongside functional updates to exercise the modified or new code.