Haystack pulls data from version control systems, and version control data tends to follow long-tail distributions. Some of these long tails point to real bottlenecks in your process. Others come from tasks that matter little to your business but skew your data drastically. We call the latter outliers. This document covers the best practices for handling them.
The two types of outliers we see most often are:
Pull requests with change lead time > 1 year
Pull requests with LoC > 20000
In both situations, we need to decide whether the pull request delivers business value proportionate to its effect on the data. This is a subjective judgment, so it requires manual action.
The first issue usually happens when a pull request sits idle, for example a configuration change: 4 lines changed, nobody reviewed the pull request for years, and then it was merged. Suddenly there is a huge spike in change lead time, suggesting we are doing 10x worse than before. In reality, we're doing exactly the same, and a single long-tail data point is skewing the data.
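As a rough illustration of the two thresholds above, the sketch below flags pull requests that are candidates for manual review. The PullRequest record and the outlier_reasons helper are hypothetical names invented for this example, not part of Haystack, and lead time is approximated here as the time from opening to merging, which is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Thresholds taken from the two outlier types listed above.
MAX_LEAD_TIME = timedelta(days=365)
MAX_LINES_CHANGED = 20_000


@dataclass
class PullRequest:
    """Hypothetical record for illustration; not Haystack's data model."""
    number: int
    opened_at: datetime
    merged_at: datetime
    lines_changed: int


def outlier_reasons(pr: PullRequest) -> list[str]:
    """Return the reasons a pull request looks like an outlier candidate."""
    reasons = []
    # Assumption: lead time is approximated as opened-to-merged duration.
    if pr.merged_at - pr.opened_at > MAX_LEAD_TIME:
        reasons.append("lead time > 1 year")
    if pr.lines_changed > MAX_LINES_CHANGED:
        reasons.append("LoC > 20000")
    return reasons
```

A non-empty result only means the pull request deserves a look. Whether it should be excluded is still the manual, subjective call described above.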
The spike is caused by an idle pull request merged after 2 years
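To see how a single idle pull request can distort an average, here is a small sketch with made-up lead times in days. The numbers are illustrative only and do not come from real Haystack data.

```python
from statistics import mean, median

# Illustrative lead times in days for nine ordinary pull requests.
baseline = [2, 3, 1, 4, 2, 3, 2, 3, 2]

# The same period plus one idle pull request merged after 2 years.
with_outlier = baseline + [730]

baseline_mean = mean(baseline)        # average without the idle pull request
skewed_mean = mean(with_outlier)      # average once the idle pull request is merged
skewed_median = median(with_outlier)  # the median barely moves

print(f"mean without outlier: {baseline_mean:.1f} days")
print(f"mean with outlier:    {skewed_mean:.1f} days")
print(f"median with outlier:  {skewed_median:.1f} days")
```

One data point moves the mean from about 2.4 days to 75.2 days, while the median stays at 2.5 days. The team's day-to-day behavior is unchanged in both cases.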
The second issue usually happens when a developer pushes database seed data, huge JSON files, or auto-generated files, or lints the whole codebase. These tasks provide business value, but they skew the average disproportionately.
For both cases, we recommend going to the filters page and excluding pull requests by labels or terms.
The best practice Haystack recommends is adding a GitHub label named haystack-ignore. Whenever there is an outlier, add this label on GitHub so that multiple teams have a consistent way to manage outliers.
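The label can be added by hand in the GitHub UI. If you would rather script it, the sketch below builds a request for GitHub's REST endpoint that adds labels to an issue, which also works for pull requests. The build_label_request helper, the owner and repository names, and the token are placeholders assumed for this example.

```python
import json
import urllib.request

LABEL = "haystack-ignore"


def build_label_request(owner: str, repo: str, pr_number: int, token: str) -> urllib.request.Request:
    """Build (but do not send) a request that adds the label to a pull request.

    Hypothetical helper for illustration. It targets GitHub's
    "add labels to an issue" endpoint; pull requests share issue numbers.
    """
    url = f"https://api.github.com/repos/{owner}/{repo}/issues/{pr_number}/labels"
    body = json.dumps({"labels": [LABEL]}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",  # placeholder token
        },
    )


# Placeholder values; replace with your own before sending.
request = build_label_request("your-org", "your-repo", 123, "YOUR_TOKEN")
# To send it: urllib.request.urlopen(request)
```

The request is built separately from sending so it can be inspected first. The token needs permission to edit issues or pull requests in the repository.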
If you have questions, contact [email protected] and we can help you with your outliers.