The old saying goes, “A picture is worth a thousand words.” This is certainly true for optimization test results: viewing major KPIs over time in a graph tells you far more than calculating a lift and significance at a single point in time. We’ll use a standard A/B test scenario as an example to illustrate the importance of observing data in a graphical format to identify trends and look for learnings.
There are a number of points to consider when evaluating a test’s results in a visual format such as a line graph:
Directional trends versus static/random trends
Do you see a clear difference between the performance of your test and control groups? Your evaluation may not show significance yet, but if the delta between test and control is widening or narrowing over time, your significance will change with it. If your results seem random from day to day (up one day, down the next), you may have a lot of variability between your groups and may need more time to run the test (see Fig. A). Several tools available online can help you determine the test duration and population sizes required for a given test, depending on the lift you want to measure with significance. At the end of the day, your test idea may simply not move the needle enough to capture the impact within your test time frame.
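As a hypothetical sketch (the post doesn’t prescribe a specific statistical method), a two-proportion z-test run on cumulative daily totals illustrates how a stable delta between test and control can move toward significance as data accumulates. The daily numbers below are invented for illustration:

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value via the normal CDF (Phi(x) = 0.5 * (1 + erf(x / sqrt(2))))
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p_b / p_a - 1, p_value  # (lift vs. control, p-value)

# Hypothetical cumulative totals at three check-ins:
# (control conversions, control visitors, test conversions, test visitors)
checkins = [(50, 1000, 55, 1000), (105, 2000, 121, 2000), (160, 3000, 205, 3000)]
for conv_a, n_a, conv_b, n_b in checkins:
    lift, p = two_proportion_z(conv_a, n_a, conv_b, n_b)
    print(f"lift {lift:+.1%}, p-value {p:.3f}")
```

The same lift can be non-significant early on and significant later simply because the sample has grown, which is why a single point-in-time readout can mislead.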
If your data seems skewed in a direction that doesn’t make sense, look at the daily data points to see whether any outliers have been introduced into the evaluation. Outliers are data points that fall outside normal variation; if you determine they are legitimate, you’ll need to decide whether to keep them in the analysis or discard them. Are they likely to show up again? Maybe a group of purchases in either your test or control group isn’t representative of ‘normal’ sales. Maybe bad weather hurt sales at a location that makes up the majority of your test or control group (see Fig. B). Even when outliers are legitimate data points, measuring the impact with and without them lets you gauge their influence on the overall results and make a business decision about whether to keep them in the analysis.
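One common way to flag such points (an assumption on my part, since the post doesn’t name a method) is the interquartile-range rule; the daily sales figures here are invented, with one bulk purchase standing in for the ‘not normal’ day:

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] as outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

# Hypothetical daily sales for one group; 480 is a one-off bulk purchase
daily_sales = [102, 98, 110, 95, 105, 480, 99, 103]
outliers = iqr_outliers(daily_sales)
kept = [v for v in daily_sales if v not in outliers]

# Report the impact with and without the outlier, as the post suggests
print("mean with outliers:   ", statistics.mean(daily_sales))
print("mean without outliers:", statistics.mean(kept))
```

Reporting both means makes the outlier’s influence explicit, so the keep-or-discard call becomes a business decision rather than a statistical one.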
Immediate impact versus ramp-up
Testing anything that requires time before an impact can be expected – customers adjusting to a new booking flow, word spreading about a new product, staff getting up to speed with new procedures (think sales training) – can affect test results in ways that aren’t apparent initially. If you anticipate a ramp-up before seeing positive results in your test group, monitor the trend over time and determine whether metrics are continuing to improve or have plateaued. You may even observe an initial downward trend if the changes being tested are dramatic enough to disrupt normal sales at first. If we evaluated the test on 2/7 (noted by the green line in Fig. C), we would most likely conclude the test was underperforming and stop it. Looking at the graph, though, you can see an upward trend starting around 2/1, indicating the test was beginning to have a positive effect. This additional information might lead you to let the test continue, ultimately resulting in a winning test by 2/27!
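A simple rolling average is one way to separate a genuine ramp-up from day-to-day noise (a sketch of my own, not a method from the post; the daily lift series is invented to mimic the dip-then-recover pattern in Fig. C):

```python
def rolling_mean(series, window=3):
    """Smooth a noisy daily series so the underlying trend is easier to see."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

# Hypothetical daily lift (%) for a test that dips at launch, then ramps up
daily_lift = [-4, -3, -5, -2, -1, 0, 2, 3, 5, 6]
smoothed = rolling_mean(daily_lift)
print(smoothed)

# A consistently rising smoothed series suggests ramp-up rather than noise
improving = all(b >= a for a, b in zip(smoothed, smoothed[1:]))
print("ramping up:", improving)
```

Here the raw series starts negative, so an early point-in-time readout looks like a loser, while the smoothed trend shows steady improvement.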
Any decision to stop or extend a test should always include an evaluation of the impact on your business. Is the test important enough to let it run longer, or should we stop it and move on? Have we learned enough to make a decision, or do we really need more information before making a call? For example, if a test is hurting your KPI but the result isn’t yet significant, do you really need to keep it running just to claim significance? If it’s unlikely to turn the corner and become positive, why let it continue? Are other pressing ideas with equal or greater potential being held up by this test?
Rarely do we see a test that has an immediate delta between test and control and doesn’t fluctuate over time. What we want to see is a clear indication that the test is either helping or hurting the main KPIs and ultimately the business. If we can say this with confidence, then we can decide if there is value in letting the test run longer to gather additional data and claim more accuracy in our estimates of impact. Tests that are in the uncertain area (neutral with no clear winner or loser) are harder to evaluate and we often need to decide when it’s time to move on rather than continuing to collect data that has been inconclusive. The key is to observe the data and understand the trend (no apparent trend, overall positive, overall negative, or specific time periods when metrics shifted). This will give you much more valuable information than just a simple metric with an attached significance for making your business decisions.
By Will Plusch – Visualizing test data vs. point-in-time measurements, 1 February 2019