Describe your background in data analysis.
In high school, I took a Statistics class and the biggest takeaway for me was the teacher saying “every field needs statistics so you can pretty much do anything you want with this background.” When I heard that, I was like “sweet!” I didn’t really know what I wanted to do yet, so it sounded like a perfect bridge for anything I wanted to do.
I’ve worked in three different industries: manufacturing, entertainment, and travel. I worked in manufacturing at Michelin and did a lot of A/B testing, experimental design, and quality control. Then, I moved on to the Walt Disney Company where I analyzed their customer behavior, like when people are traveling, what resorts are they preferring, that kind of stuff – to help inform more of the business side of the company.
Now, I’m at Fusion, which is still in the travel industry, but with a narrower focus on the airline business. And in each one of those roles, the business had questions, and data was the means of answering them.
Tell us a little more about your role at Fusion.
Sure. My role here started out as a pure statistician. I was a data miner looking for trends and patterns within the airline industry and the data that we receive here — very investigative. Eventually, the focus was narrowed to analyzing variants that we tested and trying to find “golden nuggets” within the test. Sometimes, when we run a variant, overall it appears that it failed, but I was able to find pockets of customers where the variant actually did pretty well. We could then create business rules to only show that particular variant to a smaller group. A good example of this is a variant that is geared towards upselling a VIP lounge. For some airlines, the business customers are the target group for this product, but they could represent a much smaller population than the leisure customers. So, if this variant did not perform well overall, it may be because the leisure customers are outweighing the business customers and pulling the overall result down. However, if we just look at the business customers, the variant did really well and helped upsell the VIP lounge product. We can then build new segments based on rules to define the business customers and promote this variant just to that group.
I have now grown into the role of Director of Business Analysis, with my team gathering more insights from the millions of rows of data that we receive at Fusion, which can lead to better optimization. We can then begin to understand the industry better, looking at things like: Are there certain price points that customers prefer? Do these price points align with the business strategy of our partners and of their ancillary products? We also allow our customers to view the data through our Customer Portal – a visualization tool that helps our partners understand their data, as well as how our tests are influencing the business.
How do you determine the success of a test?
The success of a test is defined as: Did it improve your KPI – your key performance indicator? Did it change the conversion rate? Were there more ancillary product purchases within the variant in the test? Or did it yield more revenue from a financial standpoint? Some of these metrics can conflict, so it is important to set this upfront before a test is put into production. For instance, if a variant increases the price of a product, you may take a hit in conversion rate – but did the added revenue from that price increase overcome the drop in conversion? At the end of the test, we ask ourselves: “Did the variant beat what we had in place in terms of our KPI?” And again, that KPI may be different depending on the test. There are times when we are looking at a more user-experience KPI – time spent on page, clicks in a particular booking flow, or something like that. In these situations, you may find that revenue or conversion rate actually decreases, but the overall user experience is improved, customers are coming back more frequently, and in the long run there is a revenue benefit.
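As a toy illustration of that price-versus-conversion tradeoff, here is a quick back-of-the-envelope calculation. All numbers are invented for the example, not Fusion data:

```python
# Hypothetical offer: the variant raises the price, conversion dips,
# but revenue per visitor can still come out ahead.
control_price, control_conv = 20.00, 0.050   # $20 offer, 5.0% conversion
variant_price, variant_conv = 25.00, 0.045   # $25 offer, 4.5% conversion

visitors = 10_000
control_revenue = visitors * control_conv * control_price
variant_revenue = visitors * variant_conv * variant_price

print(control_revenue)  # 10000.0
print(variant_revenue)  # 11250.0 -> the price increase overcame the conversion hit
```

Here the variant loses on the conversion-rate KPI but wins on the revenue KPI, which is exactly why the KPI has to be agreed upfront.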
So there are a lot of factors that could be involved in determining success – not just a single formula.
How do you determine when to stop an experiment?
Stopping an experiment is a little bit more challenging. We don’t want to call a test too early because the results during the test may not reflect how it will perform going forward. This is true of both positive and negative tests. We have to make sure we have enough data so that we can safely say “this is what happened during the test period and this is what will happen going forward.” There are underlying statistical tests we perform that help us determine whether or not a test should stop, and if it should continue running, we can estimate the number of days it should run.
Having enough data to determine this result is the key here. You might put a test out on a Monday or Tuesday, look at it on Thursday, and it appears to be performing positively. Well, that could just be because there’s too much variability within those first few days, so we need more data to determine if the test is successful in the long run. We want to make sure we control and handle that variability, and the best lever is gathering more data.
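The interview doesn’t specify which statistical tests Fusion runs, but a common generic way to see why early results are unreliable is a two-proportion z-test on the conversion rates – with small samples, even a visible lift produces a weak signal. A minimal sketch, with invented numbers:

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Pooled z-statistic comparing two observed conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# A few days in: 5.0% vs 6.0% conversion on 1,000 sessions each.
z_early = two_proportion_z(50, 1000, 60, 1000)

# Same observed rates, ten times the data.
z_late = two_proportion_z(500, 10000, 600, 10000)

print(round(z_early, 2))  # below the usual 1.96 cutoff: not yet conclusive
print(round(z_late, 2))   # well above 1.96: now a meaningful signal
```

The same apparent lift that looks “positive” on Thursday only becomes statistically trustworthy once enough sessions have accumulated.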
We can also look at how the test has performed over several days to see how it is trending. Even if we don’t have enough data to conclude a test is significant or not, if the test is trending negative, we may want to “stop the bleeding” and go ahead and call the test.
What are some of the limitations of split testing or A/B testing?
Due to the nature of how this type of testing is set up – equal weight among all variants throughout the entire test period – the biggest limitation is time. Split testing can take a long time to reach a true or significant result. Collecting enough data to be able to assess whether or not the variant actually won is extremely important. Going back to calling a test, we never want to be in a situation where we claim a variant has won during the test period, push it out to 100%, and then notice performance starts to fall off.
What’s the difference between machine learning testing and A/B/n testing?
Well, first I should clarify that machine learning testing is a very broad subject, and we’ve only begun to scratch the surface of this. One of the techniques that we’ve recently employed and that we’ve had great success with is a method called “Thompson sampling” or “Dynamic Weight Allocation”. In a typical A/B/n test, you’re setting the weight equally and the weight is very fixed throughout the life of the test.
With Dynamic Weight Allocation, more traffic can go to the winning variant faster. So, after a certain amount of “learning time” – typically for us, we want to see about a week of data – traffic can begin to be allocated to the leading variants.
And what that does is two things: a) it exploits the winning variant, so you’re able to capitalize faster on the variant that is performing well, and b) it lets us test faster, because we reach a conclusion faster.
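A minimal sketch of how Thompson sampling shifts traffic, assuming a Beta-Bernoulli model for conversion (the variant names and conversion rates below are invented, and this is a generic textbook version, not Fusion’s implementation):

```python
import random

class Variant:
    def __init__(self, name):
        self.name = name
        self.successes = 0  # converting sessions
        self.failures = 0   # non-converting sessions

    def sample(self):
        # Draw a plausible conversion rate from the Beta posterior
        # (uniform Beta(1, 1) prior before any data arrives).
        return random.betavariate(self.successes + 1, self.failures + 1)

def choose(variants):
    # Show the variant whose posterior draw is highest; stronger
    # performers win more draws, so traffic shifts toward them.
    return max(variants, key=lambda v: v.sample())

random.seed(0)
variants = [Variant("A"), Variant("B")]
true_rates = {"A": 0.04, "B": 0.10}  # invented ground truth

for _ in range(5000):
    v = choose(variants)
    if random.random() < true_rates[v.name]:
        v.successes += 1
    else:
        v.failures += 1

for v in variants:
    print(v.name, v.successes + v.failures)  # B ends up with most of the traffic
```

Unlike a fixed 50/50 split, the allocation here updates after every session, which is the “no delay in the learning” point made below.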
If an analyst was starting out a conversion optimization team, should they start with split testing and then move to Thompson sampling, or should they jump right into Thompson sampling?
I think that depends not only on the familiarity of the analyst but also on how ready the company is to implement Thompson sampling. There is a fair amount of automation needed to make the process work as it should, and if the company isn’t ready for that level of automation, the best approach would be to split test until you are ready. Split testing is certainly better than no testing.
Is Thompson sampling complex to implement?
The formulas and algorithms for Thompson sampling are widely available on the Internet. But that still doesn’t mean they can be easily employed in whatever optimization or testing system you have. When I came in as an analyst, I wouldn’t have known what to do with this algorithm – how to actually put it in place. That’s where we needed experts like Herbert (Herbert Xio) who fully understand the algorithm and know how to put it in place.
It’s one thing to analyze everything offline and learn what you’re doing after the fact. Automated Thompson sampling has allowed us to learn on the fly. As soon as a customer makes a search, that information can immediately be used for the very next customer. There’s no delay in the learning. So, it definitely takes a special skill set combining math and science – and it helps to know how to program in this day and age, too.
So this is a good segue into talking a little bit about machine learning of the future. In the testing world, not many companies are using Thompson sampling right now, but there’s yet another evolution that I know Fusion has just recently been testing and putting into production with a few of its partners, called “contextual personalization” or “CP”. What is CP in machine learning, and what’s different about it from A/B testing and Thompson sampling?
Where CP takes Thompson sampling a step further is not only is there dynamic weighting, but we’re also applying that weight to certain customers. Let’s say Thompson sampling is applying the weights 60/40. In other words, a randomized 60% of the population is seeing the variant that looks like it is winning. With CP, that 60% is no longer random. We’re using “contextual” information, like route, booking window, and channel, to assign that 60%. It’s very similar to the offline analysis that I mentioned above with certain pockets of the customer base actually preferring what might typically be called the “losing offer.” In this situation, customers are now getting the right offer with the right price at the right time.
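One simple way to sketch the contextual idea is to keep a separate posterior per (context, variant) pair, so different customer segments can converge on different winners. The segment names, offers, and response rates below are all invented for illustration – Fusion’s production model is certainly more sophisticated than this:

```python
import random
from collections import defaultdict

# (context, variant) -> [successes, failures]
counts = defaultdict(lambda: [0, 0])

def choose_variant(context, variants):
    # Thompson sampling, but the posterior is specific to this context.
    def draw(v):
        s, f = counts[(context, v)]
        return random.betavariate(s + 1, f + 1)
    return max(variants, key=draw)

def record(context, variant, converted):
    counts[(context, variant)][0 if converted else 1] += 1

random.seed(1)
variants = ["standard_offer", "vip_lounge_upsell"]
# Invented ground truth: business travelers respond to the lounge
# upsell, leisure travelers prefer the standard offer.
true_rates = {("business", "vip_lounge_upsell"): 0.10,
              ("business", "standard_offer"):    0.04,
              ("leisure",  "vip_lounge_upsell"): 0.02,
              ("leisure",  "standard_offer"):    0.05}

for _ in range(20000):
    ctx = random.choice(["business", "leisure"])
    v = choose_variant(ctx, variants)
    record(ctx, v, random.random() < true_rates[(ctx, v)])

for ctx in ("business", "leisure"):
    for v in variants:
        print(ctx, v, sum(counts[(ctx, v)]))
```

After learning, business traffic flows mostly to the lounge upsell while leisure traffic flows mostly to the standard offer – each segment gets the offer it responds to, even though neither variant “wins” overall.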
So a variant that you may have dismissed through A/B testing – “I’m going to scrap that and never show it again” because it lost to the majority – can still be introduced and shown to the right people. The great thing is: a test idea never dies! There are a lot of good ideas put out there, but if the majority doesn’t respond favorably to them, they get deemed a “failure” through A/B testing. With CP, these ideas still exist and are presented to the customers who do respond favorably.
From a testing perspective, it’s kind of like sifting for gold. Some sand and rocks (tests) stay in the pan (machine) but some fall out. Those are the poorest performing, so you pull those out and then grab a little more sand (new variations), swish it around, and see what stays in the pan and what slides out.
This also changes the testing process. New ideas can constantly be introduced and decided upon. Instead of starting and stopping tests, you now have a more continual process where some test ideas are shown to a small population for a much longer period of time.
It sounds like the machine is doing all the analysis, but aren’t there analysts involved all along the way?
Yes. It’s not a replacement, it’s more of an enhancement. While you have the machine doing the learning, there’s a lot of interpretation that still has to happen. That’s where the analysts really become a key part. They need to always be asking: “Does that make sense from a business standpoint?” It becomes the analyst’s job to justify that the machine is making the correct decisions and adjust accordingly. With CP, the part that is done is the digging, but there is still interpretation that needs to happen.
So, it’s a check that the machine isn’t going off the rails and offering things where it shouldn’t be. You’ve got to have somebody who knows what they’re doing to check that, right?
Right. A good example of this is when we were testing CP: it was a Monday, and all of the booking dates were on a Monday. So the algorithm determined that Monday was a high-traffic day and the conversion rate was really high – so anybody that books on a Monday, show them a certain offer. The next day that offer wasn’t shown, because it was Tuesday! So there is a lesson: it “learned” too fast! There is a level of making sure that the appropriate factors are in the model. Part of the pre-work adjustments to the algorithm for us was determining what the right data points or drivers are to put in the model in the first place.
Last question: what do you like most about working at Fusion?
What I like most is that because we are currently a small company, you get to know everyone around you, and the people here are great. You get to interact with them and actually learn what they do, and at a lot of the larger companies that I worked for, that wasn’t always the case. Here, anyone can step in with a great idea and actually be heard.
And those ideas amaze me too. Something so small that you might think has no chance of making an impact on the bottom line, in fact does, and it’s huge! It just goes to show that the old saying “It’s so crazy, it might just work” really does apply in the real world.
Jason holds a Bachelor’s degree in Business from Belmont University and has over 20 years’ experience in e-commerce strategy and web development and design.