Herbert Xiao, Machine Learning Engineer at Fusion
Jason Keough: What is the difference between deep learning and machine learning?
Herbert Xiao: Deep learning is a branch of machine learning that is mainly about neural networks. Neural networks themselves are not a new thing; they were created 60 or 70 years ago, but they weren’t popular until recently because people didn’t have the computation power to build neural networks with really deep architectures. Now people can do that, and that’s why it’s called deep learning. With deep neural networks, people can achieve what shallow neural networks or other machine learning algorithms cannot, such as beating the best human player at the game of Go. That’s why it is so popular now.
Jason: What is the difference between supervised and unsupervised machine learning?
Herbert: The difference is whether the data to learn from is labeled. If it is labeled, it is supervised learning. For supervised learning, the algorithm or model is usually given sample inputs and outputs. The output here is the label of each instance of data. In Fusion’s context, it can be conversion rate, revenue, or click-through rate, things like that. In supervised learning, the task is usually to learn the pattern or the relationship between the input and output. For unsupervised learning, there is no given output. The algorithm just finds certain patterns or commonalities in the data by itself. In Fusion’s context, such a task can be something like segmentation of the customers.
Instead of defining the output, you define the pattern or similarities you desire and let the algorithm itself figure it out. There’s a branch of unsupervised learning algorithms called clustering. It’s just cutting the data into different groups, and in the same group, the objects are more similar to each other than to objects in other groups. It’s just dividing data by similarity, which has to be predefined.
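To make the clustering idea concrete, here is a minimal sketch of k-means, one common clustering algorithm. It is only an illustration: the customer data, the starting centers, and the choice of squared Euclidean distance as the predefined similarity are all assumptions, not anything Fusion is stated to use.

```python
def k_means(points, centers, n_iter=10):
    """Group 2-D points around centers, using squared Euclidean distance as similarity."""
    groups = []
    for _ in range(n_iter):
        # Assignment step: each point joins the group of its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            groups[distances.index(min(distances))].append(p)
        # Update step: move each center to the mean of its group (keep it if the group is empty).
        centers = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers, groups

# Hypothetical customers: (orders per year, average spend). Invented for illustration.
customers = [(1, 20), (2, 25), (1, 22), (10, 200), (12, 220), (11, 210)]
centers, groups = k_means(customers, centers=[customers[0], customers[3]])
```

No labels are supplied anywhere; the only thing defined up front is how similarity is measured, and the groups fall out of that.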
Another example is called “market basket analysis” or “association rule learning”. That is to find the related objects in the data. It is called “market basket analysis” because it started with a prototype question from a supermarket: What items are usually bought together? If the supermarket knows that, they can simply put those related objects next to each other.
Jason: Like peanut butter and jelly.
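The "what is bought together" question can be sketched with two standard association-rule measures, support and confidence. This is a minimal illustration under stated assumptions: the baskets are invented, and the helper names `pair_support` and `confidence` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, invented purely for illustration.
baskets = [
    {"peanut butter", "jelly", "bread"},
    {"peanut butter", "jelly"},
    {"bread", "milk"},
    {"peanut butter", "bread", "jelly"},
    {"milk", "eggs"},
]

def pair_support(baskets):
    """Return the fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

def confidence(baskets, antecedent, consequent):
    """Estimate the share of baskets with `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

support = pair_support(baskets)
best_pair = max(support, key=support.get)  # the most frequently co-purchased pair
pb_to_jelly = confidence(baskets, "peanut butter", "jelly")
```

Real association-rule miners prune the search so they scale beyond pairs; the brute-force count here is only meant to show what is being measured.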
Jason: Do you use existing algorithms for machine learning or do you write them from scratch?
Herbert: I do both. Most of the time, I use existing implementations because the community is very strong now. Most of the common models are already implemented so I don’t need to reinvent the wheel. But sometimes, if the algorithm is really new or rarely used, I have to implement from scratch.
Jason: How do you know which machine learning algorithm or model you should use?
Herbert: It depends on many things. What is the target, or what is expected from this learning task? What resources do you have? By resources, I mean engineering power and computation resources. And also, what is the requirement for accuracy and what is the requirement for interpretability? These two usually pull against each other. If you want higher accuracy, you will most likely end up with less interpretability.
In the credit card industry, this is actually a very severe problem because, in order to be in compliance with the law, your model has to be interpretable or transparent. You must be able to tell the committee how and why each decision is made. That is why neural networks are not widely used in the credit card industry: people cannot find a proper way to interpret the model.
Jason: But it may develop over time to be more accepting of that?
Jason: But just today, it is not really accepted because of the traditional oversight, compliance, and all these legal rules and things.
Herbert: That might change, but what I think is more likely is that people will somehow find a way to interpret the results of those black box models, because this is a really hot research field. People are trying to figure out all kinds of methods or algorithms to ‘unbox’ the black box. One such tool or method Fusion is currently using is called SHAP. It is kind of like a universal approach to all kinds of black box algorithms. It gives you a certain transparency.
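SHAP is built on Shapley values, which split a prediction among the input features. The sketch below computes exact Shapley values by brute force for a tiny, invented scoring function. It does not use the SHAP library and is not a Fusion model; `toy_model`, the feature names, and the baseline are all assumptions for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, instance, baseline):
    """Exactly attribute model(instance) - model(baseline) to each feature."""
    names = list(instance)
    n = len(names)

    def value(subset):
        # Features in `subset` take the instance's value; the rest stay at baseline.
        mixed = {k: (instance[k] if k in subset else baseline[k]) for k in names}
        return model(mixed)

    result = {}
    for feature in names:
        others = [k for k in names if k != feature]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                gain = value(set(subset) | {feature}) - value(set(subset))
                total += weight * gain
        result[feature] = total
    return result

# Hypothetical scoring function with an interaction term.
def toy_model(x):
    return 2.0 * x["price_discount"] + 1.0 * x["is_weekend"] * x["is_mobile"]

instance = {"price_discount": 0.5, "is_weekend": 1, "is_mobile": 1}
baseline = {"price_discount": 0.0, "is_weekend": 0, "is_mobile": 0}
attributions = shapley_values(toy_model, instance, baseline)
```

The attributions add up to the gap between the prediction and the baseline prediction, which is the kind of transparency described above. Brute force grows exponentially with the number of features, so practical tools rely on approximations or model-specific shortcuts.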
Jason: Which tools and languages do you prefer to use to build models?
Herbert: I definitely use Python. It is the most popular programming language for machine learning. I choose it not because it’s popular, but because it’s popular – meaning that there are a lot of open source libraries available, many more than for other languages. There are also newer languages which could be better than Python in the future, such as Julia. But the reason why I’m not using it is because it’s not that popular! So we would have to build many things from scratch.
Jason: What about models? Are there particular names or models that you use?
Herbert: The most important one from a list of Python libraries I usually use is called “scikit-learn”. Also “TensorFlow”, which is probably the most popular machine learning library right now, specializing in neural networks.
Jason: How is Fusion using machine learning to help their customers today?
Herbert: I would say mainly in two aspects. The first one is on the analytical side: using machine learning to get a better understanding of, and more insight from, the data. In this aspect, Fusion is using models like model-based trees to find the groups or segments of customers that react differently to the offers being tested. What makes this kind of special is that it is supervised, not unsupervised. By using clustering or some other unsupervised algorithm, we can find groups where customers have similar features, but they don’t necessarily react differently to offers. If you use supervised learning for this kind of task, you can find groups where the customers have both similar features and similar reactions to the offers.
Jason: Is it “supervised” primarily because there’s very specific KPI like conversion rate or revenue that is driving the overall goal associated with it?
Herbert: Yeah. So it’s not just finding similar customers, but to find what kind of customers react to the aspects of the product, price, message, or display.
Jason: And react to it in a way that is positive to whatever the KPI is.
Herbert: Yes, not just guessing. The second aspect is the improvement of testing and optimization methodology so it is conducted in a more industrialized and automated way. So for the same testing strategy, Fusion can get more potential revenue and less cost to conduct a test.
Jason: Can you dive just a little bit deeper into what exactly Thompson sampling is, the benefit, and how it works? And then maybe touch a little bit on contextual bandits.
Herbert: Sure. So before Thompson sampling, Fusion was using A/B testing. In A/B testing, you allocate a portion of traffic or customers to each branch (A or B), not necessarily equally but statically. With this method, you let the test run for a certain period of time and you calculate a significance to determine which one is the best. This method is very stable and dependable. However, you potentially lose revenue because if a branch performs really badly, you still have to allocate a lot of traffic to that branch during the test. What Thompson sampling does is like doing A/B testing in a dynamic way. It adjusts the traffic allocated to each branch according to the real-time performance. By doing this, you can achieve more revenue while maintaining the same level of dependability or accuracy.
Jason: So the algorithms are essentially doing the analysis on the fly and as sales are happening, it is determining whether or not to put more people into one area or another.
Herbert: Yes, it can happen in real-time after each transaction.
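The dynamic allocation described here can be sketched as Beta-Bernoulli Thompson sampling, the textbook form for yes/no outcomes such as conversions. This is a simulation under stated assumptions, not Fusion's implementation: the conversion rates, the uniform prior, and the function name `thompson_sampling` are all made up for illustration.

```python
import random

def thompson_sampling(true_rates, n_visitors, seed=0):
    """Simulate Thompson sampling over branches with hidden conversion rates."""
    rng = random.Random(seed)
    wins = [0] * len(true_rates)    # conversions observed per branch
    losses = [0] * len(true_rates)  # non-conversions observed per branch
    for _ in range(n_visitors):
        # Draw one plausible conversion rate per branch from its Beta posterior
        # (uniform Beta(1, 1) prior), then route the visitor to the highest draw.
        draws = [rng.betavariate(won + 1, lost + 1) for won, lost in zip(wins, losses)]
        branch = draws.index(max(draws))
        # The posterior is updated after every single transaction.
        if rng.random() < true_rates[branch]:
            wins[branch] += 1
        else:
            losses[branch] += 1
    traffic = [won + lost for won, lost in zip(wins, losses)]
    return traffic, wins

# Hypothetical branches: A converts at 5%, B at 15%.
traffic, wins = thompson_sampling([0.05, 0.15], n_visitors=5000)
```

Early on both branches receive traffic because their posteriors overlap; as evidence accumulates, the weaker branch is sampled less and less often, which is where the revenue saving over a static split comes from.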
Jason: Is there any difference between the terminology used for A/B bandits or Thompson sampling?
Herbert: No. I think “A/B bandits” is kind of a made-up term. There really is only A/B testing or Thompson sampling. Actually, Thompson sampling is not from the same family of algorithms as A/B testing. It was originally designed for a prototype question called the ‘bandit problem’ or the ‘explore/exploit’ dilemma. But because the Thompson sampling output (the weights) can also be interpreted as a significance level, we can use these two algorithms together.
Jason: I just know I’ve heard both used and I wasn’t sure what the difference was.
Herbert: Think of Thompson sampling as a dynamic version of A/B testing.
Jason: Tell me about some of the things that you’re working on to improve machine learning?
Herbert: ‘Contextual bandits’ is one of the new things: It’s about using additional ‘contextual’ information as opposed to just the revenue or performance of each branch. So the algorithms are trying to find the relationship and pattern between this additional information or feature and the performance.
Jason: Can you explain a little more about the concept of contextual bandits with features?
Herbert: Sure. There are two types of features. The first one is the features of the offers, such as price, header, body, footer. It can also be color, size or location of the header or anything associated with the offer. The second type of features is more contextual like the customer and transaction information that can be used for personalization. Data like day of the week, time of the day, city of departure or traveler or something like that. Then, the algorithm takes the contextual data and estimates the effect on each of the features and eventually provides the best offer for each customer.
Jason: So how does ‘contextual bandits’ compare to A/B Testing and Thompson sampling?
Herbert: In both A/B testing and Thompson sampling, the goal is to find the branch with the best overall performance. With contextual bandits, you can capture potential revenue that A/B testing and Thompson sampling cannot. With those two algorithms, even though one branch has the best overall performance, some of your customers might respond differently if you put them in another branch. With contextual bandits, it’s not just about selecting the branch with the best overall performance; it’s about getting better revenue from each branch or candidate. So each candidate in contextual bandits can generate more revenue per transaction than with A/B testing and Thompson sampling.
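A deliberately simplified sketch of the contextual idea follows. It keeps a separate Beta posterior for every (context, branch) pair, which is the most basic way to condition on context. It is not the feature-level model described in this interview, and the contexts, rates, and function name are invented for illustration.

```python
import random
from collections import defaultdict

def contextual_thompson(true_rates, n_visitors, seed=0):
    """true_rates[context][branch] is the hidden conversion rate for that pair."""
    rng = random.Random(seed)
    contexts = list(true_rates)
    stats = defaultdict(lambda: [0, 0])  # (context, branch) -> [wins, losses]
    chosen = defaultdict(int)            # (context, branch) -> visitors routed
    for _ in range(n_visitors):
        context = rng.choice(contexts)   # e.g. the visitor arrives on a weekday
        draws = {}
        for branch in true_rates[context]:
            won, lost = stats[(context, branch)]
            draws[branch] = rng.betavariate(won + 1, lost + 1)
        branch = max(draws, key=draws.get)
        converted = rng.random() < true_rates[context][branch]
        stats[(context, branch)][0 if converted else 1] += 1
        chosen[(context, branch)] += 1
    return chosen

# Hypothetical rates: A wins on weekdays, B wins on weekends, and they tie overall.
rates = {
    "weekday": {"A": 0.15, "B": 0.05},
    "weekend": {"A": 0.05, "B": 0.15},
}
chosen = contextual_thompson(rates, n_visitors=6000)
```

In this invented setup neither branch has the best overall performance, so a context-free test has nothing to pick, while the contextual version learns a different winner for each context.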
Jason: Tell me about some of the things you’re working on to improve machine learning at Fusion.
Herbert: OK. I’m currently working on contextual bandits, and there’s one step or phase of contextual bandits which is to estimate the parameters of each factor. People usually use a family of algorithms called Markov chain Monte Carlo (MCMC). We have tried that, and the problem with MCMC is that it is very computationally demanding. In other words, it’s super slow! Because Fusion has a huge amount of data, MCMC is not very suitable for this situation.
So to solve this problem, we are doing research and experiments on alternative algorithms and applications. For algorithms, we are currently trying out ‘variational inference’. And we also use approximations. Another solution we are looking at is to use ‘TensorFlow Probability’, which also has MCMC implemented, but because that library can run on GPUs, it can compute much faster than on our current CPUs. That’s one aspect.
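To give a feel for why MCMC is expensive, here is a minimal random-walk Metropolis sampler, one of the simplest members of the MCMC family. It estimates a single conversion rate from invented counts under a uniform prior; the data, step size, and burn-in fraction are assumptions, and a real contextual bandit model would have many more parameters.

```python
import math
import random

def log_posterior(p, conversions, visitors):
    """Log of the unnormalised posterior for rate p under a uniform prior."""
    if not 0.0 < p < 1.0:
        return -math.inf
    return conversions * math.log(p) + (visitors - conversions) * math.log(1.0 - p)

def metropolis(conversions, visitors, n_steps=20000, step=0.05, seed=0):
    """Draw posterior samples of the conversion rate with random-walk Metropolis."""
    rng = random.Random(seed)
    current = 0.5
    current_lp = log_posterior(current, conversions, visitors)
    samples = []
    for _ in range(n_steps):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_posterior(proposal, conversions, visitors)
        # Accept with probability min(1, posterior ratio); otherwise stay put.
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples[n_steps // 4:]  # discard the first quarter as burn-in

# Hypothetical data: 30 conversions out of 200 visitors.
samples = metropolis(conversions=30, visitors=200)
posterior_mean = sum(samples) / len(samples)
```

Every step re-evaluates the likelihood, and thousands of sequential steps are needed even for one parameter. With a full dataset and many parameters, that cost is what motivates faster alternatives such as variational inference or GPU-backed samplers.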
Jason: Interesting. On a personal level, what future applications of AI and machine learning are you most excited about?
Herbert: It’s definitely the biological/medical applications. For example, I read an article about research at Ohio State University where they helped a guy with a broken spine to play guitar again. What they did was use machine learning to translate the signals from the brain into signals for the hand.
Jason: Wow, that’s pretty cool.
Herbert: Another field I’m really interested in is utilizing machine learning and AI to research and design new medicines. The current way of researching new medicines is somewhat undirected and random, but with machine learning, people can find a lot of very complex and intrinsic patterns in the molecules and truly achieve active ‘designing’ of new medicines.
Jason: So is that a combination of bio-engineering and chemical engineering?
Herbert: Yes, with the combination of biology, biochemistry, and machine learning, it may enable us to cure cancer or AIDS someday.
Jason: That would be something! Hey, thanks for taking some time to chat today, I really appreciate it!
Herbert: Thank you, it was fun!
Jason holds a Bachelor’s degree in Business from Belmont University and has over 20 years’ experience in e-commerce strategy and web development and design.