I know it is Saturday evening, and I believe you would love to be anywhere else instead of sitting here thinking, but for now I would like to share a bit about reinforcement learning for marketing budget optimization. Trust me, this will be interesting. But for those of you thinking, "wait, there is marketing in there, isn't that too commercial?", let me clarify it this way.
I am using this use case to show that reinforcement learning can tackle a real business problem, and the way we solve it is broad enough that you can apply it to many other problems in your own industry. By the end of this presentation I will also mention a bit about how it could transfer to other industries. So, let's start; this will take the next thirty minutes, I think, or forty-five.
We will discuss the topic like this: we start with the background, discussing the general marketing budget problem, then we discuss the reinforcement learning approach, and then we combine them all and check the result. Finally, I will give a few closing remarks. So, why the marketing budget problem? I know it sounds a bit commercial. But if you look at the actual budget figures, they have a lot of zeros behind them.
So it is no longer a trivial problem. We need to be quite wise in managing that budget allocation, not only because it is about money, but because it relates to how our company will grow in the future, and at the same time we want to make sure the money is used well, for the financial stability of the company. Let me share a bit about how it works. Basically, our marketers will define a particular level of budget and allocate it to a particular marketing channel.
From that channel, users will see the ads, and those views are counted as impressions. A fraction of those users will click the ads, and those are counted as clicks. After clicking, a fraction of them will be encouraged to buy a flight ticket or a hotel booking, and those are counted as customer acquisitions; that is where we earn our money. Looking at this funnel, we can already see the problem.
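To make the funnel concrete, here is a tiny sketch with invented numbers; none of these figures come from the talk, they are purely for illustration:

```python
# Hypothetical funnel numbers -- purely illustrative, not real data.
budget = 10_000.0        # money allocated to the marketing channel
impressions = 500_000    # users who saw the ads
click_rate = 0.02        # fraction of impressions that became clicks
purchase_rate = 0.05     # fraction of clicks that became acquisitions

clicks = int(impressions * click_rate)
acquisitions = int(clicks * purchase_rate)
cpa = budget / acquisitions   # cost per acquisition: finance's KPI

print(clicks, acquisitions, cpa)
```

Marketers want `acquisitions` up; finance wants `cpa` down. That is exactly the tension I describe next.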
We know there are two different, completely opposing objectives. On one side are the marketers: if their KPI is solely to increase customer acquisition, they will be very encouraged to pour out as much money as possible to increase acquisitions. But on the other side there is the finance department, and they will not like that: they want to make sure the budget we pay to the marketing channel is not unbounded.
They want to make sure it stays within our budget. So these two departments could keep debating, maybe all day long, maybe they keep it to an hour, but in general these two objectives pull in opposite directions inside our company. So how do we approach this problem? How do we make sure the budget we give to the marketing channel is used efficiently, while at the same time managing to increase our customer acquisition?
Let's take a step back and approach this problem through another analogy, because my background is in supply chain. Let's take an analogy from manufacturing. Say you have a factory and you have just bought a robotic arm, and you want to make sure this robotic arm can pick up boxes for you, because paying people is now more expensive. We can look at implementing this from two different perspectives.
The first part is the optimization perspective, and the pink part is the reinforcement learning perspective. The optimization perspective is the conventional one: you minimize the cost of the actions taken at time t, with t running from 1 to n. Imagine the robotic arm starts at position zero; at time 1 it moves up, maybe 15 degrees, at time 2 maybe 30 degrees, and so on until time 10, for example, when the robotic arm is able to pick up the box.
But that requires planning, and planning requires optimization. Imagine we shift the box: we would then need to rerun the optimization algorithm to make sure the actions taken still fulfill our requirement. Now let's move to the right path. In reinforcement learning it sounds easy: we just let the robotic arm try to pick the box, and if the arm takes the wrong action, say the box is on the left side and the arm moves to the right, we tell it, "that's wrong, you should move a bit to the left". That is the basic idea of how reinforcement learning fits our analogy. Keeping this analogy in mind, think of the robotic arm's movements as the decisions the marketers are required to take on a daily basis in terms of budget allocation: on day one they need to allocate a particular budget, and on day two
they also need to allocate a particular budget, and so on until the end of the month, for example, while making sure the whole KPI, for example the efficiency of their budget, is optimized. Understanding that, we finally chose to use reinforcement learning. Why? If we look at this table, there are three features to compare: ease of development, computational efficiency, and new data. On the first two, I think,
we can say that optimization and reinforcement learning are on the same ground: I could take either one, whichever method, because in optimization I only need to think about the objective and the constraints, and in reinforcement learning I still do that too, except I need to define a reward function instead. Computational efficiency in optimization depends on how you formulate the problem. If the optimization problem is linear and solvable, it may take only one or two seconds to solve. But if the problem is nonlinear, it will take some time to explore the solution space, and that is not good, especially when your marketers are very demanding. And if you use reinforcement learning, the training itself takes quite a while, even for a simple problem.
But the main reason for us to use reinforcement learning is the third feature I mentioned. We expect new data on a daily basis, and if we implemented optimization here, having new data daily implies having a new solution space daily as well, which would require us to keep rerunning the algorithm until we find the optimal solution. That is quite tedious, actually, and that is why we propose reinforcement learning.
In the beginning the training takes a lot of effort, but it pays off, because it allows us to learn parameters that can map the current state to the proper action to take to maximize value. Let me check: who here is from computer science? Okay, so you guys are smarter than me. We will talk a bit about reinforcement learning; I think you already learned it in college, so if I say something wrong, please correct me, because I know you are from computer science.
This is just my simple understanding of what reinforcement learning is. There is an agent and an environment. The agent has a current state, and in that state it takes an action. The action is received by the environment, which here is our marketing channel problem, and from that action the environment gives a response in terms of a reward, and that reward allows the agent to move forward to a new state.
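The loop just described can be sketched in a few lines of Python. The environment here is a toy stand-in of my own invention, nothing in it comes from our actual budget model:

```python
import random

def env_step(state, action):
    """Toy environment: the reward is higher the closer the action
    is to a target of 0.5 that the agent does not know about."""
    reward = -abs(action - 0.5)
    next_state = state + 1   # the state here is simply the day index
    return next_state, reward

rng = random.Random(42)
state, total_reward = 0, 0.0
for day in range(30):            # e.g. thirty daily budget decisions
    action = rng.random()        # a trained policy replaces this
    state, reward = env_step(state, action)
    total_reward += reward
```

A real agent would replace the random action with a learned policy; the state-action-reward-next-state cycle stays the same.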
To understand the terms further, this part is for those of you who have not learned reinforcement learning. We have a model, which is used to describe the environment, and a policy, which describes how the agent takes action. There is also a value function, and in my understanding there are many ways to define the value function in reinforcement learning. It describes the worth of an action, and I think this is important, because the way you define the reward defines how the agent takes actions; when you want to make sure your agent takes proper actions, you need to define the value function carefully. Here, reward refers to the feedback based on the current state.
Value, on the other hand, refers to the cumulative discounted reward; if you look at the blue part on the slide, the discount factor, we make it smaller than one so that it discounts future rewards. Then there are state value, action value, and advantage. The last three are closely related: the state value is basically the expectation of future reward given a particular state, the action value is the expectation of future reward given both the state and the action, and the advantage is the difference between the action value and the state value. To make it more intuitive,
Think of it this way: a higher action value implies that the decision taken, given a particular action, is better than the average, because the state value is basically the average over everything, while the action value is the expectation given that particular state and action. Now we want to learn more about the agent, so let's discuss the left part of the diagram. We know the agent takes actions, but how does it learn to take good actions? There are many ways for an agent to learn that.
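In code, the relationship between these three quantities is just the following (numbers invented for illustration):

```python
def advantage(action_value, state_value):
    """A(s, a) = Q(s, a) - V(s): how much better taking this
    particular action is than the average action in this state."""
    return action_value - state_value

# If Q(s, a) = 12 while V(s) = 10, the action beats the average:
assert advantage(12.0, 10.0) == 2.0
# If Q(s, a) = 8 while V(s) = 10, the action is worse than average:
assert advantage(8.0, 10.0) == -2.0
```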
What I want to explain here is the policy gradient method. Basically, the agent has particular parameters that the policy uses to take an action; that action is implemented in the environment, the environment produces a particular value, and from that value there is an optimization process in which the value function is used as the basis to optimize the parameters used in the policy. That is just my rough understanding; to be more specific,
what we are using is Proximal Policy Optimization (PPO). It is specifically a type of policy gradient method, except for the three different parts I have colored on the slide. We still have an agent with particular parameters and a policy, but we use the parameters to produce an action sampling distribution instead of the action itself. Because we are facing a marketing budget problem here, our action space is continuous,
so taking a deterministic action will not work well; that is why we need action sampling, and the sampled action is implemented in the environment. Then, from the value function, we use the advantage as the basis for the optimization, and we also implement a sort of constraint to make sure the update on the policy parameters is not drastically different, because a drastic update in PPO means the algorithm will behave very unstably.
So here I would like to emphasize why we chose PPO; there are four points. The first, maximizing reward based on the policy, is the strength of PPO, because it is part of the policy gradient family. It also performs better in continuous spaces, the use of action sampling allows better exploration, and the use of the advantage reduces variance.
The fourth point is, I think, the key to understanding PPO, because PPO's predecessor is TRPO, Trust Region Policy Optimization, and trust region optimization is different from the optimization we usually see. Trust region optimization is a different family: in machine learning we usually understand gradient descent as part of the line search family.
Line search basically steps based on the gradient, but in trust region optimization you step only within a certain trust region: you do not want to step further if it is outside your trust region, because you basically do not trust anything beyond the trust region boundary. In PPO this is implemented using the clip mechanism, and it helps the algorithm perform in a stable way, which helps us during the training period.
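Here is a minimal sketch of that clip mechanism, with an invented epsilon of 0.2. PPO's clipped surrogate takes the pessimistic minimum of the raw and clipped terms:

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped objective for one sample.
    `ratio` is pi_new(a|s) / pi_old(a|s); it is clipped to
    [1 - eps, 1 + eps] and we keep the pessimistic minimum."""
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# Inside the trust region the objective is untouched:
assert clipped_surrogate(1.1, 2.0) == 2.2
# A big jump in the ratio is capped, so the update stays small:
assert clipped_surrogate(3.0, 2.0) == 2.4
# With a negative advantage the minimum stays pessimistic:
assert clipped_surrogate(3.0, -1.0) == -3.0
```

In training this quantity is averaged over a batch and maximized with respect to the new policy's parameters.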
Now let's talk about the environment, the part on the right. The thing is, a real environment is expensive, and it is unlikely that your company will pay for your experiments. What we do here is replace this complicated part with a single model. It is a simplification: basically we try to imitate what happens in the real environment, so that we can use the model to simulate what happens in the real-world problem. But how can we create our model? There are two different approaches.
First: who here has heard about regression? Can you raise your hand? Should I explain? Okay, so for those who already understand, bear with me for a moment. Regression is basically a way for us to map input variables to output variables. In ordinary regression we treat the data as a random sample, the parameters are assumed to be fixed, and the difference between the data and the model output is treated as error.
That is why we only see one line: the parameters are assumed to be fixed, and everything else is computed as error. But if we look at that one particular line, we see there is no uncertainty in it, because everything else is assumed to be error, and we cannot use this kind of regression in our reinforcement learning algorithm, because you want to make sure your agent is able to adapt as much as possible.
But if you teach the agent with a fixed input, you are doing something very unproductive. So what we do here is look at Bayesian regression. Has anyone heard about Bayesian regression? No one? Bayesian regression is basically, I think, another way for us to approach modeling: instead of treating the parameters as fixed, we treat them as probabilistic. As the second point says, we will need priors for that, but it is worth the effort, because if you look at the plot there are a lot of lines: it tries to capture the potential uncertainty, and that is what happens in our real world, and that is why we are using Bayesian regression.
So in the environment we are using Bayesian regression. Why? Because even though the training takes effort, we are able to capture uncertainty, and that allows us to teach the agent to take smarter decisions. Now let's combine it all. So far we have learned about the reinforcement learning agent, the reinforcement learning environment, and Bayesian regression, but how do we put things into place?
I have created a very simple map of how to combine these different things into one reinforcement learning algorithm that allows us to take smarter decisions. We will learn it in sequence: we start with the modeling first, the first part, then the beginning of the algorithm, which is the second part, and then the third part, the final part of the algorithm. So let's start with the first part. Remember that in the modeling we are using Bayesian regression: instead of assuming that y equals a times x plus b, we assume that y follows a particular distribution, and that is what happens here. Say that y is impressions and x is the action and the holiday flag: we treat impressions as a random variable that follows a negative binomial distribution with parameters mu_I and theta, where the parameter mu_I is defined as the exponential of the linear term inside the bracket. If you look at that part inside the bracket, it is similar to linear regression, correct? So it is like one step forward, going from the usual linear regression, which is not able to capture uncertainty,
to another method, Bayesian regression, which can. The same goes for link clicks: we assume link clicks also follow a negative binomial distribution, with parameters following the equation inside the brackets. And in the end we have the reward, which is defined from the budget divided by link clicks, with a log function and a minus sign in front, and this will be an input for component three.
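Here is a rough sketch of that generative model in plain Python. The coefficients are invented, the negative binomial is drawn via its Gamma-Poisson mixture, and my reading of the reward as the negative log of budget over clicks is an assumption, not a verbatim formula from the slides:

```python
import math
import random

rng = random.Random(0)

def mu(action, holiday, b0=2.0, b1=0.8, b2=0.3):
    """Log-link mean: exp(b0 + b1*action + b2*holiday).
    The term inside exp() is the 'looks like linear regression'
    part; the coefficient values here are made up."""
    return math.exp(b0 + b1 * action + b2 * holiday)

def neg_binomial(mean, theta):
    """Negative binomial via the Gamma-Poisson mixture:
    lam ~ Gamma(shape=theta, scale=mean/theta), y ~ Poisson(lam)."""
    lam = rng.gammavariate(theta, mean / theta)
    # Knuth's Poisson sampler -- fine for modest means.
    threshold, k, p = math.exp(-lam), 0, 1.0
    while p > threshold:
        k += 1
        p *= rng.random()
    return k - 1

def reward(budget, clicks):
    """Assumed reward: -log(budget / clicks), i.e. the negative
    log cost-per-click, so cheaper clicks mean higher reward."""
    return -math.log(budget / clicks)

impressions = neg_binomial(mu(1.0, 1), theta=5.0)
```

In the real pipeline the coefficients and theta come from a Bayesian posterior rather than fixed values, which is what gives the environment its uncertainty.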
If you remember, component three is the last part of the algorithm. Next question: how do we implement Bayesian regression? Why PyStan? Because it has extensive documentation, it is quite reliable, and it has a variety of functions; for those who have read its documentation, it is quite extensive, and it can capture several functions that may not be available in other Bayesian libraries such as PyMC or Edward. Next, the second part. I have tried to simplify this part so it looks like the policy gradient diagram, but in our implementation we modify it a bit so there is also a part that helps the agent predict the reward that the environment will give. So instead of having a policy network only, we also have a value network. The policy network, as its name says, is the network used to take a good decision: it produces action parameters, and those parameters are used for action sampling, which goes to the first part of our map, the environment component. The value network is used to predict the state value,
and this goes to the third part of our map. For both the value network and the policy network, the input is only the state; if you want to make it more complicated you can also feed the action into the networks, but that is for other cases. And then there is the third part, where we combine everything into our reinforcement learning algorithm: we calculate the advantage and make sure everything is optimized. How? Remember that for this third part we have two different inputs: the reward from the first part, the modeling in PyStan, and the predicted state value, which comes from the second part of the map.
We can use those two to optimize the parameters of our value network, and at the same time we can use them to calculate the advantage, then make the adjustment using the importance weight and the clip mechanism, and use that to optimize the policy network, improving its parameters. Okay, that is it for the algorithm.
We use PyTorch here. Why? Because this is still our initial study; we are planning to develop it more, but for now, because it is an initial study, it requires a lot of trial and error to check what happens when it is exposed to the real environment. In short, we use PyTorch for its flexibility, because it is easy to debug, and also for convenience, especially with Pyro: when you realize you want a variational Bayesian mechanism involved in your algorithm,
having Pyro in your stack is quite convenient, because it allows us to explore probabilistic programming in our neural network. So, the results: if you look here, after 100 iterations the loss has pretty much converged and, importantly, the reward also keeps increasing until it saturates at around 8,800 iterations. From there, I would like to map what we have done so far: we took a real-world problem, used it to model the agent and the environment so that we have a proper model, and then we trained it and checked the result, which is capable of acting as
a simple, rational decision-maker. But what is next? In terms of model components, especially the PyStan part, personally it is my interest to add dependence on the other states, and maybe also to add seasonality and trend, because in this initial study we tried to make it as simple as possible while keeping it rational. And in terms of other projects, this is for those of you who think marketing is too commercial: with the framework I have explained,
I think implementing this framework in production planning, logistics, and industrial automation is also worth trying. Especially because, if you think further, in production planning and logistics the ones who implement the plans are usually humans, and most of the time they are quite flexible and adaptable to what happens in the real environment; but at the same time they have biases, and those biases sometimes make their decisions less than rational. I think it is worth trying to see what happens when we develop a smarter agent for automatic decision-making, so that we can at least guide the humans who execute the plans and make sure the decisions stay rational along the way.
So, for references: this talk is a compilation of the work of many people, especially the PPO paper from Schulman et al., and also a lot of tutorials from many other articles. For those who want to learn more about Bayesian statistics, I suggest the tenth reference. This is also the result of me asking so many people, and I am glad they have not kicked me out yet; thanks also to the marketing team that allowed me to explore this use case for PyCon.
Thank you. [Audience question about the clip mechanism.] Okay, so let's say you have a solution space, so to speak, but in PPO you try to limit yourself. Let me start again. You have this solution space, so to speak, and you are at this point, and the optimization algorithm says: if you want your parameter to be optimal, you have to jump that far. But in PPO you do not want to jump that far, because jumping that far will make your algorithm unstable. So you limit yourself to a tiny region, to make sure you are jumping not that far but only something this small, and that constraint is implemented in a simplified way using the clip mechanism. The clip mechanism basically just says:
"I only allow myself to step maybe 1 centimeter instead of 9 kilometers." [Follow-up question.] Oh, so it is a tuning hyperparameter in PPO; it is like an epsilon parameter, where you allow between 1 minus epsilon and 1 plus epsilon. [Question about the data.] So, this is the way the data is presented: basically we have a table that contains historical data, and that historical data has an x in terms of the action and a holiday flag,
whether the action was taken on a holiday or not, and we use it to train how many impressions, how many link clicks, and how much reward there will be. This model is used as an imitation of the real environment, so it acts as the black box here. If you have read a bit about how to implement reinforcement learning, most of the time people are not doing this sort of modeling; most of the time they use OpenAI Gym: you try to run a simple pendulum, or make a car move around, but that physics mechanism is already defined by the Gym library. Here we replace that library with Bayesian regression. Very good question. [Question about sharing the code.] My code is quite messy right now, and because of the limitation of time I have not submitted it yet; at the same time, I need to make sure my business trip is reimbursed.
So I think after this you can email me, and maybe I can make sure the code is neat enough to be shared. [Question about the baseline.] Oh no, the one we compare against is the basic one, because if we intended to replace the human with this agent right away, people would be very afraid, correct? So what we compare is the decision the real human would take against the result of this algorithm, and the difference is not that big. I am not expecting the algorithm to be the most optimal,
but the difference is much less than 5%, so I think, for the potential of having the optimization automated, it is worth trying in the future. [Question: is this model-free?] Oh, I think this one is model-free, yes; PPO is model-free. So for a particular marketing channel, we work with the marketing teams, and they give us a collection of historical data: for day one, how much money they paid and how many impressions and link clicks they got;
so it is sort of that data source.
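To make the "fitted model as a black-box environment" answer concrete, here is a sketch of a Gym-style wrapper. Everything here is invented for illustration: the interface, the toy model, and the reward form (again assumed to be a negative log cost-per-click):

```python
import math

class BudgetEnv:
    """Gym-style environment whose dynamics come from a fitted model
    instead of a physics engine. `model` is any callable mapping
    (action, holiday) -> (impressions, clicks); this interface is an
    assumption, not the talk's actual code."""

    def __init__(self, model, horizon=30):
        self.model = model
        self.horizon = horizon
        self.day = 0

    def reset(self):
        self.day = 0
        return self.day

    def step(self, action, holiday=0):
        impressions, clicks = self.model(action, holiday)
        reward = -math.log(action / max(clicks, 1))  # assumed form
        self.day += 1
        done = self.day >= self.horizon
        return self.day, reward, done

def toy_model(action, holiday):
    """Deterministic stand-in for the Bayesian posterior predictive."""
    impressions = int(1000 * action)
    clicks = int(impressions * 0.02)
    return impressions, clicks

env = BudgetEnv(toy_model, horizon=3)
state, done = env.reset(), False
while not done:
    state, reward, done = env.step(action=100.0)
```

In our setup, `toy_model` would be replaced by draws from the Bayesian regression, so each episode sees a slightly different environment, which is exactly the uncertainty we want the agent to learn under.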