Q-learning powered game-bot!

Introduction and Overview

Experience through wins and losses, leading to refinement in decision making, is the crux of the model that I will be discussing in this article. Q-learning (the ‘Q’ standing for quality) is an off-policy learning model based on ‘states’, ‘actions’ and ‘rewards’. It is highly suitable for situations where the number of possibilities is too high for a computer to process exhaustively. Board games such as chess and Go are examples where reinforcement-learning methods of this family have flourished, beating the top players in the world.

For my project, I have created a game-bot that plays the popular game of Tic-Tac-Toe, but on a 4x4 board instead of a 3x3 (three consecutive X’s or O’s still count as a victory). I set the rules so that the chances of losing are higher than in a normal 3x3 game, making the learning curve for the algorithm steeper. All the code can be found in the ‘4x4TicTacToe’ repository on my GitHub profile.

A ‘state’, in this case, is a given situation on the board. An ‘action’ is a move that the bot can take from that state. We start from a randomly generated state and take actions on it, with the bot playing against itself. When one of these ‘state-action’ sequences ends in a win, a ‘reward’ of +1 is awarded to the final action that produced the victory. From there, the Q-learning update is applied, so that the moves leading up to that victory receive progressively ‘discounted’ rewards.
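
For reference, here is the standard Q-learning update rule that the rest of this article builds on, written in the same notation used later in the post, with learning rate alpha and discount factor gamma:

Q(s,a)  <-  Q(s,a) + alpha * [ reward(s') + gamma * maxQ(s', all a') - Q(s,a) ]

(The simple table-building code further below skips the learning-rate term and writes the discounted values into the table directly.)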

The state-action Q-value table generated by continually evaluating Q-values is essentially the bot’s model of how good each move is in a given position. It starts off knowing nothing, performing random moves and winning or losing; at this stage it is far more naive than a human player, with almost no cognitive capabilities. Within a few training iterations, however, it reaches a level of play that is hard for a casual human player to beat.

The only hyper-parameters of this algorithm are the learning rate and the discount rate. The learning rate controls how much weight is given to new information entering the system. The discount rate, as mentioned above, controls how strongly the bot prefers immediate rewards over rewards that lie several moves deeper in the game.
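
As a quick illustration of the discount rate: with gamma = 0.9 (the value used throughout the code below), a reward that lies k moves in the future is weighted by gamma^k when it is propagated back to the current move:

gamma^1 = 0.9,  gamma^2 = 0.81,  gamma^3 = 0.729, ...

so a win three moves away is worth roughly 73% of an immediate win, which is what makes the bot prefer quicker victories.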

In very simple games, a Q-table is implemented directly: it contains all possible state-action pairs with their respective Q-values computed and stored. In the game that I propose, this amounts to roughly 14.5 million states × 16 possible actions ≈ 232 million state-action pairs. Even at 4 bytes per value the table alone approaches a gigabyte, and, more importantly, computing every one of those values takes prohibitively long (the timings are discussed below). This computational impracticality is overcome by using a function approximator instead of the actual Q-table. I have used a neural network for this purpose.

Overall implementation

Forming the Q-table*

The first thing we need to do is program the bot to learn which moves are winning moves. The bot learns this through experience, over lots of playing and exploring, using the update rule above. The table comprises many states, with values for most of the actions possible in each state. The longer we keep building this Q-table, the better optimized our bot’s moves become. Here’s the code for the core function that calculates the Q-values for a given state and puts them into the table.

import numpy as np

def QtableUpdate(table, tstate):
    # Walk backwards from a terminal state 'tstate', removing one move at a time
    # and writing a discounted value for each (state, action) pair into 'table'.
    # RemoveAction, TermStateCheck, index_2d and getActionNumber are helper
    # functions defined elsewhere in the project.
    gam = 1
    for i in range(int((np.count_nonzero(tstate) + 1) / 2)):
        tstatepresent = False
        action = RemoveAction(tstate)            # pick a move to undo
        tstate = np.subtract(tstate, action)     # the state before that move
        if TermStateCheck(tstate):
            break
        # check whether this state is already in the table
        for checker in range(len(table)):
            if np.array_equal(table[checker][0], tstate):
                tstatepresent = True
        if tstatepresent:
            row1 = index_2d(table, tstate)
            col1 = getActionNumber(action)
            # keep the best (largest) value seen so far for this state-action pair
            if gam > table[row1][col1]:
                table[row1][col1] = gam
        else:
            # new state: append a fresh row with the state and 16 zero-valued actions
            table.append([])
            table[len(table) - 1].append(tstate)
            for k in range(16):
                table[len(table) - 1].append(0)
            row1 = index_2d(table, tstate)
            col1 = getActionNumber(action)
            table[row1][col1] = gam
        gam *= 0.9                               # discount one step further away from the win
	

This function is iterated over many randomly generated terminal states containing 5, 7, 9, …, 15 total marks, to arrive at the final Q-table.
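
A minimal driver sketch for that iteration could look like the following; RandomTerminalState is a hypothetical helper here (a function returning a random finished board with the given number of marks), so substitute whatever generator the repository actually provides:

table = []
for marks in range(5, 16, 2):                    # terminal states with 5, 7, ..., 15 marks on the board
    for _ in range(1000):                        # however many samples per size you can afford
        tstate = RandomTerminalState(marks)      # hypothetical helper, see note above
        QtableUpdate(table, tstate)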

We do not actually need to store the calculated Q-values in a table, for the reasons mentioned above. The code given here is only meant to give the reader insight into the math behind the Q-values and the Q-table computation. It will be reshaped and used in a slightly different way, to compute target Q-values for a neural network instead of saving them into a table. This is 2017, not 1992.

Neural Network implementation

As can be seen from my timing runs, the time taken per 1000 states keeps growing as the saved Q-table gets larger, by roughly 30 seconds per batch. This makes the table approach impractical: both the amount of data that has to be saved and the computation time, about 86 hours for a single epoch (episode) of Q-table calculation, are prohibitive. Instead, a neural network does the job of mimicking the Q-table by fitting states to their corresponding optimum actions.

So, here’s the final procedure that I have followed to train the bot:

  1. Set up a fully connected neural network to accept a state of the board, and give out Q-values for all possible actions.
  2. In every instance of training, feed in a randomly generated state (s). With probability ε (epsilon), take a random action (a); otherwise take the action (a) predicted by the neural network from its output array of Q-values (qval).
  3. In the new state (s'), get the maximum of the Q-values predicted by the neural network (again); call this maxQ.
  4. Build the training target: take qval and replace the element corresponding to action a with reward(s') + (gamma * maxQ). The loss is the mean-squared error between the network's output and this target.
  5. Train for many randomly generated states.

NOTE : If s’ is a terminal state, maxQ = 0.
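
To make steps 2-4 concrete, here is a minimal single-step sketch. It assumes a Keras-style model that maps a 16-element board to 16 Q-values; is_terminal and reward_of are hypothetical stand-ins for the helper functions used in the full training code below.

import numpy as np

def training_target(model, s, a, s_next, gamma=0.9):
    # start from the network's current predictions so that only the chosen action's entry changes
    qval = model.predict(s.reshape(1, 16), batch_size=1)
    if is_terminal(s_next):                      # hypothetical helper; see the NOTE above (maxQ = 0)
        maxQ = 0.0
    else:
        maxQ = np.max(model.predict(s_next.reshape(1, 16), batch_size=1))
    target = qval.copy()
    target[0][a] = reward_of(s_next) + gamma * maxQ    # reward_of is a hypothetical helper
    return target        # then fit with: model.fit(s.reshape(1, 16), target, batch_size=1, epochs=1)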

And we can add the “experience replay” method, to prevent catastrophic forgetting*, as follows (a minimal buffer sketch appears after the footnote below):

  1. In state s, take action a, observe new state s(t+1) and reward r(t+1). Store this as a tuple (s,a,s(t+1),r(t+1)) in a list.
  2. Continue to store each experience in this list until we have filled the list to a specific length (up to you to define).
  3. Once the experience replay memory is filled, randomly select a subset (e.g. 40).
  4. Iterate through this subset and calculate value updates for each; store these in a target array (e.g. y_train) and store the states of each memory in X_train.
  5. Use X_train and y_train as a mini-batch for batch training. For subsequent epochs where the array is full, just overwrite old values in our experience replay memory array.

*Catastrophic forgetting is a problem that arises in neural-network implementations when we feed in states one after another, without much randomization in the batches: the network keeps overwriting what it learned earlier with whatever it has seen most recently. In our game the goals are pretty static, and hence this method won’t help much. In other, more complex games, where the same action sequence from the same initial state might not lead to victory even if it once did, the network gets confused and learning isn’t effective UNLESS the training batches are randomized.
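
For completeness, here is what such a replay buffer could look like. This is an illustrative sketch only, since the training code below does not use replay (the goals in this game are static, as noted above):

import random
from collections import deque

replay = deque(maxlen=2000)          # experience replay memory; old entries are overwritten once full

def remember(s, a, s_next, r):
    # store each transition as the tuple described in step 1
    replay.append((s, a, s_next, r))

def sample_batch(batch_size=40):
    # random subset as in step 3; decorrelates the mini-batch used for training
    if len(replay) < batch_size:
        return []
    return random.sample(list(replay), batch_size)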

We shall use the following reward function for getting the target outputs:

We want our bot to prefer a victory first, then a draw, then a loss, and to avoid an unacceptable (illegal) move altogether. The algorithm works well if the reward function is coherent with what we want our bot to learn, in that order of priority.
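
Concretely, reading the values off the training code further below, the reward scheme works out to:

  +10  for the move that ends the game in the bot's favour
   +1  for any other legal, non-terminal move
   -2  for the move that allowed the opponent to win on the next turn
   -4  as the target value assigned to illegal moves (already-occupied squares)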

Neural Networks competing with each other

So now that the basics are out of the way, let’s get more technical. Q-learning has to be implemented separately for each game, keeping the foundational principles in mind. Since our game involves two players playing in turns, we need two neural networks to achieve learning. I’m using Keras to implement two sequential, fully connected models that will be fighting each other with interdependent cost functions (more on this in the section after the code). Each network has the structure and constraints that I introduced previously, with the reward function making sure that it learns the right things.

Here is the code for the two networks simultaneously competing to learn the game:

import Qtable as table
import numpy as np
import random

from keras.models import load_model
from keras.models import Sequential
from keras.layers.core import Dense,Activation
from keras.optimizers import RMSprop
import time
LOAD_SAVED_MODELS = True    # set to False on the first run to build fresh, untrained networks

if not LOAD_SAVED_MODELS:
    
    model1 = Sequential()
    model1.add(Dense(164, init='lecun_uniform', input_shape=(16,)))
    model1.add(Activation('relu'))

    
    model1.add(Dense(256, init='lecun_uniform'))
    model1.add(Activation('relu'))
    
    model1.add(Dense(256, init='lecun_uniform'))
    model1.add(Activation('relu'))

    
    model1.add(Dense(16, init='lecun_uniform'))
    model1.add(Activation('linear')) 
    
    rms = RMSprop()
    model1.compile(loss='mse', optimizer=rms)
    
    model2 = Sequential()
    model2.add(Dense(164, init='lecun_uniform', input_shape=(16,)))
    model2.add(Activation('relu'))

    
    model2.add(Dense(256, init='lecun_uniform'))
    model2.add(Activation('relu'))
    
    model2.add(Dense(256, init='lecun_uniform'))
    model2.add(Activation('relu'))

    
    model2.add(Dense(16, init='lecun_uniform'))
    model2.add(Activation('linear')) 
    
    rms = RMSprop()
    model2.compile(loss='mse', optimizer=rms)
    
else:
    # load the previously trained networks and continue from there
    model1 = load_model('nnmodel11.h5')
    model2 = load_model('nnmodel22.h5')
    

totloss = []
totgames = []
gamesperset = 1000
gamesets = 0         # number of training sets to run; leaving this at 0 skips training and goes straight to the demo game below
gamma = 0.9          #since it may take several moves to goal, making gamma high
epsilon = 0.5
#start = time.time()
for w in range(gamesets):
    for q in range(gamesperset):
        state = table.RandomState()
        counter = 0
        loss = 0
        states = []
        actions = []
        rewards =[]
        maxQ=[]
        newQallowed = []
        firstmove = True
        while(not(table.TermStateCheck(state))):
            states.append(state)
            qval = model1.predict(state.reshape(1,16), batch_size=1)
            if (random.random() < epsilon): 
                action = table.TakeAction1(state)
            else: 
                qvalsort = -np.sort(-qval)
                for k in range(16):
                    action = table.getActionFromNumber1(np.nonzero(qval==qvalsort[0][k])[1][0])
                    if(table.ActionAllowed(state,action)):
                        break
            actions.append(action)
            new_state = np.add(state, action)
            
            # if model1's move ends the game, there is no next state to bootstrap from (maxQ = 0)
            if(table.TermStateCheck(new_state)):
                maxQ.append(0)
                break
            

            newQ = model2.predict(new_state.reshape(1,16), batch_size=1)
            for k in range(16):
                if table.ActionAllowed(new_state,table.getActionFromNumber2(k)):
                    newQallowed.append(newQ[0][k])
            maxQ.append(np.max(np.asanyarray(newQallowed)))
            
            newQallowed = []
            firstmove = False
            
            states.append(new_state)
            qval = model2.predict(new_state.reshape(1,16), batch_size=1)
            if (random.random() < epsilon): 
                action = table.TakeAction2(new_state)   # random move for player 2, taken on the current board
            else: 
                qvalsort = -np.sort(-qval)
                for k in range(16):
                    action = table.getActionFromNumber2(np.nonzero(qval==qvalsort[0][k])[1][0])
                    if(table.ActionAllowed(new_state,action)):
                        break
            actions.append(action)
            new_state_2 = np.add(new_state, action)
            
            if(table.TermStateCheck(new_state_2)):
                maxQ.append(0)
                break
    
            
            newQ = model1.predict(new_state_2.reshape(1,16), batch_size=1)
            for k in range(16):
                if table.ActionAllowed(new_state_2,table.getActionFromNumber1(k)):
                    newQallowed.append(newQ[0][k])
            maxQ.append(np.max(np.asanyarray(newQallowed)))  

            newQallowed = []
            state = new_state_2
            
        if(len(states)==1):
            rewards.append(10)                    # the very first move from the random start already ended the game
        else:
            for c in range(len(states)):
                rewards.append(1)                 # ordinary moves get a small positive reward
            rewards[len(rewards)-1] = 10          # the final move that ended the game
            rewards[len(rewards)-2] = -2          # the move just before it, which allowed the loss

            
        for j in range(len(states)):
            state = states[j]
            action = actions[j]
            reward = rewards[j]
            maxQval = maxQ[j]
 
            if j%2==0:
                y = np.zeros((1,16))
                qval =  model1.predict(state.reshape(1,16), batch_size=1)
                y[:] = qval[:]
                # push illegal moves (already-occupied squares) down in the target
                for i in range(16):
                    action1 = table.getActionFromNumber1(i)
                    if(not(table.ActionAllowed(state,action1))):
                        y[0][i] = -4
                # the losing move keeps its raw -2 reward, with no bootstrapped term
                if reward != -2:
                    update = (reward + (gamma * maxQval))
                else:
                    update = reward
                y[0][table.getActionNumber(action)] = update
                print("Game #: %s" % (q,))
                print("Gameset #: %s" %(w+1,))
                print("m1")
                model1.fit(state.reshape(1,16), y, batch_size=1, epochs=1, verbose=1)
            
            else:
                y = np.zeros((1,16))
                qval =  model2.predict(state.reshape(1,16), batch_size=1)
                y[:] = qval[:]
                for i in range(16):
                    action2 = table.getActionFromNumber2(i)
                    if(not(table.ActionAllowed(state,action2))):
                            y[0][i] = -4
                if reward != -2: 
                   update = (reward + (gamma * maxQval))
                else: 
                   update = reward
                y[0][table.getActionNumber(action)] = update 
                print("Game #: %s" % (q,))
                print("Gameset #: %s" %(w+1,))
                print("m2")
                model2.fit(state.reshape(1,16), y, batch_size=1, epochs=1, verbose=1)
                    
        if epsilon > 0.1:        
            epsilon -= (1/gamesperset)
                
    model1.save('nnmodel11.h5')
    model2.save('nnmodel22.h5')
#print(time.time()-start)    
#    pyplot.yscale('log')
#    pyplot.xscale('log')
#plt.plot(totgames, totloss, 'o')
   
wins = 0
for i in range(1):    # play one demonstration game against a random opponent; increase the range to estimate a win-rate
    state = np.zeros((4,4),dtype=np.int32)
    print(state)
    while(not(table.TermStateCheck(state))):
        action = table.getActionFromNumber1(np.argmax(model1.predict(state.reshape(1,16),batch_size=1)))
#        action = table.TakeAction1(state)
        state =  np.add(state,action)
        print(state)
        if (table.TermStateCheck(state)):
            wins+=1
            break
        state = np.add(state,table.TakeAction2(state))
        print(state)
print(wins)

So that’s about it! One thing to note is that I tinkered with the epoch numbers, iteration counts and the learning rates of each model individually, and the above is the best combination that I could come up with. Any suggestions and/or improvements would be greatly appreciated!

How is this working? What’s really happening?

In every instance of training, we are maintaining the relation between the Q-value of a state-action pair and the next state, simply because that is what we have set our target output to be. We fit both networks so that Q(s,a) = reward(s’) + (gamma * maxQ(s’, all a’)). Thus the link between consecutive positions is maintained. In the case where s’ is terminal, we reward s with +10 and stop there. Similarly, we have defined the terminal-state rewards for the preceding states’ Q-values. This is consistent with what we ACTUALLY want to approximate.

Hence, we’re maintaining the link between states, and keeping the rewards consistent with the true Q-table. This leads to an overall approximation of the Q-table itself. The randomized actions and experience replay are just tools that enhance this basic process.

The math behind tabular Q-learning is very concrete: it guarantees that the Q-values converge, with no odd cases like values growing or shrinking towards infinity. With a neural network standing in for the table, those guarantees become weaker, but for a game as small and well-defined as this one, the approximation converges reliably in practice.

Results

The results of the above implementation are quite interesting. The loss does not decrease monotonically the way it does in a typical supervised fitting run. It jumps up and down as new game states are discovered, but the peaks and the overall running average shrink over time. In this exact case, the neural networks played around 30,000 games, and then produced a win-rate of 92% against a random player, starting from an empty board. It learned the rules of the game, and learned actions that lead to victories, in a training time of approximately 4-5 minutes.

In the following graph, each point represents the average loss over one game played by model1. Notice the sudden jumps when new states are introduced to the network, and the general downward trend as it gets closer and closer to the actual Q-table values.

The following is a randomly picked example that ‘model1’ (X player) played vs a randomized player:

[ _ _ _ _ ]
[ _ _ _ _ ]
[ _ _ _ _ ]
[ _ _ _ _ ]

======================================

[ _ _ _ _ ]
[ _ _ _ _ ]
[ _ X _ _ ]
[ _ _ _ _ ]

======================================

[ _ _ _ _ ]
[ _ O _ _ ]
[ _ X _ _ ]
[ _ _ _ _ ]

======================================

[ _ _ _ _ ]
[ X O _ _ ]
[ _ X _ _ ]
[ _ _ _ _ ]

======================================

[ _ O _ _ ]
[ X O _ _ ]
[ _ X _ _ ]
[ _ _ _ _ ]

======================================

[ _ O _ _ ]
[ X O _ _ ]
[ _ X _ _ ]
[ _ _ X _ ] (model1 wins)

=======================================

Clearly, the bot has learnt the rules of the game and how to win. From my observations, its defensive play can still be improved, and further optimization of its moves is possible. In my opinion, further micro-adjustments to the neural network configuration and training rates could lead to an even higher win-rate, and possibly even a perfect bot.

That concludes my current experiments in the field of Q-learning with neural networks.

Suggested Reads and study articles:

https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf - Atari with Q-learning.
https://www.youtube.com/watch?v=Vz5l886eptw - Hvass Labs tutorials.
http://mnemstudio.org/path-finding-q-learning-tutorial.htm - For clarifications on Q-table formation.