RLLib: C{++} Template Library to Predict, Control, Learn Behaviors, and Represent Learnable Knowledge using On/Off Policy Reinforcement Learning

RLLib: C`++` Template Library to Predict, Control, Learn Behaviors, and Represent Learnable Knowledge using On/Off Policy Reinforcement Learning

Saminda Abeyruwan (Last modified: 04/25/2014)

Abstract

RLLib is a lightweight C++ template library that implements incremental, standard, and gradient temporal-difference learning algorithms in reinforcement learning. It is an optimized library for robotic applications and embedded devices that operates under fast duty cycles (e.g., ≤ 30 ms). RLLib has been tested and evaluated on RoboCup 3D soccer simulation agents, physical NAO V4 humanoid robots, and Tiva C series launchpad microcontrollers to predict, control, learn behaviors, and represent learnable knowledge.

1 Overview

The RoboCup initiative presents a real-time, dynamic, complex, adversarial, and stochastic multi-agent environments for agents to learn and reason about different problems [6]. Reinforcement Learning (RL) is a framework that provides a set of tools to design sophisticated and hard-to-engineer behaviors [7]. RL agents naturally formalize their learning goals in two layers:

physical layers - where controlling or predicting of functions relate to sensorimotor interactions such as walking, kicking, etc.; and
decision layers - with dynamically emerging behaviors through actions, options, or knowledge representations.

Therefore, the RL paradigm includes controlling and prediction; it also provides means to represent highly expressive knowledge using General Value Functions (GVFs) from sensorimotor streams [10]. RLLib is a lightweight C++ template library that implements incremental, standard, and gradient temporal-difference learning algorithms in RL to effectively solve problems defined in physical and decision layers. The library is designed and written specifically for robotic applications and embedded devices such as RoboCup 3D soccer simulation agents¹, physical NAO V4 humanoid robots², and Tiva C series launchpad microcontrollers³ that operate under fast duty cycles. The library is released on May 17, 2013⁴ under the open source license "Apache License, Version 2.0", and the latest release version is v2.0.

1.1 Related Work

There are many RL development platforms published by researches for specific RL problems and for general use. Notably, the most common RL development platform is RL-GLUE⁵ with RL-LIBRARY⁶ packages [25]. It provides a language-independent platform over text based massage passing among agents and environments. RL-GLUE is being used in RL competitions in ICML and NIPS workshops. PyBrain⁷[15] and RLPy⁸ are libraries written in Python to formulate and learn from RL problems. RL toolbox⁹, libpgrl¹⁰, YORLL¹¹, and rllib¹²[] are C++ based platforms to develop RL algorithms in different scenarios, while CLSquare¹³[] is a standardized platform for testing RL problems with on-policy batch controllers. PIQLE¹⁴[5], MMF¹⁵, QCON¹⁶, and RLPark¹⁷ are Java platforms that model and learn from RL problems. MDP Toolbox¹⁸ is an Octave based RL development platform. dotRL¹⁹[14] is a .NET platform for rapid RL method development and validation.

RLLib is significantly differs from the existing platforms because of the following:

the library is written and designed specifically for applications and devices where limited computational resources are available. Therefore, the memory footprint as well as the computational requirements that are needed by the library has been optimized as minimal as possible;
a configurable C++ template functions to synchronize with application or device hardware requirements;
the library emphasizes more on learnable knowledge representation and reasoning from sensorimotor streams;
a clean and transparent API exists that enables users to model their RL problems easily; and
a self-contained C++ template library covers plethora of incremental, standard, and gradient temporal-difference learning algorithms in RL that is published to-date (e.g., [18,24]). Our library has been successfully used in [1] to learn role assignment in RoboCup 3D soccer simulation agents.

1.2 Features

RLLib features and implements:

off-policy prediction algorithms: (GTD(λ) and GQ(λ)) [10];
off-policy control algorithms: (Greedy-GQ(λ)) [10] and (softmax GQ(λ) and Off-PAC) [4];
on-policy algorithms: (TD(λ), SARSA(λ), expected SARSA(λ), and actor-critic (continuous and discrete actions, discounted, averaged reward settings, etc.)) [21], (alpha bound TD(λ) and SARSA(λ)) [3], and (true TD(λ) and SARSA(λ)) [17];
incremental supervised learning algorithms: (adaline) [2], (IDBD and semi-linear IDBD) [22], and (auto-step) [13];
discrete and continuous policies: (random, random X percent bias, greedy, ϵ-greedy, Boltzmann, Normal, and softmax);
sparse feature extractors (e.g., tile coding) with pluggable hash functions [21];
an efficient implementation of the dot product for sparse coder based feature representations;
benchmark environments: (mountain car, mountain car 3D, swinging pendulum, helicopter, and continuous grid world) [21];
optimization for very fast duty cycles (e.g., using culling traces, RLLib is tested on the RoboCup 3D simulator agent, physical NAO V4 robot (cognition thread), and Tiva C series launchpad microcontrollers);
a framework to design complex behaviors, predictors, controllers, and represent highly expressive learnable knowledge representations in RL using GVFs;
a framework to visualize benchmark problems; and
a plethora of examples demonstrating on-policy and off-policy control experiments.

2 Prelude

The standard RL framework²⁰ models an AI agent and an environment interactions in discrete time steps t=1,2,3,…. The agent senses the state of the world at each time step S_t ∈ S and selects an action A_t ∈ A. One time step later the agent receives a scalar reward R_t+1 ∈ R, and senses the state S_t+1 ∈ S. The rewards are generated according to a reward function r:S_t+1→ R. The objective of the standard RL framework is to learn the stochastic action-selection policy π: S ×A → [0,1], that gives the probability of selecting each action in each state, π(s, a) = π(s|a) = P(A_t = a|S_t = s), such that the agent maximizes rewards summed over the time steps [21].

Recently, within the context of the RL framework, a knowledge representation language has been introduced, that is expressive and learnable from sensorimotor data. This representation is directly usable for robotic soccer as agent-environment interactions are conducted through perceptors and actuators. In this approach, knowledge is represented as a large number of approximate value functions each with its:

own policy;
a pseudo-reward function;
a pseudo-termination function; and
a pseudo-terminal-reward function.

The interpretation of the approximate value function as a knowledge representation language grounded on information from perceptors and actuators is defined as: the knowledge expressed as an approximate value function is true or accurate, if its numerical values matches those of the mathematically defined value function it is approximating. Therefore, a value function asks a question, and an approximate value function provides an answer to that question. Based on prior interpretation, the standard RL framework extends to represent learnable knowledge as follows. The standard RL framework extends to include a terminal-reward-function, z:S → R, where z(S_t) is the terminal reward received when the termination occurs in state S_t. In the RL framework, γ ∈ [0,1) is used to discount delayed rewards. Another interpretation of the discounting factor is a constant probability of 1−γ termination of arrival to a state with zero terminal-reward. This factor is generalized to a termination function γ:S → [0,1], where 1− γ(S_t) is the probability of termination at state S_t, and a terminal reward z(S_t) is generated.

In continuous state or action spaces, approximate value functions are learned using function approximation. They are learned using efficient on- or off-policy learning algorithms in RL. We briefly introduce some of the important concepts related to the GVFs. The complete information about the GVFs are available in [10,11,12,23].

2.1 Off-Policy Action-Value Methods for GVFs

The first method to learn about GVFs, from off-policy experiences, is to use action-value functions. Let G_t be the complete return from state S_t at time t, then the sum of the rewards (transient plus terminal) until termination at time T is: G_t = ∑_k=t+1^T r(S_k) + z(S_T). The action-value function is q^π(s,a) = E(G_t|S_t = s, A_t = a, A_t+1:T−1 ∼ π, T ∼ γ), where, q^π:S×A→ R. This is the expected return for a trajectory started from state s, and action a, and selecting actions according to the policy π, until termination occurs with γ. We approximate the action-value function with ∧q:S×A→ R. Therefore, the action-value function is a precise grounded question, while the approximate action-value function offers the numerical answer.

The GVFs are defined over four functions: π, γ, r,and z. The functions r and z act as pseudo-reward and pseudo-terminal-reward functions respectively. Function γ is also in pseudo form as well. However, γ function is more substantive than reward functions as the termination interrupts the normal flow of state transitions. In pseudo termination, the standard termination is omitted. In robotic soccer, the base problem can be defined as time until a goal is scored by either the home or the opponent teams. We can consider a pseudo-termination has occurred when a striker is changed. GVF with respect to a state-action function is defined as: q^{π,γ, r,z}(s,a) = E(G_t|S_t = s, A_t = a,A_t+1:T−1 ∼ π, T ∼ γ). The four functions, π, γ, r,and z, are the question functions to GVFs, which in return defines the general value function's semantics.

An RL agent learns an approximate action-value function, ∧q, using the four auxiliary functions π,γ, r and z. We assume that the state space is continuous and the action space is discrete. We approximate the action-value function using a linear function approximation. A practitioner can use a feature extractor ϕ: S_t ×A_t → {0,1}^N, N ∈ N, built on tile coding [21] to generate feature vectors from state variables and actions. This is a sparse vector with a constant number of "1" features, hence, a constant norm. In addition, tile coding has the key advantage of real-time learning and to implement computationally efficient algorithms to learn approximate value functions. In linear function approximation, there exists a weight vector, θ ∈ R^N, N ∈ N, to be learned. Therefore, the approximate GVFs are defined as: ∧q(s,a,θ)=θ^Tϕ(s,a), such that, ∧q:S ×A×R^N → R. Weights are learned using off-policy algorithms implemented in section 1. The algorithms learn stably and efficiently using linear function approximation from off-policy experiences. Off-policy experiences are generated from a behavior policy, π_b, that is different from the policy being learned about named as target policy, π. Therefore, one could learn multiple target policies from the same behavior policy.

Example:

  #incude  "Math.h"
  #include "ControlAlgorithm.h"
  #include "MountainCar.h"
  #include "RL.h"
  
  using namespace RLLib;
  
  Random<double>* random = new Random<double>;
  RLProblem<double>* problem = new MountainCar<double>;
  
  Hashing<double>* hashing = new MurmurHashing<double>(random, 1000000);
  Projector<double>* projector = new TileCoderHashing<double>(hashing, problem->dimension(), 10, 10, true);
  StateToStateAction<double>* toStateAction = 
                           new StateActionTilings<double>(projector, problem->getDiscreteActions());
  
  Trace<double>* e = new ATrace<double>(projector->dimension());
  
  double alpha_v = 0.1 / projector->vectorNorm();
  double alpha_w = 0.0001 / projector->vectorNorm();
  double gamma_tp1 = 0.99;
  double beta_tp1 = 1.0 - gamma_tp1;
  double lambda_t = 0.4;
  GQ<double>* gq = new GQ<double>(alpha_v, alpha_w, beta_tp1, lambda_t, e);
  
  Policy<double>* behavior = new RandomPolicy<double>(random, problem->getDiscreteActions());
  Policy<double>* target = new Greedy<double>(problem->getDiscreteActions(), gq);
  
  OffPolicyControlLearner<double>* control = 
	   new GreedyGQ<double>(target, behavior, problem->getDiscreteActions(), toStateAction, gq);

  RLAgent<double>* agent = new LearnerAgent<double>(control);
  Simulator<double>* sim = new Simulator<double>(agent, problem, 5000, 100, 10);
  sim->setTestEpisodesAfterEachRun(true);
  sim->run();
  sim->computeValueFunction();

  // Delete objects

2.2 Off-Policy Policy Gradient Methods for GVFs

The second method to learn about GVFs is using the off-policy policy gradient methods with actor-critic architectures that use a state-value function suitable for learning GVFs. It is defined as: v^{π,γ, r,z}(s) = E(G_t|S_t = s,A_t:T−1 ∼ π, T ∼ γ), where, v^{π,γ, r,z}(s) is the true state-value function. The approximate GVF is defined as: ∧v(s,v)=v^Tϕ(s), where, v ∈ R^N_v, N_v ∈ N, and the functions π,γ, r, and z are defined as in the subsection (2.1). If the target policy π is discrete stochastic, a practitioner can use a Gibbs distribution of the form: π(a | s) = [(e^{u^T ϕ(s,a)})/(∑_be^{u^T ϕ(s, b)})], where, ϕ(s,a) are state-action features for state s, and action a, which are in general unrelated to state features ϕ(s), that are used in state-value function approximation. u ∈ R^N_u,N_u ∈ N, is a weight vector, which is modified by the actor to learn about the stochastic target policy. The log-gradient of the policy at state s, and action a, is: [(∇_u π(a | s))/(π(a | s))] = ϕ(s,a) − ∑_b π(b|s)ϕ(s,b).

Example:

 
  #incude  "Math.h"
  #include "ControlAlgorithm.h"
  #include "MountainCar.h"
  #include "RL.h"
  
  using namespace RLLib;
  
  Random<float>* random = new Random<float>;
  RLProblem<float>* problem = new MountainCar<float>;
  
  Hashing<float>* hashing = new MurmurHashing<float>(random, 1000000);
  Projector<float>* projector = new TileCoderHashing<float>(hashing, problem->dimension(), 10, 10, true);
  StateToStateAction<float>* toStateAction 
			  = new StateActionTilings<float>(projector, problem->getDiscreteActions());

  double alpha_v = 0.05 / projector->vectorNorm();
  double alpha_w = 0.0001 / projector->vectorNorm();
  double lambda = 0.0;  //0.4;
  double gamma = 0.99;
  double alpha_u = 1.0 / projector->vectorNorm();
  
  Trace<float>* critice = new ATrace<float>(projector->dimension());
  OffPolicyTD<float>* critic = new GTDLambda<float>(alpha_v, alpha_w, gamma, lambda, critice);
  
  PolicyDistribution<float>* target = 
    new BoltzmannDistribution<float>(random, problem->getDiscreteActions(), projector->dimension());
  Policy<float>* behavior = new RandomPolicy<float>(random, problem->getDiscreteActions());

  Trace<float>* actore = new ATrace<float>(projector->dimension());
  Traces<float>* actoreTraces = new Traces<float>();
  actoreTraces->push_back(actore);
  ActorOffPolicy<float>* actor = 
                      new ActorLambdaOffPolicy<float>(alpha_u, gamma, lambda, target, actoreTraces);

  OffPolicyControlLearner<float>* control = 
                               new OffPAC<float>(behavior, critic, actor, toStateAction, projector);

  RLAgent<float>* agent = new LearnerAgent<float>(control);
  Simulator<float>* sim = new Simulator<float>(agent, problem, 5000, 100, 10);
  sim->setTestEpisodesAfterEachRun(true);
  //sim->setVerbose(false);
  sim->run();
  sim->computeValueFunction();
  control->persist("visualization/mcar_offpac.data");

  control->reset();
  control->resurrect("visualization/mcar_offpac.data");
  sim->runEvaluate(10, 10);
  
  // Delete objects

2.3 On-Policy Action-Value and Policy Gradient Methods for GVFs

Similar to subsections 2.1 and 2.2, we can use the extended RL framework to learn about GVFs from on-policy experiences. In this formulation, the behavior policy and the target policy are similar, but the experiences are generated according to an exploratory behavior policy. All off-policy algorithms can be modified to be used in on-policy setting, and RLLib provides a transparent API to handle these situations.

Example:

  #incude  "Math.h"
  #include "ControlAlgorithm.h"
  #include "SwingPendulum.h"
  #include "RL.h"
  
  using namespace RLLib;
  
  Random<double>* random = new Random<double>;
  RLProblem<double>* problem = new SwingPendulum<double>;
  Hashing<double>* hashing = new MurmurHashing<double>(random, 1000);
  Projector<double>* projector = new TileCoderHashing<double>(hashing, problem->dimension(), 10, 10, false);
  StateToStateAction<double>* toStateAction 
		       = new StateActionTilings<double>(projector, problem->getContinuousActions());

  double alpha_v = 0.1 / projector->vectorNorm();
  double alpha_u = 0.001 / projector->vectorNorm();
  double alpha_r = .0001;
  double gamma = 1.0;
  double lambda = 0.5;

  Trace<double>* critice = new ATrace<double>(projector->dimension());
  TDLambda<double>* critic = new TDLambda<double>(alpha_v, gamma, lambda, critice);

  PolicyDistribution<double>* policyDistribution 
  = new NormalDistributionScaled<double>(random, 
                                   problem->getContinuousActions(), 0, 1.0, projector->dimension());
  Range<double> policyRange(-2.0, 2.0);
  Range<double> problemRange(-2.0, 2.0);
  PolicyDistribution<double>* acting = 
  new ScaledPolicyDistribution<double>(problem->getContinuousActions(), 
                                                   policyDistribution, &policyRange, &problemRange);

  Trace<double>* actore1 = new ATrace<double>(projector->dimension());
  Trace<double>* actore2 = new ATrace<double>(projector->dimension());
  Traces<double>* actoreTraces = new Traces<double>();
  actoreTraces->push_back(actore1);
  actoreTraces->push_back(actore2);
  ActorOnPolicy<double>* actor = new ActorLambda<double>(alpha_u, gamma, lambda, acting, actoreTraces);

  OnPolicyControlLearner<double>* control
           = new AverageRewardActorCritic<double>(critic, actor, projector, toStateAction, alpha_r);

  RLAgent<double>* agent = new LearnerAgent<double>(control);
  Simulator<double>* sim = new Simulator<double>(agent, problem, 5000, 100, 10);
  sim->setVerbose(true);
  sim->run();

  sim->runEvaluate(100);
  sim->computeValueFunction();

  // Delete objects

3 Platform

The RLLib library closely follows the design principles and recommendations presented in [9] and [20]. The development of the library has taken significant efforts to minimize memory footprint as well as computational requirements that are requested by RL problems. Since RLLib specifically emphasizes on learnable knowledge representation and reasoning, it has been modularized based on on-policy and off-policy RL methods. In addition, RLLib provides implementation of incremental supervised learning algorithms that can be used simultaneously with RL problems.

Listings 1 and 2 provide the minimal pseudocode to setup off-policy RL agents and environments as described in subsections 2.1 and 2.2. The controlling, behavior learning, and learnable knowledge representation problems should use "ControlAlgorithm.h" header file. The algorithms related to predictions and supervised learning problems are implemented in "PredictorAlgorithm.h" and "SupervisedAlgorithm.h" header files. All C++ templates implemented in the library are under the namespace RLLib, and use a single parameter T. This parameter could be a C++ primitive type as shown in listings 1 and 2 or a complex object defined by the user. It is our experience that majority of RL problems can be defined using a primitive type. Devices such as Tiva C launchpad microcontrollers supports single precision floating point representations. Therefore, the templates should be initialized with float primitive type.

// Listing 1 ---------------------------------------	// Listing 2 ----------------------------------------
 1 include "ControlAlgorithm.h"				1 include "ControlAlgorithm.h"
 2 include "RL.h"					2 include "RL.h"
 3 using namespace RLLib;				3 using namespace RLLib;
// RL Problem --------------------------------------	// RL Problem --------------------------------------
 4 RLProblem<double>*  problem = ...;			4 RLProblem<double>*  problem = ...;
// Projector ---------------------------------------	// Projector ---------------------------------------
 5 Hashing<double>* hashing = ...;			5 Hashing<double>* hashing = ...;
 6 Projector<double>*  projector = ...;			6 Projector<double>*  projector = ...;
 7 StateToStateAction<double>*  toStateAction = ...;	7 StateToStateAction<double>*  toStateAction = ...;
// Predictor ---------------------------------------	// Critic ------------------------------------------
 8 Trace<double>* e = ...;				8 Trace<double>* critice = ...;
 9 GQ<double>* gq = ...;				9 GTDLambda<double>* critic = ...;
// Policies $\pi$ and $\pi_b$ ----------------------	// Policies $\pi$ and $\pi_b$ ----------------------
10 Policy<double>*  target = ...;			10 PolicyDistribution<double>*  target = ...;
11 Policy<double>*  behavior =...;			11 Policy<double>* behavior =...; 
// -------------------------------------------------	// Actor -------------------------------------------
12							12 Traces<double>*  actoreTraces = ...;
13							13 ActorOffPolicy<double>*  actor = ...;
// Controller --------------------------------------	// Controller --------------------------------------  
14 OffPolicyControlLearner<double>*  control = ...;	14 OffPolicyControlLearner<double>*  control = ...;
// Runner ------------------------------------------	// Runner ------------------------------------------   
15 RLAgent<double>*  agent = ...;			15 RLAgent<double>*  agent = ...;
16 Simulator<double>*  sim = ...;			16 Simulator<double>*  sim = ...;
17 sim->run(); // OR sim->step();			17 sim->run(); // OR sim->step();
// -------------------------------------------------	// -------------------------------------------------

In line 4, we define the RL problem using an instance of the template RLProblem<T>. This template as well as RLAgent<T> and Simulator<T> (lines 15-16) templates are defined in "RL.h" header file. An instance of the template Projector<T> (line 6) extracts features from the state variables. These features are part of a function approximation architecture (linear or non-linear), i.e., tile based sparse features (e.g., [21,Section 8.3.2]) or compact features (e.g., [8]), suitable for the problem. Some feature extractors require a hashing function, that is defined in line 5. For action-value functions, the features could also include actions (or options), that is defined in StateToStateAction<T> (line 7). In listing 1, lines 8-9 define the predictor, which is used in off-policy controller (line 14), while listing 2 defines the critic that is used in the actor-critic controller. Lines 12-13 in listing 2 define the actor for the prior controller. Lines 10-11 define the target policy (the smooth policy distributions in listing 2) and the behavior policy. Line 15 defines the RL agent that is used in the simulator (line 16) simultaneously with the RL problem.

In simulations (e.g., [19]), specifically when the simulator has the control over the perception-actuation cycles, the runner is executed with run() in line 17. In practical problems (e.g., [16]), where the agents and the environments run on disjoint processes, the runner will wait for the percepts, then updates the agent, which in return transmits the actions to the environment. In such situations, the runner is executed with step() in line 17. It is to be noted that either in simulations or in practical problems the runner will execute the same update steps, such that, the user will experience the same set of execution steps. A practitioner can construct the C++ objects in lines 4-16 in an initialization subroutine, and execute line 17 on a subroutine that calls on every duty cycle.

There are complex combination of RL algorithms used in practice. RLLib allows many combination of these algorithms by changing a few lines of code in listings 1 and 2. For example, in oder to implement on-policy action-value methods in subsection 2.3, a practitioner can change the predictor in line 9 to Sarsa<T> and the controller in line 14 to OnPolicyControlLearner<T> in listing 1. Similarly, different combination of RL algorithms can be included in listing 2 for actor-critic, and parameterized policies.

4 Software and Conclusion

The RLLib platform and its features, that are presented in sections 3 and 1, have been fully implemented, empirically tested, and released to public from project website at: http://web.cs.miami.edu/home/saminda/rllib.html. In addition, RLLib provides testbeds, intuitive visualization tools, and extension points for complex combination of RL algorithms, agents, and environments. Compared to existing platforms, RLLib can be deployed on many different hardware configurations due to its low memory footprint and computational efficiency. We are planning to embed RLLib as a machine learning platform for compatible devices in Energia²¹, an open-source electronics prototyping platform, in future.

References

[1]: Abeyruwan, S., Seekircher, A., Visser, U.: Dynamic Role Assignment using General Value Functions. In: AAMAS 2013, ALA Workshop (2013)
[2]: Bishop, C.M.: Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 1 edn. (2007)
[3]: Dabney, W., Barto, A.G.: Adaptive Step-Size for Online Temporal Difference Learning. In: AAAI (2012)
[4]: Degris, T., White, M., Sutton, R.S.: Off-Policy Actor-Critic. In: Proceedings of the Twenty-Ninth International Conference on Machine Learning (ICML) (2012)
[5]: Delepoulle, F.D.C.S.: PIQLE : A Platform for Implementation of Q-Learning Experiments. In: Neural Information Processing Systems 2005, Workshop on Reinforcement Learning benchmarks and Bake-off II (2005)
[6]: Kitano, H., Asada, M., Kuniyoshi, Y., Noda, I., Osawai, E., Matsubara, H.: RoboCup: A challenge problem for AI and robotics. In: Kitano, H. (ed.) RoboCup-97: Robot Soccer World Cup I, Lecture Notes in Computer Science, vol. 1395, pp. 1-19. Springer Berlin Heidelberg (1998)
[7]: Kober, J., Peters, J.: Reinforcement learning in robotics: A survey. In: Reinforcement Learning, pp. 579-610. Springer (2012)
[8]: Konidaris, G., Osentoski, S., Thomas, P.S.: Value Function Approximation in Reinforcement Learning Using the Fourier Basis. In: AAAI (2011)
[9]: Kovacs, T., Egginton, R.: On the analysis and design of software for reinforcement learning, with a survey of existing systems. Machine Learning 84(1-2), 7-49 (2011)
[10]: Maei, H.R.: Gradient Temporal-Difference Learning Algorithms. PhD Thesis, University of Alberta. (2011)
[11]: Maei, H.R., Sutton, R.S.: GQ(λ): A General Gradient Algorithm for Temporal-Difference Prediction Learning with Eligibility Traces. Proceedings of the 3rd Conference on Artificial General Intelligence (AGI-10) pp. 1-6 (2010)
[12]: Maei, H.R., Szepesvári, C., Bhatnagar, S., Sutton, R.S.: Toward Off-Policy Learning Control with Function Approximation. In: Proceedings of the 27th International Conference on Machine Learning (ICML 2010). pp. 719-726 (2010)
[13]: Mahmood, A.R., Sutton, R.S., Degris, T., Pilarski, P.M.: Tuning-free step-size adaptation. In: Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. pp. 2121-2124. IEEE (2012)
[14]: Papis, B., Wawrzynski, P.: dotRL: A platform for rapid Reinforcement Learning methods development and validation. In: Computer Science and Information Systems (FedCSIS), 2013 Federated Conference on. pp. 129-136 (Sept 2013)
[15]: Schaul, T., Bayer, J., Wierstra, D., Sun, Y., Felder, M., Sehnke, F., Rückstieß, T., Schmidhuber, J.: PyBrain. Journal of Machine Learning Research 11, 743-746 (2010)
[16]: Seekircher, A., Abeyruwan, S., Visser, U.: Accurate Ball Tracking with Extended Kalman Filters as a Prerequisite for a High-Level Behavior with Reinforcement Learning. In: The 6th Workshop on Humanoid Soccer Robots at Humanoid Conference, Bled (Slovenia) (2011)
[17]: van Seijen, H., Sutton, R.S.: True Online TD(λ). In: Proceedings of the 31st International Conference on Machine Learning, Beijing, China (2014)
[18]: Sigaud, O., Buffet, O.: Markov Decision Processes in Artificial Intelligence. John Wiley & Sons (2013)
[19]: Sutton, R.S.: Generalization in reinforcement learning: Successful examples using sparse coarse coding. Advances in neural information processing systems pp. 1038-1044 (1996)
[20]: Sutton, R.S.: A Standard Interface for Reinforcement Learning Software in C++ (Last visited on April 11^th, 2014), http://webdocs.cs.ualberta.ca/ sutton/RLinterface/RLI-Cplusplus.html
[21]: Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press (1998)
[22]: Sutton, R.S., Koop, A., Silver, D.: On the role of tracking in stationary environments. In: Proceedings of the 24th international conference on Machine learning. pp. 871-878. ACM (2007)
[23]: Sutton, R.S., Modayil, J., Delp, M., Degris, T., Pilarski, P.M., White, A., Precup, D.: Horde: A Scalable Real-Time Architecture for Learning Knowledge from Unsupervised Sensorimotor Interaction. In: The 10th International Conference on Autonomous Agents and Multiagent Systems. pp. 761-768. AAMAS '11, International Foundation for Autonomous Agents and Multiagent Systems (2011)
[24]: Szepesvári, C.: Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, Morgan & Claypool Publishers (2010)
[25]: Tanner, B., White, A.: RL-Glue: Language-independent software for reinforcement-learning experiments. The Journal of Machine Learning Research 10, 2133-2136 (2009)

Footnotes:

¹ http://simspark.sourceforge.net/wiki/

² http://www.aldebaran.com/

³ http://www.ti.com/tool/EK-TM4C123GXL

⁴ http://tinyurl.com/rllib-1-0

⁵ http://glue.rl-community.org

⁶ http://library.rl-community.org

⁷ http://pyrain.org/

⁸ http://acl.mit.edu/RLPy/

⁹ http://www.igi.tu-graz.ac.at/gerhard/ril-toolbox/general/overview.html

¹⁰ https://code.google.com/p/libpgrl/

¹¹ http://www.cs.york.ac.uk/rl/software.php

¹² http://malis.metz.supelec.fr/spip.php?article122

¹³ http://ml.informatik.uni-freiburg.de/research/clsquare

¹⁴ http://piqle.sourceforge.net/

¹⁵ http://mmlf.sourceforge.net/

¹⁶ http://sourcef orge.et/p/elsy/wiki/Home/

¹⁷ http://rlpark.github.io/

¹⁸ http://www7.inra.fr/mia/T/MDPtoolbox/

¹⁹ http://sourceforge.net /projects/dotrl/

²⁰We represent random variables by capital letters (e.g., S_t,R_t), vectors by bold faced letters (e.g., θ, v, ϕ), functions by lowercase letters (e.g., v, π), and sets by calligraphic font (e.g., S,A).

²¹ http://energia.nu/

File translated from T_EX by T_TH, version 4.01.
On 25 Apr 2014, 02:58.

RLLib: C++ Template Library to Predict, Control, Learn Behaviors, and Represent Learnable Knowledge using On/Off Policy Reinforcement Learning