PRM and RRT Motion Planning for a LoCoBot

In this assignment the goal was to plan and execute a motion from a starting configuration to a goal configuration for a LoCoBot in the presence of known obstacles in the environment. The provided data includes information about the obstacles (represented as the vertices of a bounding box centered at the geometric center of each obstacle) and the joint limits for each joint of the LoCoBot.

Both the Probabilistic Road Map (PRM) and the Rapidly-Exploring Random Tree (RRT) methods are probabilistic, sampling-based methods for motion planning. In the PRM approach, a random sampler is used to generate viable points in the robot’s configuration space. A viable sample is one that does not violate joint limits and does not collide with obstacles. When a viable sample is found, a local planner is used to connect it to its nearest existing nodes in the roadmap. In this way the roadmap is grown until the desired density of nodes is reached within the configuration space. The resulting graph can then be saved and queried later whenever a motion needs to be performed.
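The sketch below outlines how such a roadmap could be built. It is a minimal illustration rather than the assignment code: the collision checks (`is_config_free`, `is_edge_free`), the `joint_limits` array, and the use of NumPy and NetworkX are assumptions made for the example.

```python
# Minimal PRM construction sketch (illustrative, not the assignment code).
# is_config_free / is_edge_free are placeholders for the bounding-box obstacle
# checks; joint_limits is assumed to be an (n_joints, 2) array of [low, high].
import numpy as np
import networkx as nx

def is_config_free(q):
    """Placeholder: return True if configuration q violates no obstacle."""
    return True

def is_edge_free(q1, q2, steps=20):
    """Placeholder local planner: check interpolated points along the edge."""
    return all(is_config_free(q1 + t * (q2 - q1)) for t in np.linspace(0.0, 1.0, steps))

def build_prm(joint_limits, n_samples=500, k_neighbors=10):
    lo, hi = joint_limits[:, 0], joint_limits[:, 1]
    # Rejection-sample viable configurations inside the joint limits.
    nodes = []
    while len(nodes) < n_samples:
        q = np.random.uniform(lo, hi)
        if is_config_free(q):
            nodes.append(q)
    graph = nx.Graph()
    for i, q in enumerate(nodes):
        graph.add_node(i, config=q)
    # Connect each node to its nearest neighbors with collision-free edges.
    for i, q in enumerate(nodes):
        dists = [np.linalg.norm(q - p) for p in nodes]
        for j in np.argsort(dists)[1:k_neighbors + 1]:
            if is_edge_free(q, nodes[j]):
                graph.add_edge(i, int(j), weight=float(dists[j]))
    return graph, nodes
```

A roadmap built this way can later be queried by connecting the start and goal configurations to their nearest nodes and running a graph search such as A* (for example, `networkx.astar_path`).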

The RRT method is primarily used when paths need to be generated and returned on the fly. A tree is grown from the initial configuration by sampling the configuration space and checking each sample for viability in the same way as in PRM. If a collision-free path exists between the sampled point and the nearest node in the tree, the sample is added to the tree as a valid node. In practice the sampler is biased to sample from regions close to the goal, which makes the tree grow more rapidly towards the goal state.
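A goal-biased RRT extension loop along these lines is sketched below. It reuses NumPy and the placeholder collision checks from the PRM example; the step size, goal bias, and iteration cap are illustrative assumptions.

```python
# Goal-biased RRT sketch (illustrative). Reuses is_config_free / is_edge_free
# from the PRM example; goal_bias is the probability of sampling the goal directly.
def rrt(start, goal, joint_limits, step_size=0.1, goal_bias=0.1, max_iters=5000):
    lo, hi = joint_limits[:, 0], joint_limits[:, 1]
    goal = np.asarray(goal, dtype=float)
    configs = [np.asarray(start, dtype=float)]
    parent = {0: None}  # node index -> parent index
    for _ in range(max_iters):
        # Bias the sampler towards the goal so the tree grows in that direction.
        q_rand = goal if np.random.rand() < goal_bias else np.random.uniform(lo, hi)
        nearest = min(range(len(configs)), key=lambda i: np.linalg.norm(configs[i] - q_rand))
        step = q_rand - configs[nearest]
        q_new = configs[nearest] + step_size * step / (np.linalg.norm(step) + 1e-9)
        if is_config_free(q_new) and is_edge_free(configs[nearest], q_new):
            configs.append(q_new)
            parent[len(configs) - 1] = nearest
            if np.linalg.norm(q_new - goal) < step_size:
                # Walk back up the tree to recover the path from start to goal.
                path, node = [], len(configs) - 1
                while node is not None:
                    path.append(configs[node])
                    node = parent[node]
                return path[::-1]
    return None  # no path found within the iteration budget
```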

The embedded video above shows the result of the generated RRT path.

The clip above shows the robot executing a path that was retrieved from the PRM graph using an A* search.


Reinforcement Learning for a Block Throwing Robot

In this assignment the goal was to train a block throwing robot using reinforcement learning techniques: the Cross Entropy Method, Relative Entropy Policy Search, and Bayesian Optimization using an Upper Confidence Bound (UCB) acquisition function.

Cross Entropy Method

In the Cross Entropy Method, the action “policy” is maintained as a normal distribution with a mean and covariance over a 4-vector space. As shown in the image below, the first two elements of the state vector represented the initial position of the robotic arm and the last two elements represented its final position.

State Vector as Initial and Final Configurations of the Arm

The reward function used here was the difference between the desired projectile position and the actual projectile position after simulation. The “learning” was done by first initializing the normal distribution with an approximate mean and covariance for the policy 4-vector. Random samples drawn from this distribution were then used to simulate “N” throws in V-Rep/CoppeliaSim, and the corresponding projectile-distance reward metrics were recorded. The policy samples corresponding to the top “k” rewards were then used to define the new mean and covariance. This entire process was repeated for a fixed number of iterations, until the distribution converged to a policy that consistently generated a high reward.
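A minimal sketch of this loop is shown below. The simulator rollout is replaced by a placeholder `simulate_throw` function (the real rewards would come from the V-Rep/CoppeliaSim throws described above), and the sample counts and iteration limits are illustrative assumptions.

```python
# Cross Entropy Method sketch for the 4-vector throwing policy (illustrative).
import numpy as np

def simulate_throw(policy):
    """Placeholder for a simulator rollout; returns a scalar reward."""
    return -np.linalg.norm(policy)  # stand-in for the projectile-distance reward

def cem(mean, cov, n_samples=50, n_elite=10, n_iters=20):
    for _ in range(n_iters):
        # Sample N candidate policies from the current normal distribution.
        samples = np.random.multivariate_normal(mean, cov, size=n_samples)
        rewards = np.array([simulate_throw(s) for s in samples])
        # Keep the top-k ("elite") samples and refit the mean and covariance to them.
        elite = samples[np.argsort(rewards)[-n_elite:]]
        mean = elite.mean(axis=0)
        cov = np.cov(elite, rowvar=False) + 1e-6 * np.eye(len(mean))
    return mean, cov
```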

Relative Entropy Policy Search

The relative entropy policy search method once again updates the mean and covariance of the normal distribution that characterizes the policy; the difference here is the way in which the relative weights for the rewards are computed. In REPS, the rewards are used to minimize a “dual function” parametrized by η. The dual function used for optimization is shown below:

Dual Function used for Optimization

“returns” is an array containing the rewards of the policy rollouts for “n” trials. “R” is the array “returns” minus the maximum reward value in “returns”. “k” is a constant used to scale the bounds for the optimized η. Consider the middle term of the function: for very small values of η, evaluating e^(R/η) yields an array of weights with exaggerated relative magnitudes, while for larger values of η, e^(R/η) yields nearly identical weights even when the rewards differ significantly. Ignoring the first term, which allows the optimization to be scaled, the second and third terms tend to zero as η tends to infinity; it is the first term that prevents η from blowing up.
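The sketch below illustrates this optimization. The exact dual function used in the assignment is the one shown in the figure; this example assumes a common REPS-style form, k·η + η·log(mean(e^(R/η))) with R shifted by the maximum return, and uses SciPy's Nelder-Mead minimizer purely for illustration.

```python
# REPS weight computation sketch (assumed dual form, not the figure's exact one).
import numpy as np
from scipy.optimize import minimize

def reps_weights(returns, k=1.5):
    R = returns - np.max(returns)  # shift so the best rollout has R = 0

    def dual(eta):
        eta = max(eta[0], 1e-6)    # keep eta positive during the search
        return k * eta + eta * np.log(np.mean(np.exp(R / eta)))

    eta_opt = minimize(dual, x0=[1.0], method="Nelder-Mead").x[0]
    weights = np.exp(R / max(eta_opt, 1e-6))
    return weights / np.sum(weights), eta_opt
```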

The following image shows plots of the dual function with k=0 in the left column and k=1.5 in the right column. For k=0, the value of η that minimizes the dual function is large, while for k=1.5 the minimizing η is near zero. This implies that setting k=0 leads to weights with nearly identical relative magnitudes, while k=1.5 leads to weights with exaggerated relative differences. The rows correspond to increasing reward values from top to bottom: the first row has the lowest rewards for the policy rollout while the last row has the best performance.

Behavior of Dual Function with η and for Different Reward Values

The weights computed from the term e^(R/η) were then used to find the new weighted mean and covariance over the samples used for the policy rollout.
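Continuing the NumPy-based sketches above, the weighted refit could look like the following; the small diagonal jitter is an assumption added for numerical stability.

```python
# Weighted refit of the policy Gaussian using the REPS weights (illustrative).
def reps_update(samples, weights):
    mean = np.average(samples, axis=0, weights=weights)
    centered = samples - mean
    cov = (weights[:, None] * centered).T @ centered / np.sum(weights)
    return mean, cov + 1e-6 * np.eye(samples.shape[1])
```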

Bayesian Optimization using an Upper Confidence Bound

In the Bayesian Optimization approach, a Gaussian Process was used to capture the simulator behavior, with an Upper Confidence Bound acquisition function used for maximization. The process was to first sample two states from an initial policy, roll out the behavior on the simulator, and use the results as the first two data points for the Gaussian Process (GP). Once an initial GP was obtained, “k” sampled policies were evaluated on the GP and the results recorded. The reward function used was the Upper Confidence Bound acquisition function shown below:

Upper Confidence Bound Acquisition Function

The UCB approach can be biased to favor either an exploratory policy search or a high-performing policy search through the term λ. A higher λ favors more exploration by emphasizing the standard deviation term of the GP, while a lower λ favors higher performance by reducing the contribution of the standard deviation term. In this solution λ was set to 1 for an equal balance. The “k” samples and the rewards obtained from the GP were then fitted using CEM as explained in the earlier section, giving a new policy distribution based on the GP. This process of drawing “k” samples from the policy, evaluating them on the GP, and fitting the policy based on the GP results was repeated “n” times. After the “n” iterations of CEM policy fitting, the new policy mean was rolled out on the simulator and the result recorded. This data pair was added to the GP’s training set and the GP retrained. The GP itself was retrained with additional samples “m” times, repeating the entire sampling and CEM fitting procedure each time.
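The sketch below puts these steps together under stated assumptions: scikit-learn's `GaussianProcessRegressor` stands in for the GP, the placeholder `simulate_throw` from the CEM sketch stands in for the simulator rollout, the sample counts are illustrative, and the acquisition score is the standard UCB form (predicted mean plus λ times the predicted standard deviation) with λ = 1 as described above.

```python
# Bayesian optimization sketch with a UCB acquisition (illustrative).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def ucb(gp, policies, lam=1.0):
    mu, sigma = gp.predict(policies, return_std=True)
    return mu + lam * sigma  # lambda trades exploration (sigma) against exploitation (mu)

def bayes_opt(mean, cov, m_iters=20, k_samples=100, cem_iters=5, n_elite=10):
    # Seed the GP with two simulator rollouts drawn from the initial policy.
    X = np.random.multivariate_normal(mean, cov, size=2)
    y = np.array([simulate_throw(x) for x in X])
    for _ in range(m_iters):
        gp = GaussianProcessRegressor().fit(X, y)
        # Fit the policy to the GP surrogate with CEM, scoring samples by UCB.
        for _ in range(cem_iters):
            samples = np.random.multivariate_normal(mean, cov, size=k_samples)
            scores = ucb(gp, samples, lam=1.0)
            elite = samples[np.argsort(scores)[-n_elite:]]
            mean = elite.mean(axis=0)
            cov = np.cov(elite, rowvar=False) + 1e-6 * np.eye(len(mean))
        # Roll out the new policy mean on the simulator and retrain the GP.
        reward = simulate_throw(mean)
        X, y = np.vstack([X, mean]), np.append(y, reward)
    return mean, cov
```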


Shelf Stocking Robot aka Shelf Help | Course Project

The Robot Autonomy course also had an associated course project, which was implemented in teams of four. The following slide deck and video clip show the performance of the shelf stocking robot created by our team. The platform used was a Franka Emika Panda robotic arm with seven degrees of freedom.