agent Af generates rewards for itself:
  if (just picked up food) reward = 0.7 else reward = 0
agent An generates rewards for itself:
  if (just arrived at nest) reward = 0.1 else reward = 0
agent Ap generates rewards for itself:
  if (just shook off predator (no longer visible)) reward = 0.5 else reward = 0
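As a minimal sketch, the three reward functions could be written as below. The boolean event flags are hypothetical stand-ins for whatever the simulated world reports at each timestep; only the reward values are taken from the text.

```python
def reward_af(just_picked_up_food: bool) -> float:
    """Af is rewarded only on the step where food is picked up."""
    return 0.7 if just_picked_up_food else 0.0


def reward_an(just_arrived_at_nest: bool) -> float:
    """An is rewarded only on the step where the creature arrives at the nest."""
    return 0.1 if just_arrived_at_nest else 0.0


def reward_ap(just_shook_off_predator: bool) -> float:
    """Ap is rewarded only on the step where the predator drops out of sight."""
    return 0.5 if just_shook_off_predator else 0.0
```

Each agent sees only its own reward, which is why each one, left alone, ignores the concerns of the other two.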
Each timestep, the creature senses a state x, each agent inside the creature suggests an action (there may be only one agent inside the creature), some agent Ak wins the internal competition and has its action a executed, then the creature senses a new state y. The caption line of the movies shows each step:
x [Ak] a -> y
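One timestep of this loop might be sketched as follows. The function names `suggest`, `compete` and `execute` are assumptions made for illustration; the text does not say how the simulator is organised internally.

```python
from typing import Callable, Dict, Hashable, Tuple

State = Hashable
Action = Hashable


def step(
    x: State,
    suggest: Dict[str, Callable[[State], Action]],
    compete: Callable[[State, Dict[str, Action]], str],
    execute: Callable[[State, Action], State],
) -> Tuple[str, Action, State, str]:
    """Run one timestep and return (winner, action, new state, caption line)."""
    # Every agent inside the creature suggests an action for state x.
    suggestions = {name: policy(x) for name, policy in suggest.items()}
    # Some agent Ak wins the internal competition.
    winner = compete(x, suggestions)
    a = suggestions[winner]
    # The winner's action is executed and the creature senses a new state y.
    y = execute(x, a)
    return winner, a, y, f"{x} [{winner}] {a} -> {y}"
```

With a single agent inside the creature, `compete` simply returns that agent's name every time.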
Af movie, 100 steps.
Af senses the direction of visible food within a small radius
(including a value for "none visible").
By Q-learning,
Af builds up these Q-values.
These values mean that it learns to seek out food when the creature is not carrying any,
but once it is carrying food it is at a loss as to what to do.
The only way it can gain any future rewards is to lose the piece of food at the nest,
but it cannot learn how to do this because it does not sense the nest.
So it just wanders about.
If it should accidentally wander into the nest and lose its food,
it immediately sets off in search of more, and once successful, will be aimless again.
And so on. It completely ignores the predator.
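For reference, a standard tabular Q-learning update looks like the sketch below. The learning rate `alpha`, the discount `gamma` and the state and action labels are illustrative assumptions, not the values used to produce the movies.

```python
from collections import defaultdict


def q_update(Q, x, a, r, y, actions, alpha=0.5, gamma=0.6):
    """One tabular Q-learning update for the transition x --a--> y with reward r."""
    # Best value the agent believes it can obtain from the new state y.
    best_next = max(Q[(y, b)] for b in actions)
    # Move Q(x, a) part of the way towards r + gamma * best_next.
    Q[(x, a)] = (1 - alpha) * Q[(x, a)] + alpha * (r + gamma * best_next)
    return Q[(x, a)]


# Hypothetical example: Af sees food to the north, moves north, picks it up.
Q = defaultdict(float)
q_update(Q, "food_north", "N", 0.7, "carrying", ["N", "S", "E", "W"])
```

Because Af's states say nothing about the nest, no sequence of such updates can teach it to head there once it is carrying food.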
Next we watch the creature under the control of agent An:
An movie, 100 steps.
An senses the direction of the nest within a small radius.
By Q-learning, An builds up these Q-values.
If the nest is not visible, An wanders randomly.
Once it is visible, An heads straight to it and then, instead of staying put,
learns to jump out and back in so it can get that "just arrived at nest" reward again and again!
It is happy maximising its rewards, ignoring both food and predators.
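A small worked comparison shows why hopping in and out beats staying put. The discount factor 0.9 and the 100-step horizon are assumptions chosen only for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a sequence of rewards."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


GAMMA = 0.9   # assumed discount factor, not taken from the text
STEPS = 100   # assumed horizon

# Arrive once and stay: the "just arrived at nest" reward is paid only once.
stay = [0.1] + [0.0] * (STEPS - 1)
# Arrive, step out, step back in, ...: the reward is paid every second step.
hop = [0.1 if t % 2 == 0 else 0.0 for t in range(STEPS)]

stay_return = discounted_return(stay, GAMMA)
hop_return = discounted_return(hop, GAMMA)
```

Under these assumptions the hopping policy earns several times the return of staying put, so a reward-maximising An has no reason to sit still.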
Then we watch the creature under the control of agent Ap:
Ap movie, 100 steps.
Ap senses the direction of the predator.
By Q-learning, Ap builds up these Q-values.
If the predator is visible, Ap learns to move away from it
in the broad opposite diagonal direction.
When the predator has gone out of sight, Ap doesn't actually stay put,
but wanders randomly in the hope that the predator comes back into sight
so it can get the "just shook off predator" reward again!
It almost looks as if it is baiting the predator - repeatedly coming near it
and then withdrawing.
It ignores food.
By W-learning, the competing agents build up these W-values. These values mean that Af is generally obeyed if the creature is not carrying food, sometimes with competition from Ap when a predator is visible. If the creature is carrying food, Af has no strong opinions about what to do, and Ap is free to dominate if a predator is visible. If no predator is visible, then Ap has no strong opinions either (apart from not wanting to stay still) and the weak but constant signalling of An is finally audible. The result is a predator-avoiding, food-foraging creature in which, at every timestep, 2 of the agents are not being listened to.
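The competition can be sketched as follows: the agent with the highest W-value in the current state is obeyed, and each agent that was not obeyed moves its W-value towards the loss it suffered, that is, the difference between the Q-value it predicted for its own action and what it actually received. This is one reading of the W-learning rule; the table layout, `alpha` and `gamma` are assumptions.

```python
def winner(W, x, agents):
    """Return the agent with the highest W-value in state x."""
    return max(agents, key=lambda i: W[i].get(x, 0.0))


def w_update(W_i, Q_i, x, a_i, r_i, y, actions, alpha=0.5, gamma=0.6):
    """Update W_i(x) for an agent that was NOT obeyed in state x.

    a_i is the action the agent wanted; r_i and y are the reward and new
    state it actually experienced under the winner's action.
    """
    predicted = Q_i.get((x, a_i), 0.0)
    received = r_i + gamma * max(Q_i.get((y, b), 0.0) for b in actions)
    loss = predicted - received
    W_i[x] = (1 - alpha) * W_i.get(x, 0.0) + alpha * loss
    return W_i[x]
```

An agent that loses little by being disobeyed ends up with a low W-value in that state and stops competing there, which is how An's weak but constant signal only gets through when Af and Ap have nothing at stake.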
We watch the creature under the control of the 3 competing agents Af, An, Ap:
3 agents movie, 300 steps.
Note in the caption line how control switches from agent to agent.
One thing that helps the agents live together successfully is that they are all restless agents.
Not one of them ever wants to stay still, no matter what is happening.
This makes it easy for another agent to suggest a movement somewhere.
We can draw a map of the state space showing how control is divided up.
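One way to tabulate such a map from learned W-values is sketched below. The nested-dictionary layout and the state labels are hypothetical; only the idea that the highest W-value in a state takes control comes from the text.

```python
from collections import Counter


def control_map(W, states):
    """Map each state to the agent with the highest W-value there."""
    return {x: max(W, key=lambda agent: W[agent].get(x, 0.0)) for x in states}


def control_share(cmap):
    """Fraction of the listed states that each agent controls."""
    counts = Counter(cmap.values())
    total = len(cmap)
    return {agent: n / total for agent, n in counts.items()}
```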
3 agents "before" movie.
By Q-learning, rewards are propagated into Q-values,
and by W-learning, the differences between Q-values are propagated into W-values,
until the creature finally
settles down into a steady pattern of behavior:
3 agents "after" movie.
To bundle this animation into an MPEG file, I get gnuplot to dump each plot into its own pbm file. The pbm files can then be strung together frame-by-frame into an MPEG.
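The frame-dumping step could be driven by a generated gnuplot script, roughly as sketched here. The file name patterns and the plot style are assumptions, and the script relies on a gnuplot build that includes the pbm terminal.

```python
def gnuplot_frame_script(n_frames,
                         data_pattern="frame{:04d}.dat",
                         out_pattern="frame{:04d}.pbm"):
    """Build a gnuplot script that writes one pbm image per frame."""
    lines = ["set terminal pbm color"]
    for i in range(n_frames):
        # Direct the next plot into its own pbm file.
        lines.append(f"set output '{out_pattern.format(i)}'")
        lines.append(f"plot '{data_pattern.format(i)}' with points notitle")
    # Close the last output file.
    lines.append("set output")
    return "\n".join(lines) + "\n"


script = gnuplot_frame_script(3)
```

The resulting text can be saved to a file and passed to gnuplot. The encoding of the pbm frames into an MPEG is left out here because the text does not name the encoder used.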
These movies are also available on a "video appendix" deposited with the 1996 version (PhD 20843) of my PhD thesis in the Manuscripts Room of Cambridge University Library.
This VHS video tape plays the 4 movies above in sequence.
First, the creature under the control of agent Af alone.
Then An alone.
Then Ap alone.
Then all 3 competing together in the same body.