Cognitive science makes the claim that intelligent action is centered in the brain/spinal column/neurons. The analogy is that the brain is the hardware on which cognition, consciousness and the like runs as a program.

The Physical Symbol System Hypothesis makes the assertion that intelligent action is the product of a physical symbol system and that therefore the intelligent action that we perform daily is explained as a physical symbol system.

However, what we have in the brain is the "cold porridge" that Hugh Whitemore has Turing mention at the start of the second act of his play. And no matter how we dissect the brain, we are unlikely to find symbol structures. No semantic networks. No frames. Not a script in sight.

Well, says the GOFAI enthusiast, what the physical symbol system hypothesis does is to provide us an explanation for intelligent activity at the algorithm/data structure level, and the neural network that is the brain provides us with an explanation on the implementation level. But that is not a completely satisfactory explanation. In Marr's analysis of an information processor, each level of explanation constrains the ones around it. And until we understand better how a physical symbol system can be implemented on a neural network perhaps we should begin with neural networks. The brain is, after all, a highly interconnected network of a very great many neurons. Perhaps, this second faction argues, we should begin with this model for computation and see where it leads us. Let's begin.

Biologically, a neuron is a unit containing a nucleus, one or more extensions accepting input from other neurons, and one or more extensions producing output to other neurons. With a great many apologies to the biologists, we model this by a unit accepting numeric values from a series of inputs, applying a function of the weighted sum of inputs to produce an output.

(a neat picture would go here if I knew how to do that)

Suppose that we are the first explorers on a rather odd planet. The surface is bleak, and not much is going on of any interest, but there are some interesting creatures. The one that we become most interested in is a slug-like creature which feeds on small green bugs. The life of this slug-like creature is pretty good except for a somewhat larger blue bug which feeds on these interesting slugs.

The slugs have two eye-like organs on what appears to be the head of the slug. One detects color (green and blue), and one, a sonar-like device, senses size. The slug has a foot of sorts which can either propel it towards the object in view and or away from it. Further examination reveals that the nervous system of the slug is really very simple: there are only two neurons in the system! One of them is connected to the two eye-like organs and to the foot. When it is active (that is, when it fires) the slug approaches whatever is in its view (intending to eat it). The other is arranged in exactly the same way, but when it fires, the slug moves away from whatever is in its view.

How does this work? Let's take a very simple view of a slug neuron. It consists of two inputs, some simple processing taking place in the nucleus of the neuron, and an output connected to the tail. Let's consider the neuron controlling whether or not to approach an object in view. To further simplify our discussion, let us suppose that when something green (food) is in view, the "eye" activates its input to the neuron, but otherwise the "eye" does not activate its neuron. For further convenience, we will assign a value of 1 for green, and a value of 0 for anything else. Similarly, the "sonar" connected to the same neuron will activate its input to the neuron under consideration when something large is in view, but will not activate its input to our neuron otherwise.

When do we want the neuron to "fire" - that is, activate the foot into motion towards whatever is in view? We want to do that if the object in view is green, but not large. That is, we want to do that when the "eye" reports a 1 (indicating a green object), and when the "sonar" reports a 0 (indicating that the object is not large). The safest approach would be not to move towards the object if it is either large or not green. A similar discussion would tell us when we would want to activate the "running away" neuron.

So what should the "processing element" (the nucleus) do with the inputs? The nucleus will weight the two inputs and add them up. If the weighted sum is bigger than zero, the slug will approach the object intending to eat it. Otherwise, this neuron will not instruct the foot to do anything. (Of course, the other neuron may be frantically telling the foot to run away.) How can we model this numerically. A table may help in understanding what to do for this neuron only:

sonar \ eye | 0 | 1 (green) |

0 (small) | do nothing | approach and eat |

1 (large) | do nothing | do nothing |

We can get this same effect by assigning weights to the two inputs, forming the weighted sum of the inputs, and firing if the weighted sum is bigger than zero. Suppose that we assign a weight of +1 (it does not need to be an integer) to the "eye" and a weight of -1.5 to the "sonar". For example, if the object in view is green (eye reports a 1) and small (sonar reports a 0), then the weighted sum is

1.0*1.0 + (-1.5) * 0 = 1.0 > 0

And the slug approaches the object. On the other hand, if the object is not green (eye reports a 0) and large (sonar reports a 1), the weighted sum is

1.0 * 0 + (-1.5)*1 = -1.5 < 0

and this neuron does not fire (the other one is probably firing, moving the slug away from the object.

A table of the weighted sums in each case summarizes the actions.

sonar \ eye | 0 | 1 (green) |

0 (small) | 0.0 | 1.0 |

1 (large) | -1.5 | -0.5 |

**Exercise**: Verify that this does what the earlier table says we should do.

**Exercise**: Come up with tables and weights for the "run away" neuron.

Informally, an Artificial Neural Network (ANN or "neural network") is an interconnected network of simple processing elements.

In the following, we will examine three models for neural networks:

- Perceptrons
- General feed-forward networks
- Hopfield networks

The presentation follows roughly that of Rich/Knight (see below for reference).

We begin with one of the simplest models for neural networks. A perceptron is a single unit with a number of inputs x0, x1, ... xn. With each input is associated a weight w0, w1, ... wn. Weights can be positive or negative. The input x0 is called the 'bias' input and always has the value 1.

The processing element simply forms the weighted sum \sum_{i=0}^n{xi*wi} (I'll draw this in class). If the sum is > 0, the perceptron "fires", producing an output of 1. If the sum is <= 0, the perceptron does not fire (producing an output of 0).

**Example 1:** Consider a perceptron with the bias input (x0, always set to +1), and a second input x1, with weights w0 = 0.5 and w1 = -1. Suppose that x1 = +1. Then the weighted sum is w0*x0 + w1*x1 = 0.5*1 + -1*1 = 0.5 - 1 = -0.5, and the perceptron fails to fire (i.e., returns a 0). When x1 = 0, the perceptron fires (the reader should check this), and we have the following table:

x1 | out |

0 | 1 |

1 | 0 |

If we think of '1' as "true" and '0' as "false", this is the truth table for the logical operation **NOT**.

**Exercise: **(not to hand in - all exercises in this document are simply for the help of the reader): Consider a perceptron with w0 set to -3, and w1 and w2 both set to 2. Check that the outputs of this perceptron are as in the table below:

x1 | x2 | out |

0 | 0 | 0 |

0 | 1 | 0 |

1 | 0 | 0 |

1 | 1 | 1 |

Again with the same reading for '0' and '1', this is the truth table for **AND**.

**Exercise:** Come up with weights for the **OR** function, as in the following table:

x1 | x2 | out |

0 | 0 | 0 |

0 | 1 | 1 |

1 | 0 | 1 |

1 | 1 | 1 |

Perceptrons are **classifiers**. That is, they fire or do not fire depending on whether or not a data point is in a given set. For example, a perceptron for determining whether or not students should be advised to take a music major would accept a number of inputs (perhaps including what instruments students played, how much experience on each, number of performances in high school, high school overall gpa, high school music gpa and the like - all expressed as numbers) and would fire (produce a 1) for those students who might be advised into a music major and fail to fire for others. There are several very important facts to keep in mind, however.

**Fact 1 (The perceptron learning theorem):** A perceptron can learn to do anything it can do.

The way this works is (roughly) as follows: We begin with a set of randomly chosen weights and run input through the perceptron, collecting errors. We then use the errors collected in this pass to modify the weights to make the errors less likely the next time. We continue to do this until the errors have been corrected. The surprising thing is that, if there are valid weights at all, this process will find them.

**Fact 2: ** Perceptrons can not do every classification problem.

To see how this comes about we need a bit of algebra (and since I don't know how to draw in HTML, you are going to have to do much of this on your own). Let's begin with the perceptron example which gave rise to the AND operation. We used weights w0 = -3, w1 = 2, w2 = 2. For each input x1 and x2 we formed the weighted sum

w0*x0 + w1*x1 + w2*x2

and checked to see if it was bigger than zero, giving the linear inequality

w0*x0 + w1*x1 + w2*x2 > 0

To make life easier for us, let's remember that x0 always is 1, and write 'x' for x1 and 'y' for x2. We get a somewhat more familiar

w0 + w1*x + w2*y > 0

Now the way to graph linear inequalities is to first graph the equality; in this case

w0 + w1*x + w2*y = 0

But this looks like the equation of a line! In particular, if w2 is not zero, we get the line given by

y = -(w1/w2)*x - (w0/w2)

Every point for which the perceptron fires lies on one side of this line, and every point for which it fails to fire lies on the other side of this line.

The process can be reversed with a little bit of care: if you can plot the points for which you want the perceptron to fire and those for which the perceptron should not fire (noting the difference in some fashion) then, if you can draw a line through the two sets with everything that should fire on one side of the line and everything that should not fire on the other, you have a perceptron. Simply write down the equation of the line you have drawn set it in the form with everything to one side set equal to zero, and then the coefficients of x, y, and the constant term (or their negatives) will work as weights.

With two inputs, it is easy to do this. With three inputs, our points are points in 3-dimensional space and the separating object is a (2 dimensional) plane. In four dimensions we have a separating hyperplane, and so on. We'll not play with any more than two dimensions in this class.

There is a problem, however, which brings us back to fact 2: The **XOR** problem can not be solved with a single perceptron. That is, there is no single perceptron which gives rise to the following truth table:

x1 (or X) | x2 (or Y) | out |

0 | 0 | 0 |

0 | 1 | 1 |

1 | 0 | 1 |

1 | 1 | 0 |

To see this, plot the four points and try to draw a straight line between them!

What can we do?

The solution to the problem was worked out by P. J. Werbos in a 1974 Masters' Thesis from Harvard University. Although this did not make much of an impact, the same algorithm was rediscovered and published in Rummelhart and McClelland in the two volume __Parallel Distributed Processing__ series.

To see how we might solve the problem, we begin by examining two features of perceptrons: the "**activation function**" and the "**learning algorithm**".

In perceptrons, the activation function takes the weighted sum of the inputs and produces a 1 if the weighted sum is > 0 and a 0 otherwise.

The learning algorithm (which arises from the Perceptron Learning Theorem) works in the following way:

- The weights of the perceptron are set to small random numbers.
- The inputs are run through the perceptron and the results are checked against the expected outputs for each input. The errors (inputs for which the perceptron incorrectly fired or failed to fire) are collected and are used to adjust the weights.
- This last step is repeated until all inputs are correctly classified.

The Perceptron Learning Theorem guarantees that if there is a set of weights which works, this process will find a set of weights that work.

For problems that a single perceptron can not solve, we use a network of slightly different neuron elements. We take one level of neurons from the set of inputs to a second, "hidden" set of neurons. We then take the neurons in the "hidden layer" to the set of outputs. Each neuron in the input layer is connected to every neuron in the hidden layer, and every neuron in the hidden layer is connected to each output neuron. A "bias" neuron is made part of the input and hidden layers.

(Neat picture here if I knew how to do that)

There are several changes: First, the activation function replaces the (0,1) step function of the weighted sum by modifications of one of two new functions:

F(x) = 1/(1+Exp(-x))

G(x) = 2/(1+Exp(-x))-1

There are several reasons why these functions are selected, several technical (Calculus related) and one biological (the "squashed sigmoid" appears in mathematical ecology). In these formulas, x is replaced by the weighted sum.

The second change is in the learning algorithm, which is called **backpropagation**.

**The backpropagation algorithm**

In the learning algorithm for this new neural network, we select a "training set" of inputs to the neural network and calculate the expected outputs. We then set the two sets of weights (one set from the inputs to the hidden layer, the second from the hidden layer to the outputs) to small random numbers, and do the following:

- Run the inputs through the network, comparing the outputs produced by the neural network with the expected value yielding a set of errors.
- The errors are processed, and are used first to correct the weights from the hidden layer to the output layer and then from the input layer to the hidden layer.

The process repeats until the combined errors are acceptably small. Notice that the inputs are propagated from the input layer to the output layer and that the errors are incorporated first in weights leading to the output layer and then secondly in the weights leading to the hidden layer - hence inputs are fed **forward** to outputs and errors are propagated **backwards** from the output to the input, hence **backpropagation**.

The mathematics to explain how this works (and what happens when it does not) is beyond the scope of this course. It is not difficult, but requires the tools of multivariable calculus (third semester calculus). It, like the perceptron learning algorithm, is a gradient descent search method.

There are lots of other neural architectures. Backpropagation is (I believe) the most common, but neural networks exist which do their own training (i.e., do not require the comparison of computed outputs to expected outputs which one might think of as provided by a teacher), neural networks which more closely mimic the organization of the brain, and networks which feed back on themselves instead of moving the calculations from some input nodes to output nodes. Such networks are called **recurrent networks** As a final example, we will very briefly examine one of my favorite recurrent networks: the Hopfield network.

Imagine a picture, made up of a number of pixels, each of which is either on (1) or off (-1). Suppose that our goal is to store these pictures in a network in such a way that if the image becomes corrupted in some fashion we can recover it. We do this by setting weights between pixel elements in the following way:

- If two pixels have the same value in a picture (both on or off), we set the weight between them to +1.
- If two pixels have different values (one on and one off), we set the weight between them to -1.

Our activation function works as follows. If a (possibly corrupted) picture is stored in the network, we update individual neurons (representing pixels) by computing the weighted sum of all the input neurons (where both weights and neurons take the values +1 and -1). If the weighted sum is >0 we set the neuron to +1. If the weighted sum is < 0 we set the neuron to -1. We hope that the network settles down to the desired picture.

If we have several images to store, we set the weights between pixels to a normalized weight taken from the sum of all the weights between pixels.

There are, of course, further applications of Hopfield networks (including such interesting problems as finding good solutions to the Traveling Salesperson Problem), but this is enough to give us the idea of recurrent networks.

- 1943 - McCulloch & Pitts
- 1949 - Hebbian Learning
- 1950's - Perceptrons
- Late 1960's: Minsky, M.L., and Papert, S:
__Perceptrons__(the book that almost put an end to the field) - 1974 - Werbos, P. J.: Harvard University Masters' Thesis
- 1982 - Hopfield
- 1986 - Parallel Distributed Processing (PDP): new learning rules
- Now - Explosion in field

- Rich/Knight: I
__ntroduction to Artificial Intelligence__(second edition) - Russell and Norvig:
__Artificial Intelligence__ - Wasserman, Philip D.:
__Neural Computing: Theory and Practice__. Van Nostrand Reinhold (1989) - Wasserman, Philip D.:
__Advanced Methods in Neural Computing__. Van Nostrand Reinhold 1993. - Zurada