Rich/Knight defines Artificial Intelligence as the effort to make computers do what humans do well and that computers (for now, at least) do not do well. These tasks can be divided up in a variety of ways:

- High level/low level tasks. For example, things like playing games, understanding natural languages, proving theorems, etc. could be considered "high level" tasks, while things like recognizing how the representation of edges fit together in a scene or where one sound in a language ends and another begins might be considered "low level" tasks. The "low level" tasks are frequently harder.
- The intent of the effort (this is an issue of how we may approach the
solution to a problem). We can take the approach that while we might
gain inspiration from how people (monkeys, bears, bees, spiders) solve a
problem, but we solve the problem in any way that works, or we can attempt
to model how people (or monkeys, etc.) solve these sort of problems.
The first approach might be called the engineering approach (and is a
perfectly appropriate approach - while airplanes were inspired by birds, a
747 does not model the way a bird flies), and the second approach might be
called the
**cognitive science**approach. It is this second approach that we consider here. - The
**symbolic**versus the**connectionist**approach. This distinction between these two approaches is outlined below.

The symbolic approach to artificial intelligence views the brain as a sort of computer on which intelligence action (including consciousness) runs as a sort of computer program. This approach developed almost simultaneously with the development of computers (indeed, Alan Turing was thinking of computer programs to play chess as the first computers were being built). In this approach the claim is made (see the Physical Symbol System Hypothesis below) that intelligent action can be modeled by a computer manipulating complex data structures ("Symbol Systems" in the Newell and Simon sense), with additional sensors (think of eyes, ears, touch, etc.) and effectors (think hands). Such a combination of memory stored in complex data structures, a program which can interact with these data structures and sensors and effectors is called a Physical Symbol System. See the Newell and Simon paper for more information. See below for a citation.

The Physical Symbol System Hypothesis (Newell and Simon, *Computer Science
as empirical inquiry: symbols and search*, available through the ACM
portal on the library's Research Gateway) makes the assertion that intelligent action is the product of a physical symbol system and that therefore the intelligent action that we perform daily is explained as
the work of a physical symbol system. This is the primary assertion of
symbolic AI, sometimes called "Good old-fashioned AI (GOFAI)

However, what we have in the brain looks more like the "cold porridge" that Hugh Whitemore has Turing mention at the start of the second act of his play
*Breaking the Code (*and which is taken almost directly from Turing's 1950
paper* Computing Machinery and Intelligence *(Mind Vol. 59, No. 236, pages
433 - 460), available on JSTOR). And no matter how we dissect the brain, we are unlikely to find symbol structures. No semantic networks. No frames. Not a script in sight.

Well, says the GOFAI enthusiast, what the physical symbol system hypothesis does is to provide us an explanation for intelligent activity at the algorithm/data structure level, and the neural network that is the brain provides us with an explanation on the implementation level. But that is not a completely satisfactory explanation. In Marr's analysis of an information processor
(David Marr *Vision*, 1983), each level of explanation constrains the ones around it. And until we understand better how a physical symbol system can be implemented on a neural network perhaps we should begin with neural networks. The brain is, after all, a highly interconnected network of a very great many neurons. Perhaps, this second faction argues, we should begin with this model for computation and see where it leads us. Let's begin.

Biologically, a neuron is a unit containing a nucleus, one or more extensions accepting input from other neurons, and one or more extensions producing output to other neurons. With a great many apologies to the biologists, we model this by a unit accepting numeric values from a series of inputs, applying a function of the weighted sum of inputs to produce an output.

(a neat picture would go here if I knew how to do that)

The output (in our greatly simplified example) is either a 0 (the neuron fails to fire) or a 1 (the neuron fires).

The following example is due to Tom Fikes who taught the CogSci course several years ago with Cathy Hale, Bill Beardsley, and Bob Matthews. Suppose that we are the first explorers on a rather odd planet. The surface is bleak, and not much is going on of any interest, but there are some interesting creatures. The one that we become most interested in is a slug-like creature which feeds on small green bugs. The life of this slug-like creature is pretty good except for a somewhat larger blue bug which feeds on these interesting slugs.

The slugs have two eye-like organs on what appears to be the head of the slug. One detects color (green and blue), and one, a sonar-like device, senses size. The slug has a foot of sorts which can either propel it towards the object in view and or away from it. Further examination reveals that the nervous system of the slug is really very simple: there are only two neurons in the system! One of them is connected to the two eye-like organs and to the foot. When it is active (that is, when it fires) the slug approaches whatever is in its view (intending to eat it). The other is arranged in exactly the same way, but when it fires, the slug moves away from whatever is in its view.

How does this work? Let's take a very simple view of a slug neuron. It consists of two inputs, some simple processing taking place in the nucleus of the neuron, and an output connected to the tail. Let's consider the neuron controlling whether or not to approach an object in view. To further simplify our discussion, let us suppose that when something green (food) is in view, the "eye" activates its input to the neuron, but otherwise the "eye" does not activate its neuron. For further convenience, we will assign a value of 1 for green, and a value of 0 for anything else. Similarly, the "sonar" connected to the same neuron will activate its input to the neuron under consideration when something large is in view, but will not activate its input to our neuron otherwise.

When do we want the neuron to "fire" - that is, activate the foot into motion towards whatever is in view? We want to do that if the object in view is green, but not large. That is, we want to do that when the "eye" reports a 1 (indicating a green object), and when the "sonar" reports a 0 (indicating that the object is not large). The safest approach would be not to move towards the object if it is either large or not green. A similar discussion would tell us when we would want to activate the "running away" neuron.

So what should the "processing element" (the nucleus) do with the inputs? How can we model this numerically? One way to model what the biological nucleus will do is to do the following: Multiply each of the two inputs by previously determined numbers (called "weights") and add them up. If this "weighted sum is bigger than zero, the artificial neuron will produce a "1" and the slug will approach the object intending to eat it. Otherwise, this artificial neuron will produce a zero and will not instruct the foot to do anything. (Of course, the other neuron may be frantically telling the foot to run away.) A table may help in understanding what to do for this neuron only:

sonar \ eye | 0 | 1 (green) |

0 (small) | do nothing | approach and eat |

1 (large) | do nothing | do nothing |

We can get this same effect by assigning weights to the two inputs, forming the weighted sum of the inputs, and firing if the weighted sum is bigger than zero. Suppose that we assign a weight of +1 (it does not need to be an integer) to the "eye" and a weight of -1.5 to the "sonar". For example, if the object in view is green (eye reports a 1) and small (sonar reports a 0), then the weighted sum is

1.0*1.0 + (-1.5) * 0 = 1.0 > 0

And the slug approaches the object. On the other hand, if the object is not green (eye reports a 0) and large (sonar reports a 1), the weighted sum is

1.0 * 0 + (-1.5)*1 = -1.5 < 0

and this neuron does not fire (the other one is probably firing, moving the slug away from the (possibly dangerous) object.

A table of the weighted sums in each case summarizes the actions.

sonar \ eye | 0 | 1 (green) |

0 (small) | 0.0 | 1.0 |

1 (large) | -1.5 | -0.5 |

**Exercise**: Verify that this does what the earlier table says we
should do.

**Exercise**: Come up with tables and weights for the "run away"
neuron.

That is pretty much how artificial neurons work. The artificial neuron accepts some inputs in the form of numbers, does some processing (generally involving the weighted sum of the inputs), and produces an output. In the case of the artificial neuron described above (a perceptron), we take the weighted sum and apply a "step function" producing a 1 if the weighted sum is greater than zero and a zero otherwise. Later on we will look at a more complicated "activation function".

Really simple. Not much to it at all, and it is surprising that any of this works for anything interesting. The trick (also played by the brain) is to use a (generally) very large number of these simple processing elements (aka "neurons") with outputs of individual neurons connected to the (sometimes very many) inputs of other neurons, forming a network of a large number of these simple processing. Informally, an Artificial Neural Network (ANN or "neural network") is an interconnected network of these simple processing elements.

This characterizes the "**connectionst approach**" (as opposed to the **
"symbolic" approach**): modeling intelligent behavior using a highly
interconnected system of a (generally a large number of) simple processing
elements (Rich/Knight).

In the following, we will examine three models for neural networks:

- More on perceptrons and networks of perceptions
- General feed-forward networks and backpropagation
- Hopfield networks as a completely different example

The presentation follows roughly that of Rich/Knight chapter 9 (see below for reference).

We begin with one of the simplest models for neural networks. A perceptron is a single unit with a number of inputs x0, x1, ... xn. With each input is associated a weight w0, w1, ... wn. Weights can be positive or negative. The input x0 is called the 'bias' input and always has the value 1.

Perceptrons are (mostly) **classifiers**. That is, they fire or do not fire
depending on whether or not a data point is in a given set. For example, a
perceptron for determining whether or not students should be advised to take a
music major would accept a number of inputs (perhaps including what instruments
students played, how much experience on each, number of performances in high
school, high school overall gpa, high school music gpa and the like - all
expressed as numbers) and would fire (produce a 1) for those students who might
be advised into a music major and fail to fire for others.

The processing element simply forms the weighted sum \sum_{i=0}^n{xi*wi} (I'll draw this in class). If the sum is greater than 0, the perceptron "fires", producing an output of 1. If the sum is <= 0, the perceptron does not fire (producing an output of 0).

**Example 1:** Consider a perceptron with the bias input (x0, always set to +1), and a second input x1, with weights w0 = 0.5 and w1 = -1. Suppose that x1 = +1. Then
the weighted sum is w0*x0 + w1*x1 = 0.5*1 + -1*1 = 0.5 - 1 = -0.5, and the
perceptron fails to fire (i.e., returns a 0). When x1 = 0, the perceptron
fires (the reader should check this), and we have the following table:

x1 | out |

0 | 1 |

1 | 0 |

If we think of '1' as "true" and '0' as "false", this is the truth table for
the logical operation **NOT**.

**Exercise: **(not to hand in - all exercises in this document are simply
to aid the reader): Consider a perceptron with w0 set to -3, and
w1 and w2 both set to 2. Check that the outputs of this perceptron are as
in the table below:

x1 | x2 | out |

0 | 0 | 0 |

0 | 1 | 0 |

1 | 0 | 0 |

1 | 1 | 1 |

Again with the same reading for '0' and '1', this is the truth table for **
AND**.

**Exercise:** Come up with weights for the **OR** function, as in
the following table:

x1 | x2 | out |

0 | 0 | 0 |

0 | 1 | 1 |

1 | 0 | 1 |

1 | 1 | 1 |

There are several very important facts to keep in mind:

**Fact 1 (The perceptron learning theorem):** A perceptron can learn
to do anything it can do.

The way this works is (roughly) as follows: We begin with a set of
randomly chosen weights and run input through the perceptron, collecting errors.
We then use the errors collected in this pass to modify the weights to make the
errors less likely the next time. We continue to do this until the errors
have been corrected. The surprising thing is that, if there are valid
weights at all, this process will find them. This is an example of *
training* a neural network, and is typical of a great many neural networks.
(For calculus students: this is generally an example of *gradient
descent*)

**Fact 2: ** Perceptrons can not do every classification problem.

To see how this comes about we need a bit of algebra (and since I don't know how to draw in HTML, you are going to have to do much of this on your own). Let's begin with the perceptron example which gave rise to the AND operation. We used weights w0 = -3, w1 = 2, w2 = 2. For each input x1 and x2 we formed the weighted sum

w0*x0 + w1*x1 + w2*x2

and checked to see if it was bigger than zero, giving the linear inequality

w0*x0 + w1*x1 + w2*x2 > 0

To make life easier for us, let's remember that x0 always is 1, and write 'x' for x1 and 'y' for x2. We get a somewhat more familiar

w0 + w1*x + w2*y > 0

Now the way to graph linear inequalities is to first graph the equality; in this case

w0 + w1*x + w2*y = 0

But this looks like the equation of a line! In particular, if w2 is not zero, we get the line given by

y = -(w1/w2)*x - (w0/w2)

Every point for which the perceptron fires lies on one side of this line, and every point for which it fails to fire lies on the other side of this line.

The process can be reversed with a little bit of care: if you can plot the points for which you want the perceptron to fire and those for which the perceptron should not fire (noting the difference in some fashion) then, if you can draw a line through the two sets with everything that should fire on one side of the line and everything that should not fire on the other, you have a perceptron. Simply write down the equation of the line you have drawn set it in the form with everything to one side set equal to zero, and then the coefficients of x, y, and the constant term (or their negatives) will work as weights.

With two inputs, it is easy to do this. With three inputs, our points are points in 3-dimensional space and the separating object is a (2 dimensional) plane. In four dimensions we have a separating hyperplane, and so on. We'll not play with any more than two dimensions in this class.

There is a problem, however, which brings us back to fact 2: The **XOR**
problem can not be solved with a single perceptron. That is, there is no
single perceptron which gives rise to the following truth table:

x1 (or X) | x2 (or Y) | out |

0 | 0 | 0 |

0 | 1 | 1 |

1 | 0 | 1 |

1 | 1 | 0 |

To see this, plot the four (X, Y) points and try to draw a straight line between them!

What can we do? One way to solve the problem is to use a network of perceptrons in which the outputs of perceptions are fed into the inputs of other perceptrons:

(neat picture if I knew how to do that in HTML)

This still presents a problem. It is easy to train an individual perceptron, but how do you train a network? This problem very nearly put a stop to research in artificial neural networks from 1971 to 1986 (STS students - this story would make a great paper!). For 15 years, very little got done in artificial neural networks.

The solution to the problem was worked out by P. J. Werbos in a 1974 Masters'
Thesis from Harvard University. Although this did not make much of an
impact, the same algorithm was rediscovered and published by Rummelhart and
McClelland in the two volume __Parallel Distributed Processing__ series.

To see how we might solve the problem, we begin by examining two features of
perceptrons: the "**activation function**" and the "**learning algorithm**".

In perceptrons, the activation function takes the weighted sum of the inputs and produces a 1 if the weighted sum is > 0 and a 0 otherwise.

The learning algorithm (which arises from the Perceptron Learning Theorem) works in the following way:

- The weights of the perceptron are set to small random numbers.
- The inputs are run through the perceptron and the results are checked against the expected outputs for each input. The errors (inputs for which the perceptron incorrectly fired or failed to fire) are collected and are used to adjust the weights.
- This last step is repeated until all inputs are correctly classified.

The Perceptron Learning Theorem guarantees that if there is a set of weights which works, this process will find a set of weights that work.

For problems that a single perceptron can not solve, we use a network of slightly different neuron elements. We take one level of neurons from the set of inputs to a second, "hidden" set of neurons. We then take the neurons in the "hidden layer" to the set of outputs. Each neuron in the input layer is connected to every neuron in the hidden layer, and every neuron in the hidden layer is connected to each output neuron. A "bias" neuron is made part of the input and hidden layers.

(Neat picture here if I knew how to do that)

There are several changes: First, the activation function replaces the step function of the perceptron by modifications of one of two new functions:

F(x) = 1/(1+Exp(-x))

G(x) = 2/(1+Exp(-x))-1

There are several reasons why these functions are selected, several technical (Calculus related) and one biological (the "squashed sigmoid" appears in mathematical biology). In these formulas, x is replaced by the weighted sum of the inputs (calculated in the same way as for perceptrons)

The second change is in the learning algorithm, which is called **
backpropagation**.

**The backpropagation algorithm**

In the learning algorithm for this new neural network, we select a "training set" of inputs to the neural network and calculate the expected outputs. We then set the two sets of weights (one set from the inputs to the hidden layer, the second from the hidden layer to the outputs) to small random numbers, and do the following:

- Run the inputs through the network, comparing the outputs produced by the neural network with the expected value yielding a set of errors.
- The errors are processed, and are used first to correct the weights from the hidden layer to the output layer and then from the input layer to the hidden layer.

The process repeats until the combined errors are acceptably small.
Notice that the inputs are propagated from the input layer to the output layer
and that the errors are incorporated first in weights leading to the output
layer and then secondly in the weights leading to the hidden layer - hence
inputs are fed **forward** to outputs and errors are propagated **backwards**
from the output to the input, hence **backpropagation**.

The mathematics to explain how this works (and what happens when it does not)
is beyond the scope of this course. It is not difficult, but requires the
tools of multivariable calculus (third semester calculus). It, like the
perceptron learning algorithm, is a *gradient descent* search method.

There are lots of other neural architectures. Backpropagation is (I
believe) the most common, but neural networks exist which do their own training
(i.e., do not require the comparison of computed outputs to expected outputs
which one might think of as provided by a teacher), neural networks which more
closely mimic the organization of the brain, and networks which feed back on
themselves instead of moving the calculations from some input nodes to output
nodes. Such networks are called **recurrent networks** As a final
example, we will very briefly examine one of my favorite recurrent networks: the
Hopfield network.

Imagine a picture, made up of a number of pixels, each of which is either on (1) or off (-1). Suppose that our goal is to store these pictures in a network in such a way that if the image becomes corrupted in some fashion we can recover it. We do this by setting weights between pixel elements in the following way:

- If two pixels have the same value in a picture (both on or off), we set the weight between them to +1.
- If two pixels have different values (one on and one off), we set the weight between them to -1.

Our activation function works as follows. If a (possibly corrupted) picture is stored in the network, we update individual neurons (representing pixels) by computing the weighted sum of all the input neurons (where both weights and neurons take the values +1 and -1). If the weighted sum is >0 we set the neuron to +1. If the weighted sum is < 0 we set the neuron to -1. We hope that the network settles down to the desired picture.

If we have several images to store, we set the weights between pixels to a normalized weight taken from the sum of all the weights between pixels of the individual images.

There are, of course, further applications of Hopfield networks (including such interesting problems as finding good solutions to the Traveling Salesperson Problem), but this is enough to give us the idea of recurrent networks.

- 1943 - McCulloch & Pitts
- 1949 - Hebbian Learning
- 1950's - Perceptrons
- Late 1960's: Minsky, M.L., and Papert, S:
__Perceptrons__(the book that almost put an end to the field) - 1974 - Werbos, P. J.: Harvard University Masters' Thesis
- 1982 - Hopfield
- 1986 - Parallel Distributed Processing (PDP): new learning rules
- Now - Explosion in field

- Rich/Knight: I
__ntroduction to Artificial Intelligence__(second edition) - Russell and Norvig:
__Artificial Intelligence__ - Wasserman, Philip D.:
__Neural Computing: Theory and Practice__. Van Nostrand Reinhold (1989) - Wasserman, Philip D.:
__Advanced Methods in Neural Computing__. Van Nostrand Reinhold 1993. - Zurada

If anyone would like to pursue these ideas further, please see me (Bob Matthews).

2/10/2009

Bob Matthews

**Some notes for Friday's Exam**. The first note (and part of the
second) were discussed in Monday's lecture and are available in the course notes
(and in the above). The remainder will be discussed in Wednesday's
lecture.

- Be able to say what
**artificial intelligence**is and what it tries to do - Be able to describe a
**perceptron**. For a given perceptron and a given set of inputs, be able to say whether or not the perceptron fires. - Be able to briefly define what an
**activation****function**is - Be able to briefly describe how perceptrons (and
**artificial neural networks**in general) are**trained**. - Be able to describe the
**connectionist**approach to AI.