This is a note for learning neural networks in machine learning, based on this course on Coursera.

## An overview of the main types of neural network architecture

### 1. Feed-forward neural networks

• The commonest type of neural network in practical applications
• If there is more than one hidden layer, we call them “deep” neural networks
• The activities of the neurons in each layer are a non-linear function of the activities in the layer below
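The bullet points above can be sketched as a minimal forward pass (an illustrative sketch with random weights, not a trained model; tanh stands in for the unspecified non-linearity):

```python
import numpy as np

def forward(x, layers):
    """Feed-forward pass: each layer's activities are a
    non-linear function of the activities in the layer below."""
    a = x
    for W, b in layers:
        a = np.tanh(W @ a + b)  # non-linearity applied to the layer below
    return a

# Two hidden layers -> a "deep" network by the definition above.
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((4, 3)), np.zeros(4)),  # input(3) -> hidden(4)
          (rng.standard_normal((4, 4)), np.zeros(4)),  # hidden -> hidden
          (rng.standard_normal((1, 4)), np.zeros(1))]  # hidden -> output
y = forward(np.array([0.5, -1.0, 2.0]), layers)
print(y.shape)  # (1,)
```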

### 2. Recurrent networks

• Have directed cycles in their connection graph
• Powerful and biologically realistic, but very difficult to train
• Recurrent nets with multiple hidden layers are just a special case of the general recurrent network that has some of the hidden->hidden connections missing
• They are a very natural way to model sequential data
• Have the ability to remember information for quite a long time
• An exciting example: Ilya Sutskever (2011) trained a special recurrent net to predict the next character in a sequence. After training for a long time on a string of half a billion characters from English Wikipedia, it could generate new text by predicting the probability distribution for the next character and then sampling a character from that distribution. The results are quite reasonable, with good syntax. Demo.
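The predict-then-sample loop from the example can be sketched with a toy vanilla RNN (this is only an illustration of the sampling idea with random, untrained weights and a tiny vocabulary — not Sutskever's actual model or training setup):

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = list("ab ")              # toy 3-character vocabulary
V, H = len(vocab), 8             # vocab size, hidden size
Wxh = rng.standard_normal((H, V)) * 0.1
Whh = rng.standard_normal((H, H)) * 0.1
Why = rng.standard_normal((V, H)) * 0.1

def step(h, x_idx):
    """One RNN step: update hidden state, return distribution over next char."""
    x = np.zeros(V); x[x_idx] = 1.0                   # one-hot input character
    h = np.tanh(Wxh @ x + Whh @ h)                    # hidden state carries memory
    logits = Why @ h
    p = np.exp(logits - logits.max()); p /= p.sum()   # softmax distribution
    return h, p

h, idx, out = np.zeros(H), 0, []
for _ in range(20):
    h, p = step(h, idx)
    idx = rng.choice(V, p=p)     # sample the next character from the distribution
    out.append(vocab[idx])
print("".join(out))
```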

## Perceptrons

### Standard Perceptron architecture

• The perceptron only learns the weights of the last layer. The decision unit is linear in its (hand-coded) features, so the perceptron cannot learn the features themselves.
• The perceptron uses a binary threshold neuron as its output unit.

### Learning procedure

• Pick training cases using any policy that ensures that every training case will keep getting picked.
• If the output unit is correct, leave its weights alone.
• If the output unit incorrectly outputs a zero, add the input vector to the weight vector.
• If the output unit incorrectly outputs a 1, subtract the input vector from the weight vector.
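The four rules above can be sketched directly in code (a minimal sketch: the bias is folded in as a constant-1 input, and simply cycling through the cases satisfies the "keeps getting picked" policy):

```python
import numpy as np

def train_perceptron(X, t, epochs=10):
    """Perceptron learning procedure from the rules above."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append constant-1 bias input
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):                     # cycling ensures every case keeps getting picked
        for x, target in zip(Xb, t):
            y = 1 if w @ x >= 0 else 0          # binary threshold output unit
            if y == target:
                continue                        # correct: leave weights alone
            elif target == 1:
                w += x                          # wrongly output 0: add input vector
            else:
                w -= x                          # wrongly output 1: subtract input vector
    return w

# AND is linearly separable, so the procedure converges
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 0, 0, 1])
w = train_perceptron(X, t)
preds = [1 if w @ np.append(x, 1) >= 0 else 0 for x in X]
print(preds)  # [0, 0, 0, 1]
```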

### What perceptrons can’t do

#### First example: XOR

Positive cases: (1,1) → 1; (0,0) → 1;
Negative cases: (1,0) → 0; (0,1) → 0;

It can also be shown geometrically: the two positive points (1,1) and (0,0) and the two negative points (1,0) and (0,1) sit at opposite corners of the unit square, so no single line can separate the positives from the negatives.
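The contradiction can also be made explicit algebraically. For a binary threshold unit with weights $w_1, w_2$ and threshold $\theta$, the four cases above require:

$$
\begin{aligned}
(1,1) \to 1 &\implies w_1 + w_2 \ge \theta \\
(0,0) \to 1 &\implies 0 \ge \theta \\
(1,0) \to 0 &\implies w_1 < \theta \\
(0,1) \to 0 &\implies w_2 < \theta
\end{aligned}
$$

Adding the last two inequalities gives $w_1 + w_2 < 2\theta$; combined with the first, $\theta \le w_1 + w_2 < 2\theta$, so $\theta > 0$, contradicting $0 \ge \theta$. Hence no weights and threshold can satisfy all four cases.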

#### Second example: Discriminating patterns

For two patterns like:

If we treat every individual pixel as an input, then patterns A and B produce the same output function.
The reason is that, if we do not apply any preprocessing to the input features, the perceptron cannot discriminate between translated versions of a pattern.

So we must make the network learn the features itself, which amounts to adding hidden units — but then we need an efficient way of adapting all the weights.
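Adding hidden units does solve the XOR-style task from the first example. Here is a sketch with hand-chosen (not learned) weights: two hidden binary threshold units detect the features "both on" and "both off", which makes the positive cases (1,1) and (0,0) linearly separable from the rest.

```python
def threshold(z, theta):
    """Binary threshold neuron: fires iff total input reaches theta."""
    return 1 if z >= theta else 0

def net(x1, x2):
    # Hand-chosen weights for illustration; learning them efficiently
    # is exactly the open problem the note ends on.
    h1 = threshold(x1 + x2, 1.5)      # hidden feature: "both on"  (AND)
    h2 = threshold(-x1 - x2, -0.5)    # hidden feature: "both off" (NOR)
    return threshold(h1 + h2, 0.5)    # output: either feature present (OR)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "->", net(x1, x2))
# (0,0) and (1,1) map to 1; (0,1) and (1,0) map to 0
```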