

See Hinton2010: A Practical Guide to Training Restricted Boltzmann Machines Version 1



HOW TRAINING WORKS:
X: holds the samples (around 1 million).
Y: holds the classes, which should be 1, 2, 3, ... (not 0). Y is originally a single column, but FineTuning converts it to one-hot rows
such as [0 0 1], [0 1 0], etc., depending on the number of outputs.
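A minimal sketch of that label conversion (written in Python/NumPy rather than the toolbox's MATLAB; the function name is illustrative), assuming 1-based labels as above:

```python
import numpy as np

def labels_to_onehot(y, num_outputs):
    """Convert 1-based class labels (1, 2, 3, ...) into one-hot rows."""
    y = np.asarray(y)
    onehot = np.zeros((len(y), num_outputs), dtype=int)
    onehot[np.arange(len(y)), y - 1] = 1  # shift to 0-based column index
    return onehot

# With 2 outputs, class 1 becomes [1 0] and class 2 becomes [0 1].
print(labels_to_onehot([1, 2, 1], 2))
```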

There are two stages:
1) Pretraining (layerwise) <-> Restricted Boltzmann Machine (unsupervised)
2) Fine-tuning of the NN
In traditional NN training we initialize the weights randomly, and a gradient-descent algorithm that uses the samples (X) and the classes (Y)
finds the best weights. That is what FineTuning does, but now pretraining, without using Y and in a way similar to GMM training, finds a good
initialization for the weights. To see that pretraining makes sense, compare the classification error when the FineTuning weights are
initialized randomly versus with pretraining.
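Each RBM layer in the pretraining stage is trained with contrastive divergence (CD-1), as described in the Hinton guide cited above. A minimal NumPy sketch of a single CD-1 update for a binary RBM (shapes and names are illustrative, not the toolbox's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, bv, bh, v0, alpha):
    """One CD-1 update: data phase, one reconstruction, weight step."""
    h0 = sigmoid(v0 @ W + bh)                      # hidden probabilities from data
    h0_sample = (rng.random(h0.shape) < h0) * 1.0  # stochastic hidden states
    v1 = sigmoid(h0_sample @ W.T + bv)             # reconstruction of the visibles
    h1 = sigmoid(v1 @ W + bh)                      # hidden probabilities from reconstruction
    n = v0.shape[0]
    W  = W  + alpha * (v0.T @ h0 - v1.T @ h1) / n  # positive minus negative statistics
    bv = bv + alpha * (v0 - v1).mean(axis=0)
    bh = bh + alpha * (h0 - h1).mean(axis=0)
    return W, bv, bh
```

Each layer is trained this way on the activations of the layer below it, which is why no labels Y are needed.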

 
The training is good when the activation plots in figures 1, 2 and 3 look random, and in figure 4 the training error stays low while the development error starts to increase.


-------------------------------------------------------------------------------------------------------
IMPORTANT PARAMETERS:
 
Number of hidden units in different layers: [52, 7, 7] 				(in trainRBMStack/dbn.sizes)
Learning rate of the unsupervised pretraining: 0.004 				(in trainRBMStack/opts.alpha)
Max epoch of the unsupervised pretraining: 100 					(in trainRBMStack/opts.numepochs)
Momentum: 0.5 (init) 0.9 (later) 						(in trainRBMStack/opts.initMomentum, a method for increasing the speed of learning;
it determines how fast the search space is explored)
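A minimal sketch of a momentum-based weight update (Python rather than the toolbox's MATLAB; the function name is illustrative), showing why a larger momentum explores the search space faster: the velocity accumulates past gradients instead of following only the current one.

```python
def momentum_update(w, velocity, grad, alpha, momentum):
    """One gradient step with momentum.

    velocity keeps a decaying sum of past gradients; momentum (e.g. 0.5
    initially, 0.9 later) controls how much of it is kept each step.
    """
    velocity = momentum * velocity - alpha * grad
    return w + velocity, velocity
```

With momentum 0, this reduces to plain gradient descent; raising it to 0.9 later in training lets updates build up speed along consistent gradient directions.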


Learning rate of the supervised fine-tuning: 0.005 				(in finetuneDBN/nn.learningRate)	
Maximum epoch of the supervised fine-tuning: 130 				(in finetuneDBN/opts.numepochs)
Number of outputs: 2 								(in finetuneDBN/dbnunfoldtonn)
Training/Development percentage: first 0.8, then a final fine-tuning with 1.0 	(in finetuneDBN/trainPerc)
Size of the minibatch: 10 							(in finetuneDBN/batchsize; if it is x, the training takes
x samples to compute the gradient direction, then takes another new x, and so on) % the training-set size must be a multiple of the minibatch size
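A sketch of how the training set is cut into minibatches (Python/NumPy, illustrative names), including the constraint noted above that the training-set size must be a multiple of the minibatch size:

```python
import numpy as np

def minibatch_indices(num_samples, batchsize, rng):
    """Shuffle the sample indices and split them into minibatches.

    Each row of the result indexes one minibatch of `batchsize` samples;
    the training loop computes one gradient step per row.
    """
    assert num_samples % batchsize == 0, "train size must be a multiple of batchsize"
    idx = rng.permutation(num_samples)
    return idx.reshape(-1, batchsize)
```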

Weight decay: 
Sparsity target:

-----------------------------------------------------------------------------

TEST:

If nn is the network, nn.a{i} is the activation of layer i and nn.a{end} is the output posterior (the output probabilities).
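A minimal sketch of such a forward pass (Python/NumPy, assuming sigmoid hidden layers and a softmax output, and ignoring biases for brevity); the last entry of the returned list plays the role of nn.a{end}:

```python
import numpy as np

def forward(weights, x):
    """Forward pass: sigmoid hidden layers, softmax output posterior."""
    a = [x]
    for W in weights[:-1]:
        a.append(1.0 / (1.0 + np.exp(-(a[-1] @ W))))  # hidden activations
    z = a[-1] @ weights[-1]
    e = np.exp(z - z.max(axis=1, keepdims=True))      # stable softmax
    a.append(e / e.sum(axis=1, keepdims=True))        # posterior probabilities
    return a
```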

IMPROVEMENTS:

A recurrent NN takes as input the features (Ft) plus part of the output (Ot-1), and produces the output Ot. The output prediction Ot at instant t depends on
Ft and Ot-1.
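A minimal sketch of that prediction loop (Python/NumPy; step_fn stands in for any trained network that maps the concatenated input [Ft, Ot-1] to Ot, and all names are illustrative):

```python
import numpy as np

def predict_sequence(step_fn, features, o0):
    """At each instant t, feed [F_t, O_{t-1}] to the network and keep O_t."""
    outputs = [o0]                                   # O_0: initial output
    for f_t in features:
        x_t = np.concatenate([f_t, outputs[-1]])     # input depends on previous output
        outputs.append(step_fn(x_t))
    return outputs[1:]                               # O_1 ... O_T
```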
