Coursera Deep Learning Notes
2019-11-04 13:44:16
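Mind-map notes for the Coursera Deep Learning course "Improving Deep Neural Networks", covering initialization, loss functions, optimizers, introductory TensorFlow usage, and some basic concepts and formulas.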
Outline / Content
Initialization
Zero initialization
In general, initializing all the weights to zero results in the network failing to break symmetry.
Random initialization
np.random.randn(layers_dims[l],layers_dims[l-1])
Xavier initialization
np.random.randn(layers_dims[l],layers_dims[l-1]) * np.sqrt(1/layers_dims[l-1])
The basic idea of Xavier initialization is to keep the variance of inputs and outputs consistent
He initialization
np.random.randn(layers_dims[l],layers_dims[l-1]) * np.sqrt(2/layers_dims[l-1])
For layers with a ReLU activation.
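The schemes above differ only in the factor that scales the random weights. The sketch below puts them side by side. The function name `initialize_parameters` and the `method` and `seed` arguments are assumptions made for illustration, and it uses NumPy's `default_rng` generator in place of `np.random.randn`.

```python
import numpy as np

def initialize_parameters(layers_dims, method="he", seed=0):
    """Return a dict W1..WL, b1..bL for the layer sizes in layers_dims."""
    rng = np.random.default_rng(seed)
    # Scale factor applied to standard-normal weights, per scheme.
    scale_for = {
        "zeros": lambda n_prev: 0.0,                      # fails to break symmetry
        "random": lambda n_prev: 1.0,                     # plain randn
        "xavier": lambda n_prev: np.sqrt(1.0 / n_prev),
        "he": lambda n_prev: np.sqrt(2.0 / n_prev),       # for ReLU layers
    }
    parameters = {}
    for l in range(1, len(layers_dims)):
        n_prev, n_curr = layers_dims[l - 1], layers_dims[l]
        scale = scale_for[method](n_prev)
        parameters["W" + str(l)] = rng.standard_normal((n_curr, n_prev)) * scale
        parameters["b" + str(l)] = np.zeros((n_curr, 1))  # biases start at zero
    return parameters

params = initialize_parameters([1000, 500, 1], method="he")
print(params["W1"].shape, params["W1"].std())
```

With 1000 inputs, the printed standard deviation of W1 should be close to sqrt(2/1000).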
Regularization
L2 regularization
1/m * lambd/2 * (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3)))
L2 regularization cost
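As a quick numerical check, the L2 term can be computed for small hand-made weights. This is a minimal sketch: the helper name `l2_regularization_cost` and the example matrices are assumptions, not part of the original notes. The term is added to the unregularized cost.

```python
import numpy as np

def l2_regularization_cost(weights, lambd, m):
    """L2 penalty: lambd / (2 * m) times the sum of squared weights over all layers."""
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

# Hypothetical weights: squared sums are 30 and 2, so the total is 32.
W1 = np.array([[1.0, 2.0], [3.0, 4.0]])
W2 = np.array([[1.0, -1.0]])
cost_term = l2_regularization_cost([W1, W2], lambd=0.5, m=4)
print(cost_term)  # 0.5 / 8 * 32 = 2.0
```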
Dropout
With dropout, your neurons thus become less sensitive to the activation of one other specific neuron
Forward prop: steps 1-4 are described below.
Step 1: initialize matrix D1 = np.random.rand(..., ...)
np.random.rand(np.shape(A1)[0],np.shape(A1)[1])
Step 2: convert entries of D1 to 0 or 1 (using keep_prob as the threshold)
(D1 < keep_prob)
Step 3: shut down some neurons of A1
A1 * D1
Step 4: scale the value of neurons that haven't been shut down
A1 / keep_prob
Backward prop: steps 1-2 are described below.
Step 1: Apply mask D2 to shut down the same neurons as during the forward propagation
dA2 = D2 * dA2 (applied right after dA2 = np.dot(W3.T, dZ3))
Step 2: Scale the value of neurons that haven't been shut down
dA2 = dA2 / keep_prob (applied before dZ2 = np.multiply(dA2, np.int64(A2 > 0)))
You should use dropout (randomly eliminate nodes) only in training!
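The forward and backward steps above can be sketched for a single layer as follows. The helper names `dropout_forward` and `dropout_backward` are assumptions for illustration, and the mask is drawn with NumPy's `default_rng` generator instead of `np.random.rand`.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Inverted dropout on activations A; returns the dropped activations and the mask."""
    D = rng.random(A.shape) < keep_prob  # Steps 1-2: random matrix thresholded to a mask
    A = A * D                            # Step 3: shut down the masked neurons
    A = A / keep_prob                    # Step 4: rescale so the expected value is unchanged
    return A, D

def dropout_backward(dA, D, keep_prob):
    """Apply the same mask and the same scaling to the upstream gradient."""
    return dA * D / keep_prob

rng = np.random.default_rng(0)
A1 = np.ones((50, 200))
A1_drop, D1 = dropout_forward(A1, keep_prob=0.8, rng=rng)
dA1 = dropout_backward(np.ones_like(A1), D1, keep_prob=0.8)
print(A1_drop.mean())  # stays close to 1 because of the 1 / keep_prob scaling
```

At test time neither function is called, which matches the rule that dropout is used only in training.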
Gradient Checking
$\theta^{+} = \theta + \varepsilon$
$\theta^{-} = \theta - \varepsilon$
$J^{+} = J(\theta^{+})$
$J^{-} = J(\theta^{-})$
$\mathrm{gradapprox} = \dfrac{J^{+} - J^{-}}{2\varepsilon}$
Compute the gradient using backward propagation, and store the result in a variable "grad"
numerator = np.linalg.norm(grad - gradapprox)
denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
difference = numerator / denominator
difference < 1e-7: The gradient is correct!
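The whole check can be sketched for a parameter vector as below. The function name `gradient_check` and the toy cost J(θ) = Σ θ³ are assumptions chosen so the analytic gradient (3θ²) is known; in the course setting `grad` would come from backward propagation.

```python
import numpy as np

def gradient_check(J, grad_fn, theta, epsilon=1e-7):
    """Relative difference between an analytic gradient and a two-sided numerical one."""
    grad = grad_fn(theta)
    gradapprox = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_plus[i] += epsilon
        theta_minus = theta.copy()
        theta_minus[i] -= epsilon
        gradapprox[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)
    numerator = np.linalg.norm(grad - gradapprox)
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
    return numerator / denominator

theta = np.array([1.0, -2.0, 0.5])
J = lambda th: np.sum(th ** 3)

good = gradient_check(J, lambda th: 3 * th ** 2, theta)  # correct gradient
bad = gradient_check(J, lambda th: 2 * th ** 2, theta)   # deliberately wrong gradient
print(good < 1e-7, bad < 1e-7)
```

The correct gradient should fall under the 1e-7 threshold, while the deliberately wrong one does not.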
Optimization
Batch Gradient Descent
Take gradient steps with respect to all m examples on each step.
Gradient Descent Figure
a, caches = forward_propagation(X, parameters)
Stochastic Gradient Descent
Compute gradients on just one training example at a time.
Stochastic Gradient Descent Figure
a, caches = forward_propagation(X[:,j], parameters)
Mini-Batch Gradient descent
Take each gradient step with respect to one mini-batch of mini_batch_size examples.
Mini-Batch Gradient Descent Figure
Two steps to perform Mini-Batch Gradient Descent:
Step1. Shuffle (X, Y)
permutation = list(np.random.permutation(m))
shuffled_X = X[:, permutation]
shuffled_Y = Y[:, permutation].reshape((1,m))
Step2. Partition (shuffled_X, shuffled_Y). Handle the end case.
mini_batch_X = shuffled_X[:, mini_batch_size * k : mini_batch_size * (k+1)]
end case: mini_batch_X = shuffled_X[:, mini_batch_size * num_complete_minibatches : ]
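The shuffle and partition steps above can be combined into one helper. This is a minimal sketch: the function name `random_mini_batches` and the `seed` argument are assumptions, and X and Y are taken to have one example per column, as in the slicing shown above.

```python
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """Shuffle the columns of (X, Y) together and split them into mini-batches."""
    m = X.shape[1]
    rng = np.random.default_rng(seed)

    # Step 1: shuffle X and Y with the same permutation.
    permutation = rng.permutation(m)
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1, m))

    # Step 2: partition into complete mini-batches.
    mini_batches = []
    num_complete_minibatches = m // mini_batch_size
    for k in range(num_complete_minibatches):
        start, end = mini_batch_size * k, mini_batch_size * (k + 1)
        mini_batches.append((shuffled_X[:, start:end], shuffled_Y[:, start:end]))

    # End case: the last mini-batch is smaller when m is not a multiple of the size.
    if m % mini_batch_size != 0:
        start = mini_batch_size * num_complete_minibatches
        mini_batches.append((shuffled_X[:, start:], shuffled_Y[:, start:]))
    return mini_batches

X = np.arange(20, dtype=float).reshape(2, 10)  # toy data: row 0 equals the label
Y = np.arange(10).reshape(1, 10)
batches = random_mini_batches(X, Y, mini_batch_size=4)
print([bx.shape[1] for bx, _ in batches])  # [4, 4, 2]
```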
Momentum
Momentum takes into account the past gradients to smooth out the update.
Equations of Momentum Optimization Method
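The equations themselves are not reproduced in these notes. The sketch below assumes the commonly used exponentially weighted form, v = β·v + (1 − β)·dW followed by W = W − α·v. The function name and default hyperparameters are illustrative.

```python
import numpy as np

def update_with_momentum(W, dW, v, beta=0.9, learning_rate=0.01):
    """One momentum step: smooth the gradient, then move along the smoothed direction."""
    v = beta * v + (1 - beta) * dW  # exponentially weighted average of past gradients
    W = W - learning_rate * v
    return W, v

W = np.array([1.0])
v = np.zeros_like(W)
W, v = update_with_momentum(W, dW=np.array([2.0]), v=v)
print(W, v)  # v = 0.2, W = 1 - 0.01 * 0.2 = 0.998
```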
RMSProp
Equations of RMSProp Optimization Method
Adam
Combines ideas from RMSProp and Momentum.
Equations of Adam Optimization Method
b[l] is updated by the same process as W[l].
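As an illustration of how the two ideas combine, the sketch below assumes the standard Adam update with bias correction; the equations are not reproduced in these notes, so the exact form, the function name, and the default hyperparameters are assumptions. The first moment `v` plays the role of Momentum and the second moment `s` the role of RMSProp.

```python
import numpy as np

def update_with_adam(W, dW, v, s, t, learning_rate=0.001,
                     beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam step for a single parameter array; t is the step count, starting at 1."""
    v = beta1 * v + (1 - beta1) * dW             # Momentum-style average of gradients
    s = beta2 * s + (1 - beta2) * np.square(dW)  # RMSProp-style average of squared gradients
    v_corrected = v / (1 - beta1 ** t)           # bias correction for early steps
    s_corrected = s / (1 - beta2 ** t)
    W = W - learning_rate * v_corrected / (np.sqrt(s_corrected) + epsilon)
    return W, v, s

W = np.array([1.0])
v = np.zeros_like(W)
s = np.zeros_like(W)
W, v, s = update_with_adam(W, dW=np.array([2.0]), v=v, s=s, t=1)
print(W)  # the first step moves W by roughly the learning rate
```

The same function can be called on b[l] with its own `v` and `s` arrays.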
Batch Normalization
Equations of Batch Normalization
Batch normalization normalizes the inputs to each layer (the pre-activations z) over the mini-batch to mean 0 and variance 1, then rescales and shifts them with the learnable parameters gamma and beta.
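A forward pass of this normalization can be sketched as below. The equations are not reproduced in these notes, so this assumes the usual form with a learnable scale `gamma` and shift `beta` (names not taken from the original text) and activations stored with one example per column.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, epsilon=1e-8):
    """Normalize each unit (row) of Z over the mini-batch, then scale and shift."""
    mu = Z.mean(axis=1, keepdims=True)         # per-unit mean over the mini-batch
    var = Z.var(axis=1, keepdims=True)         # per-unit variance over the mini-batch
    Z_norm = (Z - mu) / np.sqrt(var + epsilon)
    return gamma * Z_norm + beta

rng = np.random.default_rng(0)
Z = rng.normal(loc=5.0, scale=3.0, size=(4, 256))  # 4 units, 256 examples
Z_tilde = batch_norm_forward(Z, gamma=np.ones((4, 1)), beta=np.zeros((4, 1)))
print(Z_tilde.mean(axis=1), Z_tilde.var(axis=1))
```

With gamma = 1 and beta = 0 each row comes out with mean near 0 and variance near 1.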
Softmax Regression
Softmax regression generalizes logistic regression to C classes.
Equations of softmax
Loss function
t_i is the true label; y_i here is ŷ_i, the softmax output.
Derivative
Cost function
tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = ..., labels = ...))
"logits" and "labels" are expected to be of shape (number of examples, num_classes)
logits = tf.transpose(ZL)
labels = tf.transpose(Y)
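To make the softmax, loss, and cost concrete without TensorFlow, here is a NumPy sketch. It assumes the standard cross-entropy loss L = −Σ t_i log(ŷ_i) and a (num_classes, number of examples) layout, which is the layout before the transposes above; the function names are illustrative.

```python
import numpy as np

def softmax(Z):
    """Column-wise softmax; subtracting the max keeps np.exp from overflowing."""
    Z_shift = Z - Z.max(axis=0, keepdims=True)
    expZ = np.exp(Z_shift)
    return expZ / expZ.sum(axis=0, keepdims=True)

def cross_entropy_cost(Y_hat, T):
    """Mean over examples of -sum_i t_i * log(y_hat_i)."""
    m = T.shape[1]
    return -np.sum(T * np.log(Y_hat)) / m

Z = np.zeros((4, 3))             # 4 classes, 3 examples, all logits equal
T = np.eye(4)[:, [0, 2, 3]]      # one-hot labels, one column per example
Y_hat = softmax(Z)
cost = cross_entropy_cost(Y_hat, T)
dZ = Y_hat - T                   # gradient of the loss with respect to the logits
print(cost, np.log(4))           # equal logits give a cost of log(4)
```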
TensorFlow Tutorial
Initializations
tf.Variable((y - y_hat)**2, name='loss')
tf.constant(39, name='y')
init = tf.global_variables_initializer()
session.run(init)
The loss variable will be initialized and ready to be computed.
x = tf.placeholder(tf.int64, name = 'x')
sess.run(2 * x, feed_dict = {x: 3})
A placeholder is an object whose value you can specify only later.
Operations
tf.add(..., ...) to do an addition
tf.matmul(..., ...) to do a matrix multiplication
tf.multiply(...,...) to do an element-wise multiplication
Functions
sigmoid = tf.sigmoid(x)
one_hot_matrix = tf.one_hot(labels, C, axis=0)
"One Hot" encoding
One element of each column is "hot" (meaning set to 1)
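The column-wise layout can be checked with a NumPy equivalent. This is a sketch only: the helper below mirrors the name `one_hot_matrix` for readability but is not the TensorFlow call.

```python
import numpy as np

def one_hot_matrix(labels, C):
    """Return a (C, number of labels) matrix with a single 1 in each column."""
    labels = np.asarray(labels)
    one_hot = np.zeros((C, labels.size))
    one_hot[labels, np.arange(labels.size)] = 1  # row = class index, column = example
    return one_hot

labels = [1, 2, 3, 0, 2, 1]
one_hot = one_hot_matrix(labels, C=4)
print(one_hot)
```

Each column sums to 1, and the row index of the 1 is the class label of that example.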
W1 = tf.get_variable("W1", [25,12288], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
b1 = tf.get_variable("b1", [25,1], initializer = tf.zeros_initializer())
A1 = tf.nn.relu(Z1)
to apply the ReLU activation
correct_prediction = tf.equal(tf.argmax(Z3), tf.argmax(Y))
Calculate the correct predictions
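Turning the per-example comparison into an accuracy is one more averaging step. The NumPy sketch below assumes Z3 and Y have shape (num_classes, number of examples); the function name `accuracy` and the toy values are illustrative.

```python
import numpy as np

def accuracy(Z3, Y):
    """Fraction of columns where the predicted class matches the one-hot label."""
    correct_prediction = np.argmax(Z3, axis=0) == np.argmax(Y, axis=0)
    return np.mean(correct_prediction)

# Toy logits predict classes 0, 1, 2; the labels are 0, 1, 1.
Z3 = np.array([[2.0, 0.0, 1.0],
               [1.0, 3.0, 0.0],
               [0.0, 1.0, 5.0]])
Y = np.array([[1, 0, 0],
              [0, 1, 1],
              [0, 0, 0]])
acc = accuracy(Z3, Y)
print(acc)  # two of three predictions are correct
```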
Others
X_train_flatten = X_train_orig.reshape(X_train_orig.shape[0], -1).T
To flatten the training images
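A quick shape check of the flattening step, assuming a hypothetical batch of five 64x64 RGB images, so that each column has 64 * 64 * 3 = 12288 entries:

```python
import numpy as np

# Hypothetical stand-in for the image array: (number of examples, height, width, channels).
X_train_orig = np.arange(5 * 64 * 64 * 3).reshape(5, 64, 64, 3)

# Flatten each image to a row, then transpose so each column is one example.
X_train_flatten = X_train_orig.reshape(X_train_orig.shape[0], -1).T
print(X_train_flatten.shape)  # (12288, 5)
```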
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)
cost = tf.nn.sigmoid_cross_entropy_with_logits(logits = z, labels = y)
_ , minibatch_cost = sess.run([optimizer, cost],feed_dict={X:minibatch_X, Y: minibatch_Y})