I was at a loss, not knowing what to do, and could see no hope.
By chance I came across Professor Andrew Ng's machine learning course on Coursera and his deep learning course on UFLDL, so I settled down and worked through them: watching the videos one by one, doing the assignments one by one, writing the programs one by one. With a lot of math I didn't understand and little familiarity with Matlab, my progress at first was as slow as a snail, but after sticking with it for a few months I finally finished. To keep myself from forgetting, I am writing some of it down here. My knowledge is limited, my Python is not great, and neither is my English, so if anything is wrong or poorly put, please don't hesitate to point it out.
I am still not entirely clear about the theory behind softmax, whether it comes from information theory or from probability. For now I just want a rough picture and to start using it; the theory behind it can be filled in gradually.
The basic theory of softmax:
For a given input x, a K-class classifier outputs the probability of each class, P(y=k | x; θ), namely

$$P(y=k \mid x; \theta) = \frac{e^{\theta^{(k)T} x}}{\sum_{j=1}^{K} e^{\theta^{(j)T} x}}, \qquad k = 1, \ldots, K$$
The model parameters are θ(1), θ(2), …, θ(K) ∈ Rⁿ; it is convenient to store θ as a K-by-n matrix (where n is the dimensionality of the input x, i.e. the number of features).
The cost function of softmax regression:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} 1\{y^{(i)}=k\}\,\log\frac{e^{\theta^{(k)T} x^{(i)}}}{\sum_{j=1}^{K} e^{\theta^{(j)T} x^{(i)}}}$$
Here 1{y(i)=k} is the indicator function: its value is 1 when y(i) equals k and 0 otherwise; more generally, it is 1 when the expression inside the braces is true and 0 otherwise.
The gradient formula:

$$\nabla_{\theta^{(k)}} J(\theta) = -\frac{1}{m}\sum_{i=1}^{m} x^{(i)}\left(1\{y^{(i)}=k\} - P(y^{(i)}=k \mid x^{(i)}; \theta)\right)$$
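As a concrete illustration of these formulas (a tiny made-up example, separate from the code below), the class probabilities can be computed directly with numpy:

import numpy as np

# toy setup: K = 3 classes, n = 2 features, theta stored as a K-by-n matrix
theta = np.array([[ 0.1, -0.2],
                  [ 0.0,  0.3],
                  [-0.1,  0.1]])
x = np.array([1.5, -0.5])            # one input sample

scores = np.exp(theta.dot(x))        # exp(theta(k).T x) for each class k
P = scores / scores.sum()            # P(y=k | x; theta), the entries sum to 1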
While implementing this model I ran into two problems and was stuck on them for a while:
1. How do you implement the indicator function? My approach: convert y into a vector yv with K elements, where yv[i] = 1 if y = i and all other entries are 0. Multiplying this vector elementwise with the probabilities P in the cost function, and taking the difference between it and P in the gradient formula, implements the indicator function (see the sketch after this list).
2. I was not yet fluent with matrix operations, so vectorizing the code took quite a bit of time.
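A minimal sketch of that conversion (toy sizes, separate from the modules below):

import numpy as np

# toy labels for m = 4 samples and K = 3 classes
y = np.array([0, 2, 1, 0])
K, m = 3, y.shape[0]

y_mat = np.zeros((K, m))
y_mat[y, np.arange(m)] = 1     # y_mat[k, i] = 1{y(i) = k}

# with P holding the class probabilities (K by m):
#   cost term:     -np.sum(y_mat * np.log(P)) / m   (the product zeroes out the wrong classes)
#   gradient term:  y_mat - P                        (indicator minus probability)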
Shifting the parameters θ of the probability P by a constant gives exactly the same probabilities as before, so the parameters θ are redundant. There are two remedies. The first is to add an L2 weight-decay penalty term to the cost function and gradient, which introduces another free parameter, the penalty coefficient. The second is to fix the parameters of one class to zero, which does not affect the final classification result. My implementation uses the second approach.
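To see why θ is redundant, subtract the same vector ψ from every θ(k); the common factor cancels, so the hypothesis is unchanged (the standard argument, sketched here):

$$P(y=k \mid x; \theta^{(k)}-\psi) = \frac{e^{(\theta^{(k)}-\psi)^T x}}{\sum_{j=1}^{K} e^{(\theta^{(j)}-\psi)^T x}} = \frac{e^{\theta^{(k)T} x}\, e^{-\psi^T x}}{e^{-\psi^T x} \sum_{j=1}^{K} e^{\theta^{(j)T} x}} = P(y=k \mid x; \theta^{(k)})$$

Choosing ψ = θ(K) makes the parameters of the last class exactly zero, which is what the implementation below does.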
The gradient-checking method mentioned in the tutorial is very useful: it effectively verifies whether the cost function and the gradient are implemented correctly. Once the gradient check passes, you generally get correct results.
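The check compares each analytic partial derivative with a centered finite difference (the checkGradient function further below uses this formula with ε = 1e-6):

$$\frac{\partial J(\theta)}{\partial \theta_i} \approx \frac{J(\theta + \varepsilon e_i) - J(\theta - \varepsilon e_i)}{2\varepsilon}$$

where e_i is the i-th unit vector; the two sides should agree to many decimal places.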
The exercises in the UFLDL tutorial use Matlab; since I am not familiar enough with Matlab, I implemented everything with Python + numpy + scipy. See the comments in the code for what each part does.
The first piece of code is an abstract supervised-learning model class, which can serve as the base for supervised models such as neural networks.
import numpy as np
from dp.common.optimize import minFuncSGD
import scipy.optimize as spopt

class SupervisedLearningModel(object):

    def flatTheta(self):
        '''
        convert the weights and intercept to a 1-dim vector
        '''
        pass

    def rebuildTheta(self, theta):
        '''
        convert the 1-dim theta back to weights and intercept
        Parameters:
            theta - the vector holding the weights and intercept,
                    needed by the scipy.optimize functions
                    size: outputSize*(inputSize+1)
        '''
        pass

    def cost(self, theta, X, y):
        '''
        Used by optimizers such as fmin_cg and fmin_l_bfgs_b in scipy.optimize
        Parameters:
            theta - 1-dim vector of weights
            X - samples, numFeatures by numSamples
            y - labels, a vector with numSamples elements
        return:
            the model cost
        '''
        pass

    def gradient(self, theta, X, y):
        '''
        Used by optimizers such as fmin_cg and fmin_l_bfgs_b in scipy.optimize
        Parameters:
            theta - 1-dim vector of weights
            X - samples, numFeatures by numSamples
            y - labels, a vector with numSamples elements
        return:
            the model gradient
        '''
        pass

    def costFunc(self, theta, X, y):
        '''
        Used by optimizers such as minFuncSGD in this package
        Parameters:
            theta - 1-dim vector of weights
            X - samples, numFeatures by numSamples
            y - labels, a vector with numSamples elements
        return:
            the model cost and gradient
        '''
        pass

    def predict(self, Xtest):
        '''
        predict the test samples
        Parameters:
            Xtest - test samples, numFeatures by numSamples
        return:
            the prediction result, a vector with numSamples elements
        '''
        pass

    def performance(self, Xtest, ytest):
        '''
        Before calling this method, the model should have been trained.
        Parameters:
            Xtest - the data to be predicted, numFeatures by numSamples
            ytest - the true labels of Xtest
        return:
            the classification accuracy in percent
        '''
        pred = self.predict(Xtest)
        return np.mean(pred == ytest) * 100

    def train(self, X, y):
        '''
        Train the model.
        Parameters:
            X - samples, numFeatures by numSamples
            y - labels, a vector with numSamples elements
        '''
        theta = self.flatTheta()

        ret = spopt.fmin_l_bfgs_b(self.cost, theta, fprime=self.gradient,
                                  args=(X, y), m=200, disp=1, maxiter=100)
        opttheta = ret[0]

        # alternative optimizers:
        # opttheta = spopt.fmin_cg(self.cost, theta, fprime=self.gradient,
        #                          args=(X, y), full_output=False, disp=True, maxiter=100)
        #
        # options = {'epochs': 10, 'alpha': 2, 'minibatch': 256}
        # opttheta = minFuncSGD(self.costFunc, theta, X, y, options)

        self.rebuildTheta(opttheta)
The second piece of code defines a single neural-network layer, NNLayer, which inherits from the SupervisedLearningModel class above. It is used by both softmax regression and multi-layer neural networks.
class NNLayer(SupervisedLearningModel):
    '''
    A single layer of a neural network
    '''
    def __init__(self, inputSize, outputSize, Lambda, actFunc='sigmoid'):
        '''
        Constructor: initialize one layer with the given parameters
        Parameters:
            inputSize - the number of input elements
            outputSize - the number of output elements
            Lambda - weight decay parameter
            actFunc - the activation function: 'sigmoid', 'tanh' or 'rectifiedLinear'
        '''
        super().__init__()
        self.inputSize = inputSize
        self.outputSize = outputSize
        self.Lambda = Lambda
        self.actFunc = sigmoid
        self.actFuncGradient = sigmoidGradient

        self.input = 0        # input of this layer
        self.activation = 0   # output of this layer
        self.delta = 0        # the error term of this layer
        self.W = 0            # the weights
        self.b = 0            # the intercept (bias)

        if actFunc == 'sigmoid':
            self.actFunc = sigmoid
            self.actFuncGradient = sigmoidGradient
        if actFunc == 'tanh':
            self.actFunc = tanh
            self.actFuncGradient = tanhGradient
        if actFunc == 'rectifiedLinear':
            self.actFunc = rectifiedLinear
            self.actFuncGradient = rectifiedLinearGradient

        # initialize weights and intercept (bias);
        # the value of epsilon comes from an empirical formula
        epsilon_init = 2.4495 / np.sqrt(self.inputSize + self.outputSize) * 0.001
        theta = np.random.rand(self.outputSize, self.inputSize + 1) * 2 * epsilon_init - epsilon_init
        self.rebuildTheta(theta)

    def flatTheta(self):
        '''
        convert the weights and intercept to a 1-dim vector
        '''
        W = np.hstack((self.W, self.b))
        return W.ravel()

    def rebuildTheta(self, theta):
        '''
        overwrite the method in SupervisedLearningModel:
        convert the 1-dim theta back to weights and intercept
        Parameters:
            theta - the vector holding the weights and intercept,
                    needed by the scipy.optimize functions
                    size: outputSize*(inputSize+1)
        '''
        W = theta.reshape(self.outputSize, -1)
        self.b = W[:, -1].reshape(self.outputSize, 1)  # the bias b is a vector with outputSize elements
        self.W = W[:, :-1]

    def forward(self):
        '''
        Forward pass.
        self.input - the examples in a matrix, inputSize by numSamples
        '''
        Z = np.dot(self.W, self.input) + self.b  # weighted inputs
        self.activation = self.actFunc(Z)        # activations
        return self.activation

    def backpropagate(self):
        '''
        Compute the error term of the previous layer.
        self.input - the activations of the previous layer (the input of this layer),
                     inputSize by numSamples
        self.delta - the error term of this layer, outputSize by numSamples

        If this layer is layer l+1, this returns
            delta(l) = (W(l+1).T * delta(l+1)) .* f'(z(l))
        If this layer is the first hidden layer, this method should not be called.
        f' is rewritten in terms of the activations to avoid a second call
        to the activation function.
        '''
        return np.dot(self.W.T, self.delta) * self.actFuncGradient(self.input)

    def layerGradient(self):
        '''
        grad_W(l) = delta(l+1) * input.T / m
        grad_b(l) = sum(delta(l+1)) / m
        self.input - input of this layer, inputSize by numSamples
        self.delta - the error term of this layer
        '''
        m = self.input.shape[1]
        gw = np.dot(self.delta, self.input.T) / m
        gb = np.sum(self.delta, 1) / m
        # combine the gradients of weights and intercepts and flatten them
        grad = np.hstack((gw, gb.reshape(-1, 1)))

        return grad


def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def sigmoidGradient(a):
    # a = sigmoid(Z)
    return a * (1 - a)

def tanh(Z):
    e1 = np.exp(Z)
    e2 = np.exp(-Z)
    return (e1 - e2) / (e1 + e2)

def tanhGradient(a):
    # a = tanh(Z)
    return 1 - a ** 2

def rectifiedLinear(Z):
    a = np.zeros(Z.shape) + Z
    a[a < 0] = 0
    return a

def rectifiedLinearGradient(a):
    b = np.zeros(a.shape) + a
    b[b > 0] = 1
    return b
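As a quick illustration of how this class is used (a minimal sketch with made-up sizes, assuming NNLayer and numpy are imported):

import numpy as np

layer = NNLayer(inputSize=4, outputSize=3, Lambda=0, actFunc='sigmoid')
layer.input = np.random.rand(4, 5)    # inputSize by numSamples, as the class expects
a = layer.forward()                   # activations, outputSize by numSamples: (3, 5)

# during backpropagation the caller sets this layer's error term, then
# layerGradient() returns the gradients of W and b stacked together
layer.delta = np.random.rand(3, 5)
grad = layer.layerGradient()          # outputSize by (inputSize + 1): (3, 5)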
The third piece of code is the implementation of softmax regression; it inherits from NNLayer.
import numpy as np
from dp.supervised import NNBase
from time import time

class SoftmaxRegression(NNBase.NNLayer):
    '''
    Softmax regression.
    In this implementation the weights of the last class are fixed to zero.
    Weight decay is not used here.
    '''
    def __init__(self, numFeatures, numClasses, Lambda=0):
        '''
        Initialize the weights, intercepts and other members
        Parameters:
            numFeatures - the number of input features
            numClasses - the number of classes to be classified
            Lambda - weight decay parameter (not used here)
        '''
        # call the super constructor to initialize the weights and intercepts;
        # the weights and intercept of the last class are not needed
        super().__init__(numFeatures, numClasses - 1, Lambda, None)

        self.y_mat = 0

    def predict(self, Xtest):
        '''
        Prediction.
        Before calling this method, the model should have been trained.
        Parameters:
            Xtest - the data to be predicted, numFeatures by numSamples
        '''
        Z = np.dot(self.W, Xtest) + self.b
        # add the scores of the last class, they are all zeros
        lastClass = np.zeros((1, Xtest.shape[1]))
        Z = np.vstack((Z, lastClass))
        # the index of the max value in each column is the prediction
        return np.argmax(Z, 0)

    def forward(self):
        '''
        Compute the matrix of the softmax hypothesis.
        This method is called by the cost and gradient methods.
        '''
        h = np.dot(self.W, self.input) + self.b
        h = np.exp(h)
        # add the unnormalized probabilities of the last class, they are all ones
        h = np.vstack((h, np.ones((1, self.input.shape[1]))))
        # normalize over all classes
        hsum = np.sum(h, axis=0)
        self.activation = h / hsum
        # delta = -(y_mat - h)
        self.delta = self.activation - self.y_mat
        self.delta = self.delta[:-1, :]

        return self.activation

    def setTrainingLabels(self, y):
        # convert the vector y into a matrix y_mat:
        # for sample i, y_mat[k, i] = 1 if it belongs to class k, otherwise 0
        y = y.astype(np.int64)
        m = y.shape[0]
        yy = np.arange(m)
        self.y_mat = np.zeros((self.outputSize + 1, m))
        self.y_mat[y, yy] = 1

    def softmaxforward(self, theta, X, y):
        self.input = X
        self.setTrainingLabels(y)
        self.rebuildTheta(theta)
        return self.forward()

    def cost(self, theta, X, y):
        '''
        The cost function.
        Parameters:
            theta - the vector holding the weights and intercept,
                    needed by the scipy.optimize functions
                    size: (numClasses - 1)*(numFeatures + 1)
        '''
        h = np.log(self.softmaxforward(theta, X, y))
        # multiplying h by y_mat applies the indicator function
        cost = -np.sum(h * self.y_mat, axis=(0, 1))

        return cost / X.shape[1]

    def gradient(self, theta, X, y):
        '''
        The gradient function.
        Parameters:
            theta - the vector holding the weights and intercept,
                    needed by the scipy.optimize functions
                    size: (numClasses - 1)*(numFeatures + 1)
        '''
        self.softmaxforward(theta, X, y)

        # get the gradient
        grad = super().layerGradient()

        return grad.ravel()

    def costFunc(self, theta, X, y):
        # return both cost and gradient; used by minFuncSGD and by the gradient check
        grad = self.gradient(theta, X, y)
        h = np.log(self.activation)
        cost = -np.sum(h * self.y_mat, axis=(0, 1)) / X.shape[1]
        return cost, grad


def checkGradient(X, y):
    # compare the analytic gradient with a numerical gradient
    sm = SoftmaxRegression(X.shape[0], 10)
    theta = sm.flatTheta()
    cost, grad = sm.costFunc(theta, X, y)
    numgrad = np.zeros(grad.shape)

    e = 1e-6

    for i in range(np.size(grad)):
        theta[i] = theta[i] - e
        loss1, g1 = sm.costFunc(theta, X, y)
        theta[i] = theta[i] + 2 * e
        loss2, g2 = sm.costFunc(theta, X, y)
        theta[i] = theta[i] - e

        numgrad[i] = (loss2 - loss1) / (2 * e)

    # the average absolute difference should be very small
    print(np.sum(np.abs(grad - numgrad)) / np.size(grad))
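For example, the check can be run on a handful of random samples before training on real data (a hypothetical snippet; the test code below does the same thing with a slice of MNIST in the commented-out lines):

Xs = np.random.rand(20, 5)            # 20 features, 5 samples
ys = np.random.randint(0, 10, 5)      # labels in 0..9; checkGradient builds a 10-class model
checkGradient(Xs, ys)                 # the printed average difference should be very small (on the order of 1e-9)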
The MNIST dataset is used for testing. In my tests the accuracy is about 92.5%.
Test code:
X = np.load('../../common/trainImages.npy') / 255
y = np.load('../../common/trainLabels.npy')
# to run the gradient check on a small subset:
# X1 = X[:, :10]
# y1 = y[:10]
# checkGradient(X1, y1)
Xtest = np.load('../../common/testImages.npy') / 255
ytest = np.load('../../common/testLabels.npy')

sm = SoftmaxRegression(X.shape[0], 10)
t0 = time()
sm.train(X, y)
print('training Time %.5f s' % (time() - t0))

print('test acc :%.3f%%' % (sm.performance(Xtest, ytest)))
References:
Andrew Ng, Machine Learning course on Coursera
UFLDL Tutorial (Stanford), Softmax Regression