The credit for this answer goes to Antoni Parellada in the comments, whose contribution I think deserves a more prominent place on this page (as it helped me out when many other answers did not). Also, this is not a full derivation but rather a clear statement of $\frac{\partial J(\theta)}{\partial \theta}$. (For the full derivation, see the other answers.)
$$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{m} \cdot X^T \left(\sigma(X\theta) - y\right)$$
where
$$\begin{aligned}
X &\in \mathbb{R}^{m \times n} = \text{training example matrix} \\
\sigma(z) &= \frac{1}{1 + e^{-z}} = \text{sigmoid function} = \text{logistic function} \\
\theta &\in \mathbb{R}^{n} = \text{weight row vector} \\
y &= \text{class/category/label corresponding to rows in } X
\end{aligned}$$
Also, here is a Python implementation for those wanting to calculate the gradient of $J$ with respect to $\theta$.
import numpy as np

def sig(z):
    # logistic (sigmoid) function: 1 / (1 + e^(-z))
    return 1 / (1 + np.exp(-z))

def compute_grad(X, y, w):
    """
    Compute gradient of the cross-entropy cost with sigmoidal probabilities

    Args:
        X (numpy.ndarray): examples. Individuals in rows, features in columns
        y (numpy.ndarray): labels. Vector corresponding to rows in X
        w (numpy.ndarray): weight vector

    Returns:
        numpy.ndarray: gradient of J with respect to w, one entry per feature
    """
    m = X.shape[0]
    Z = X.dot(w)                     # linear scores X·w, shape (m,)
    A = sig(Z)                       # predicted probabilities sigma(X·w)
    return (1 / m) * X.T.dot(A - y)  # (1/m) · X^T (sigma(Xw) - y), as above
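
As a quick sanity check (not from the original answer), here is a minimal sketch comparing `compute_grad` against a central finite-difference approximation of the cost on small random data. The `cost` helper is a hypothetical implementation of the usual cross-entropy $J(\theta)$, and it assumes the `sig` and `compute_grad` definitions above:

import numpy as np

def cost(X, y, w):
    # hypothetical helper: the usual cross-entropy cost J(w)
    A = sig(X.dot(w))
    m = X.shape[0]
    return -(y * np.log(A) + (1 - y) * np.log(1 - A)).sum() / m

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))      # 5 examples, 3 features
y = rng.integers(0, 2, size=5)   # binary labels in {0, 1}
w = rng.normal(size=3)

grad = compute_grad(X, y, w)
eps = 1e-6
# central finite difference along each coordinate direction
num = np.array([
    (cost(X, y, w + eps * e) - cost(X, y, w - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(grad, num))    # True: analytic gradient matches numeric

If the analytic and numeric gradients disagree here, the most likely culprit is a sign or scaling error in the gradient formula.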