Assignment write-up for the 2022 edition of CS231n

CS231n course details page

Q1: k-Nearest Neighbor classifier

If you have wget installed, skip this note. If you are on Windows, download the CIFAR-10 dataset from the download link and place it in ./cs231n/datasets/cifar-10-batches-py; running the first code cell to download the dataset is not recommended.
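
If you would rather fetch the archive from Python than rely on wget, a minimal sketch (assuming the standard CIFAR-10 mirror URL and the default dataset directory used below):

import os
import tarfile
import urllib.request

# The standard CIFAR-10 (python version) archive; verify the URL before relying on it.
URL = 'http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz'
DEST = 'cs231n/datasets'

archive = os.path.join(DEST, 'cifar-10-python.tar.gz')
if not os.path.exists(os.path.join(DEST, 'cifar-10-batches-py')):
    os.makedirs(DEST, exist_ok=True)
    urllib.request.urlretrieve(URL, archive)  # downloads a ~170 MB archive
    with tarfile.open(archive) as tar:
        tar.extractall(DEST)                  # unpacks into cifar-10-batches-py/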

Import the required libraries

# Run some setup code for this notebook.

import random
import numpy as np
from cs231n.data_utils import load_CIFAR10
import matplotlib.pyplot as plt

# Embed matplotlib figures directly in the Jupyter notebook
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use: %reload_ext autoreload

Load the CIFAR-10 data

# Set the dataset path to match where you downloaded it
cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'

# Clear previously loaded data; outside of Jupyter this step is unnecessary
try:
    del X_train, y_train
    del X_test, y_test
    print('Clear previously loaded data.')
except:
    pass

X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)

# Sanity-check the dimensions of the training and test data to catch loading errors
print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)

As expected, the training set has 50,000 images and the test set has 10,000, all of them 32x32 three-channel (RGB) images.

Training data shape:  (50000, 32, 32, 3)
Training labels shape:  (50000,)
Test data shape:  (10000, 32, 32, 3)
Test labels shape:  (10000,)

Let's visualize a few examples

# Class labels
classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
num_classes = len(classes)
samples_per_class = 7
for y, cls in enumerate(classes):
    idxs = np.flatnonzero(y_train == y)  # flatnonzero returns the indices of nonzero entries, i.e. every sample whose label is y
    idxs = np.random.choice(idxs, samples_per_class, replace=False)  # draw samples_per_class samples from idxs without repeats
    for i, idx in enumerate(idxs):  # draw the subplots
        plt_idx = i * num_classes + y + 1  # subplots are numbered row-major: sample i goes to row i, class y to column y
        plt.subplot(samples_per_class, num_classes, plt_idx)
        plt.imshow(X_train[idx].astype('uint8'))
        plt.axis('off')
        if i == 0:
            plt.title(cls)
plt.show()

Subsample the dataset

# Keep 5000 training samples and 500 test samples, then flatten each image's last three dimensions (32x32x3 = 3072) into one row
num_training = 5000
# mask = list(range(num_training))
# X_train = X_train[mask]  # official starter code
# y_train = y_train[mask]

X_train = X_train[:num_training]
y_train = y_train[:num_training]  # plain slicing does the same thing

num_test = 500

X_test = X_test[:num_test]
y_test = y_test[:num_test]

# Reshape the image data into rows
X_train = np.reshape(X_train, (X_train.shape[0], -1))
X_test = np.reshape(X_test, (X_test.shape[0], -1))
print(X_train.shape, X_test.shape)

(5000, 3072) (500, 3072)

Train the nearest neighbor classifier

Let's first look at the training code.

It looks like nothing happens, right? Training here simply memorizes the data.

def train(self, X, y):
    """
    Train the classifier. For k-nearest neighbors this is just
    memorizing the training data.

    Inputs:
    - X: A numpy array of shape (num_train, D) containing the training data
      consisting of num_train samples each of dimension D.
    - y: A numpy array of shape (N,) containing the training labels, where
      y[i] is the label for X[i].
    """
    self.X_train = X
    self.y_train = y

Training

from cs231n.classifiers import KNearestNeighbor

# Create a kNN classifier instance.
# Remember that training a kNN classifier is a noop:
# the Classifier simply remembers the data and does no further processing
classifier = KNearestNeighbor()     # instantiate the classifier
classifier.train(X_train, y_train)  # "train" it, i.e. store the data

We would now like to classify the test data with the kNN classifier. Recall that we can break down this process into two steps:

  1. First we must compute the distances between all test examples and all train examples.
  2. Given these distances, for each test example we find the k nearest examples and have them vote for the label.

Let's begin with computing the distance matrix between all training and test examples. For example, if there are Ntr training examples and Nte test examples, this stage should result in a Nte x Ntr matrix where each element (i,j) is the distance between the i-th test and j-th train example.

Note: For the three distance computations that we require you to implement in this notebook, you may not use the np.linalg.norm() function that numpy provides.

First, open cs231n/classifiers/k_nearest_neighbor.py and implement the function compute_distances_two_loops that uses a (very inefficient) double loop over all pairs of (test, train) examples and computes the distance matrix one element at a time.

Compute the distance matrix

We need to write our own routine to compute the L2 (Euclidean) distance between each test image and every training image, producing a distance matrix for later use.
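
For reference, with each image flattened to a D = 3072-dimensional vector, the entry we want is

$$ \mathrm{dists}[i, j] = \lVert X_i - X^{\mathrm{train}}_j \rVert_2 = \sqrt{\sum_{p=1}^{D} \bigl(X_{i,p} - X^{\mathrm{train}}_{j,p}\bigr)^2 } $$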

# I'd call this the no-brainer approach
def compute_distances_two_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a nested loop over both the training data and the
    test data.

    Inputs:
    - X: A numpy array of shape (num_test, D) containing test data.

    Returns:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      is the Euclidean distance between the ith test point and the jth training
      point.
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        for j in range(num_train):
            #####################################################################
            # TODO:                                                             #
            # Compute the l2 distance between the ith test point and the jth    #
            # training point, and store the result in dists[i, j]. You should   #
            # not use a loop over dimension, nor use np.linalg.norm().          #
            #####################################################################
            # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

            distance = np.sqrt(np.sum(np.square(X[i] - self.X_train[j])))
            dists[i, j] = distance

            # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    return dists

# Broadcasting takes care of this one
def compute_distances_one_loop(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a single loop over the test data.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        #######################################################################
        # TODO:                                                               #
        # Compute the l2 distance between the ith test point and all training #
        # points, and store the result in dists[i, :].                        #
        # Do not use np.linalg.norm().                                        #
        #######################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        distance = np.sqrt(np.sum(np.square(X[i] - self.X_train), 1))  # X[i] broadcasts against every training row
        dists[i] = distance

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    return dists
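
As a quick standalone illustration of the broadcasting at work here (toy shapes, not assignment code): X[i] has shape (D,) while self.X_train has shape (num_train, D), so the subtraction stretches X[i] across every training row.

import numpy as np

X_train = np.arange(12, dtype=float).reshape(4, 3)    # 4 "training" points with D = 3
x = np.array([1.0, 2.0, 3.0])                         # one "test" point, shape (3,)

diff = x - X_train                                    # (3,) broadcasts against (4, 3) -> (4, 3)
row_dists = np.sqrt(np.sum(np.square(diff), axis=1))  # shape (4,): distance to each training point
print(row_dists)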

# My brain is about to overload; tracing through it mentally it should be correct, and a pen-and-paper check follows below
def compute_distances_no_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using no explicit loops.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    #########################################################################
    # TODO:                                                                 #
    # Compute the l2 distance between all test points and all training     #
    # points without using any explicit loops, and store the result in     #
    # dists.                                                                #
    #                                                                       #
    # You should implement this function using only basic array operations;#
    # in particular you should not use functions from scipy,               #
    # nor use np.linalg.norm().                                            #
    #                                                                       #
    # HINT: Try to formulate the l2 distance using matrix multiplication   #
    # and two broadcast sums.                                              #
    #########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    mid = -2 * np.dot(X, self.X_train.T)      # cross term: -2 * X . X_train^T
    pre = np.sum(np.square(X), 1)             # squared norms of the test rows
    las = np.sum(np.square(self.X_train), 1)  # squared norms of the training rows
    dists = np.sqrt(las[np.newaxis, :] + pre[:, np.newaxis] + mid)
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    return dists
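
The pen-and-paper check promised in the comment above: expanding the squared L2 distance splits it into one matrix product and two broadcast sums, which is exactly what the code computes.

$$ \lVert X_i - X^{\mathrm{train}}_j \rVert_2^2 = \lVert X_i \rVert_2^2 + \lVert X^{\mathrm{train}}_j \rVert_2^2 - 2\, X_i \cdot X^{\mathrm{train}}_j $$

In the code, pre holds the squared norms of the test rows, las those of the training rows, and mid the cross term -2 * X.dot(X_train.T); broadcasting pre as a column of shape (num_test, 1) against las as a row of shape (1, num_train) fills the full (num_test, num_train) matrix without a Python loop.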

Inline Question 1

Notice the structured patterns in the distance matrix, where some rows or columns are visibly brighter. (Note that with the default color scheme black indicates low distances while white indicates high distances.)

The distance matrix shows structured patterns: black indicates low distance and white indicates high distance.

  • What in the data is the cause behind the distinctly bright rows?
  • What causes the columns?
  1. The distance matrix measures the pixel-level difference between each test image and each training image, so a bright row means that test image is far from almost every training image. The likely cause is that the test image is an outlier in pixel space, e.g. an almost pure-white or pure-black image, which differs greatly from all of the training images.
  2. By the same reasoning, a bright column corresponds to a training image that is far from almost all of the test images, i.e. an outlier within the training set.

Writing the prediction function

def predict_labels(self, dists, k=1):
    """
    Given a matrix of distances between test points and training points,
    predict a label for each test point.

    Inputs:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      gives the distance between the ith test point and the jth training point.

    Returns:
    - y: A numpy array of shape (num_test,) containing predicted labels for the
      test data, where y[i] is the predicted label for the test point X[i].
    """
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in range(num_test):
        # A list of length k storing the labels of the k nearest neighbors to
        # the ith test point.
        closest_y = []
        #########################################################################
        # TODO:                                                                 #
        # Use the distance matrix to find the k nearest neighbors of the ith    #
        # testing point, and use self.y_train to find the labels of these       #
        # neighbors. Store these labels in closest_y.                           #
        # Hint: Look up the function numpy.argsort.                             #
        #########################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        arg_dists = np.argsort(dists[i])     # indices of row i sorted by ascending distance
        closest_y = self.y_train[arg_dists]  # training labels in that order

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        #########################################################################
        # TODO:                                                                 #
        # Now that you have found the labels of the k nearest neighbors, you    #
        # need to find the most common label in the list closest_y of labels.   #
        # Store this label in y_pred[i]. Break ties by choosing the smaller     #
        # label.                                                                #
        #########################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        label = closest_y[:k]       # the k nearest labels vote
        count = np.bincount(label)  # count how often each label occurs
        result = np.argmax(count)   # argmax returns the first maximum, i.e. the smaller label on ties
        y_pred[i] = result          # store the result
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    return y_pred
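
A quick standalone check of the voting logic (toy labels, not assignment code): np.bincount counts how often each label occurs, and np.argmax returns the first maximum, so ties are indeed broken toward the smaller label, as the TODO requires.

import numpy as np

votes = np.array([2, 1, 1, 2, 0])  # labels of the k nearest neighbors
count = np.bincount(votes)         # [1, 2, 2]: label 0 once, labels 1 and 2 twice each
print(np.argmax(count))            # prints 1: the tie between labels 1 and 2 breaks toward 1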

Prediction

# Now implement the function predict_labels and run the code below:
# We use k = 1 (which is Nearest Neighbor).
y_test_pred = classifier.predict_labels(dists, k=1)

# Compute and print the fraction of correctly predicted examples
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))

Got 137 / 500 correct => accuracy: 0.274000

Accuracy is around 27%, which is perfectly normal for this classifier.

y_test_pred = classifier.predict_labels(dists, k=5)
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))

Got 139 / 500 correct => accuracy: 0.278000

Performance improves slightly (two more correct 😅), which is also expected.

Inline Question 2

(The question lists several preprocessing steps and asks which of them will change the performance of a nearest neighbor classifier that uses the L1 distance.)

  1. Subtracting the same global mean from every image leaves all L1 distances unchanged.
  2. Subtracting a per-pixel mean likewise leaves all L1 distances unchanged.
  3. Subtracting the mean has no effect, and dividing by the (global, shared) standard deviation scales every L1 distance by the same factor, which does not change their ordering.
  4. Unlike 3, dividing each pixel by its own standard deviation rescales different coordinates by different amounts, so it can change the L1 ordering and therefore the predictions.
  5. Unlike the L2 distance, the L1 distance changes when the coordinate axes are rotated. Try drawing the 2D case and then generalizing: the L1 distance between two points can grow or shrink under rotation, and by different amounts for different pairs, so it can affect the results (see the sketch below).
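
A tiny standalone sketch of point 5 with toy 2D points (my own illustration, not assignment code): rotating the coordinate frame by 45 degrees changes the L1 distance between two points while leaving their L2 distance intact.

import numpy as np

theta = np.pi / 4  # rotate the coordinate axes by 45 degrees
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

a, b = np.array([1.0, 0.0]), np.array([0.0, 0.0])
a_r, b_r = R @ a, R @ b  # the same two points expressed in the rotated frame

print(np.sum(np.abs(a - b)), np.sum(np.abs(a_r - b_r)))              # L1: 1.0 vs ~1.414
print(np.sqrt(np.sum((a - b)**2)), np.sqrt(np.sum((a_r - b_r)**2)))  # L2: 1.0 vs 1.0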

Let's verify that our implementations are correct

We compare two distance matrices with the Frobenius norm: take the elementwise differences, square them, sum, and take the square root.
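
Concretely, for two matrices A and B of the same shape:

$$ \lVert A - B \rVert_F = \sqrt{\sum_i \sum_j \bigl(A_{ij} - B_{ij}\bigr)^2} $$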

# Verify that our one-loop implementation is correct
# Now lets speed up distance matrix computation by using partial vectorization
# with one loop. Implement the function compute_distances_one_loop and run the
# code below:
dists_one = classifier.compute_distances_one_loop(X_test)

# To ensure that our vectorized implementation is correct, we make sure that it
# agrees with the naive implementation. There are many ways to decide whether
# two matrices are similar; one of the simplest is the Frobenius norm. In case
# you haven't seen it before, the Frobenius norm of two matrices is the square
# root of the squared sum of differences of all elements; in other words, reshape
# the matrices into vectors and compute the Euclidean distance between them.
difference = np.linalg.norm(dists - dists_one, ord='fro')
print('One loop difference was: %f' % (difference, ))
if difference < 0.001:
    print('Good! The distance matrices are the same')
else:
    print('Uh-oh! The distance matrices are different')

One loop difference was: 0.000000
Good! The distance matrices are the same

# Verify that our no-loop (fully vectorized) implementation is correct
# Now implement the fully vectorized version inside compute_distances_no_loops
# and run the code
dists_two = classifier.compute_distances_no_loops(X_test)

# check that the distance matrix agrees with the one we computed before:
difference = np.linalg.norm(dists - dists_two, ord='fro')
print('No loop difference was: %f' % (difference, ))
if difference < 0.001:
    print('Good! The distance matrices are the same')
else:
    print('Uh-oh! The distance matrices are different')

No loop difference was: 0.000000
Good! The distance matrices are the same

Benchmark the implementations

# Let's compare how fast the implementations are
def time_function(f, *args):
    """
    Call a function f with args and return the time (in seconds) that it took to execute.
    """
    import time
    tic = time.time()
    f(*args)
    toc = time.time()
    return toc - tic

two_loop_time = time_function(classifier.compute_distances_two_loops, X_test)
print('Two loop version took %f seconds' % two_loop_time)

one_loop_time = time_function(classifier.compute_distances_one_loop, X_test)
print('One loop version took %f seconds' % one_loop_time)

no_loop_time = time_function(classifier.compute_distances_no_loops, X_test)
print('No loop version took %f seconds' % no_loop_time)

# You should see significantly faster performance with the fully vectorized implementation!

# NOTE: depending on what machine you're using,
# you might not see a speedup when you go from two loops to one loop,
# and might even see a slow-down.

Two loop version took 30.608045 seconds
One loop version took 46.391969 seconds
No loop version took 0.177659 seconds

The gap is considerable. My CPU is an Intel i7-10875H; on a slower machine the gap will be even larger.

Cross-validation

Write the cross-validation code

num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]

X_train_folds = []
y_train_folds = []
################################################################################
# TODO:                                                                        #
# Split up the training data into folds. After splitting, X_train_folds and   #
# y_train_folds should each be lists of length num_folds, where               #
# y_train_folds[i] is the label vector for the points in X_train_folds[i].    #
# Hint: Look up the numpy array_split function.                               #
################################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

X_train_folds = np.array(np.array_split(X_train, num_folds))  # split the training set into folds
y_train_folds = np.array(np.array_split(y_train, num_folds))

# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

# A dictionary holding the accuracies for different values of k that we find
# when running cross-validation. After running cross-validation,
# k_to_accuracies[k] should be a list of length num_folds giving the different
# accuracy values that we found when using that value of k.
k_to_accuracies = {}


################################################################################
# TODO:                                                                        #
# Perform k-fold cross validation to find the best value of k. For each       #
# possible value of k, run the k-nearest-neighbor algorithm num_folds times,  #
# where in each case you use all but one of the folds as training data and    #
# the last fold as a validation set. Store the accuracies for all fold and    #
# all values of k in the k_to_accuracies dictionary.                          #
################################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

for i in k_choices:  # cross-validate each value of k
    k_to_accuracies[i] = []
    for j in range(num_folds):
        # fold j is the validation set; the remaining folds form the training set
        test_data = X_train_folds[j]
        train_data = X_train_folds[[t for t in range(num_folds) if t != j], :].reshape(-1, 3072)
        test_label = y_train_folds[j]
        train_label = y_train_folds[[t for t in range(num_folds) if t != j], :].flatten()
        classifier.train(train_data, train_label)
        dists_cross = classifier.compute_distances_no_loops(test_data)

        predict_labels = classifier.predict_labels(dists_cross, k=i)
        num_correct = np.sum(test_label == predict_labels)
        num_test = len(test_data)
        accuracy = float(num_correct) / num_test
        k_to_accuracies[i].append(accuracy)
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

# Print out the computed accuracies
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))

Partial results:

k = 10, accuracy = 0.284000
k = 10, accuracy = 0.280000
k = 12, accuracy = 0.260000
k = 12, accuracy = 0.295000
k = 12, accuracy = 0.279000
k = 12, accuracy = 0.283000
k = 12, accuracy = 0.280000
k = 15, accuracy = 0.252000
k = 15, accuracy = 0.289000
k = 15, accuracy = 0.278000
k = 15, accuracy = 0.282000
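
As a side note, a minimal alternative sketch for assembling the folds (same variables as above; my own variant, not the official solution): np.concatenate merges the held-out folds directly, keeps the folds as a plain list, and avoids hard-coding the 3072 feature dimension.

X_folds = np.array_split(X_train, num_folds)  # plain Python list of arrays
y_folds = np.array_split(y_train, num_folds)

for j in range(num_folds):
    X_val, y_val = X_folds[j], y_folds[j]                 # fold j is the validation set
    X_tr = np.concatenate(X_folds[:j] + X_folds[j + 1:])  # the remaining folds form the training set
    y_tr = np.concatenate(y_folds[:j] + y_folds[j + 1:])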

Visualize the results

# plot the raw observations
for k in k_choices:
    accuracies = k_to_accuracies[k]
    plt.scatter([k] * len(accuracies), accuracies)

# plot the trend line with error bars that correspond to standard deviation
accuracies_mean = np.array([np.mean(v) for k,v in sorted(k_to_accuracies.items())])
accuracies_std = np.array([np.std(v) for k,v in sorted(k_to_accuracies.items())])
plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
plt.title('Cross-validation on k')
plt.xlabel('k')
plt.ylabel('Cross-validation accuracy')
plt.show()

This is the cross-validation visualization: each scatter point is the accuracy of one fold for that value of k, and the trend line shows the mean accuracy with error bars giving the standard deviation.

Pick the best hyperparameter

# Based on the cross-validation results above, choose the best value for k,   
# retrain the classifier using all the training data, and test it on the test
# data. You should be able to get above 28% accuracy on the test data.
# Based on the visualized results above, pick the best value of k to reach 28% accuracy
best_k = 10
num_test = 500  # num_test was overwritten (set to 1000) inside the cross-validation loop, so reset it here
classifier = KNearestNeighbor()
classifier.train(X_train, y_train)

y_test_pred = classifier.predict(X_test, k=best_k)

# Compute and display the accuracy
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))

Got 141 / 500 correct => accuracy: 0.282000

Inline Question 3

Which of the following statements about k-Nearest Neighbor (k-NN) are true in a classification setting, and for all k? Select all that apply.

  1. The decision boundary of the k-NN classifier is linear.
  2. The training error of a 1-NN will always be lower than or equal to that of 5-NN.
  3. The test error of a 1-NN will always be lower than that of a 5-NN.
  4. The time needed to classify a test example with the k-NN classifier grows with the size of the training set.
  5. None of the above.
  1. False. As the cover image of this post shows, the decision boundary is not necessarily linear.
  2. True. Under 1-NN, each training point's nearest neighbor is itself (distance 0), so the 1-NN training error is 0, which can never be higher than that of 5-NN.
  3. False. Test error depends on the data; 1-NN tends to overfit, so its test error is not always lower than that of 5-NN.
  4. True. Classifying a test example requires computing its distance to every training image, so prediction time grows with the size of the training set.
  5. Not selected, since 2 and 4 are true.