
I have the MNIST training list in the following form:

import gzip
import pickle as cPickle  # Python 3: use pickle; cPickle was the Python 2 name
import numpy as np

def load_data():
    f = gzip.open('mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = cPickle.load(f, encoding='latin1')
    f.close()
    return (training_data, validation_data, test_data)

def load_data_wrapper():
    tr_d, va_d, te_d = load_data()
    # reshape each image to a (784, 1) column vector and one-hot encode each label
    training_inputs = [np.reshape(x, (784, 1)) for x in tr_d[0]]
    training_results = [vectorized_result(y) for y in tr_d[1]]
    training_data = list(zip(training_inputs, training_results))
    ........................................

Now I would like to preprocess my training inputs to have zero mean and unit variance. So I used from sklearn import preprocessing in the following:

def SGD(self, training_data, epochs, mini_batch_size, eta,
        test_data=None):

    if test_data: n_test = len(test_data)
    # this is the call that raises the ValueError below
    preprocessed_training = preprocessing.scale(training_data)
    n = len(preprocessed_training)
    for j in range(epochs):
        random.shuffle(preprocessed_training)
        mini_batches = [
            preprocessed_training[k:k+mini_batch_size].....
            ....................

However, I'm getting the following error:

ValueError: setting an array element with a sequence.

I'm modifying code from mnielsen that can be found here. I'm new to Python and machine learning in general, and I would appreciate it if anyone could help me out. Note: if you think there is a better library option, then please let me know as well.

Update_1: This was another try of mine, which gives the same error.

    scaler = StandardScaler()
    scaler.fit(training_data)
    training_data = scaler.transform(training_data)
    if test_data: test_data = scaler.transform(test_data)

Update_2: I tried the solution provided in the suggested answer using a pandas dataframe, but I am still getting the same error.

Update_3: So the array is of object dtype, but I need float dtype to apply the scaler. I did the following: training_data = np.asarray(training_data).astype(np.float64), and I still get the error!
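
For context, here is a minimal reproduction of why that cast fails, assuming the (784, 1) input / (10, 1) label tuple structure described in Update_4 below (the fake_data name is just for illustration). NumPy cannot pack differently shaped arrays into a single float array, so depending on the NumPy version either the asarray call or the astype call raises the same ValueError:

    import numpy as np

    # two fake training examples in the (input, label) tuple format
    fake_data = [(np.zeros((784, 1), dtype=np.float32), np.zeros((10, 1))),
                 (np.zeros((784, 1), dtype=np.float32), np.zeros((10, 1)))]

    # the (784, 1) inputs and (10, 1) labels have incompatible shapes, so NumPy
    # either builds an object array (older versions) or refuses outright; the
    # float cast then fails with "setting an array element with a sequence"
    arr = np.asarray(fake_data).astype(np.float64)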

Update_4: General MNIST dataset structure: 50k training images and 10k test images. Each of the 50k images is 28 * 28 pixels, which gives 784 data points. For example, if a data point's original output is 5, then it is the tuple (array([ 0., 0., 0., ..., 0., 0., 0.], dtype=float32), 5). You can see that the first element of the tuple is a sparse (mostly zero) array: the input image as 784 greyscale floats. For the second element of the tuple, we just give the output as a number 0 through 9. However, in one-hot encoding, we instead give a 10D vector whose entries are all zeros except at the index of the output value. So for the number 5 it will be [[0],[0],[0],[0],[0],[1],[0],[0],[0],[0]]. The wrapper modification that I'm using can be found here.
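
For reference, a minimal sketch of the vectorized_result helper used in load_data_wrapper above, matching the one-hot structure just described:

    def vectorized_result(j):
        # return a (10, 1) one-hot column vector with a 1.0 at index j
        e = np.zeros((10, 1))
        e[j] = 1.0
        return e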

Raafi

2 Answers


I do this in a bit of a different way. Recall that you must scale your training and testing sets with the same transformation, which is fit on all of your training data. Also, you only want to manipulate your features. I would start by converting to train and test dataframes and a list of feature columns.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# fit the scaler on the training features only, then apply it to both sets
scaler = StandardScaler()
scaler.fit(train[features])
X_train = pd.DataFrame(scaler.transform(train[features]), columns=train[features].columns)
X_test = pd.DataFrame(scaler.transform(test[features]), columns=test[features].columns)
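
For the tuple-structured data in the question, one hypothetical way to build such a dataframe (the names features, train, and training_data here are assumptions for illustration, not a prescribed API):

    import numpy as np
    import pandas as pd

    # flatten each (784, 1) input into one row of 784 feature columns,
    # and recover the integer label from the (10, 1) one-hot vector
    features = ['pixel%d' % i for i in range(784)]
    train = pd.DataFrame([x.ravel() for x, y in training_data], columns=features)
    train['label'] = [int(np.argmax(y)) for x, y in training_data]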

Does this work? Is there a reason you need to use batches?

Keith
  • I'm following along with the online book {http://neuralnetworksanddeeplearning.com/chap3.html}. The writer used mini-batches instead of inputting the whole dataset at once to expedite the learning process. I also think it's important. – Raafi Nov 24 '17 at 00:11
  • I would suggest you simplify and look to see if you get what you expect at each step. Dataframes will help. Your data is built in the wrong way to be read properly; your error is a result of this. Look at this post: https://stackoverflow.com/questions/4674473/valueerror-setting-an-array-element-with-a-sequence Your error is likely unrelated to machine learning. – Keith Nov 24 '17 at 18:02
  • I used a pandas dataframe as you said, but I'm still getting the same error! – Raafi Nov 24 '17 at 22:30
  • Look at df.dtypes and the content of each column. The error implies that one of them is not formatted correctly. – Keith Nov 24 '17 at 22:34
  • Sorry, I cannot help more than this. The error strictly says you are giving a list (i.e. [1,2,3]) when it expects a single number. This is why I thought a dataframe would help make it more visually clear, or that it was an issue with your types (strings are lists of chars). – Keith Nov 25 '17 at 00:10
  • So I have a list of object-type tuples (a, b), where 'a' is a float32 2D array of shape (784, 1) and 'b' is an int 2D array of shape (10, 1), with b[5] = 1 if the number is 5 in MNIST. How do I preprocess this data type? – Raafi Nov 25 '17 at 02:04
  • In your dataframe each column should have one object per row. Nothing else will work. You said before that this was a float. A list of floats will not work. I did not understand your above comment. Update your post to show the first 10 rows of your dataframe. – Keith Nov 25 '17 at 03:39
  • Show your dataframe as asked. It should be 784 wide for your features, but just a few columns and rows are fine. – Keith Nov 25 '17 at 07:17

The problem I was having was that StandardScaler from sklearn.preprocessing changes the dimension of my data. Instead of using StandardScaler, I just used preprocessing.scale on each input in my (50k, (784, 1))-dim dataset. That is, I applied the scale function to each (784, 1) input on axis = 1 and collected the results with a for loop. This slowed down the program but worked. If anyone knows a better way, please let me know in the answer section.
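
For completeness, here is a sketch of that per-image approach next to a vectorized alternative, assuming the training_data list of (input, label) tuples from the question's wrapper. Note that the 784 pixel values of a (784, 1) column vector run along axis 0, which is the axis preprocessing.scale standardizes by default:

    import numpy as np
    from sklearn import preprocessing

    # per-image loop: standardize each (784, 1) input separately
    scaled_training = [(preprocessing.scale(x, axis=0), y)
                       for x, y in training_data]

    # vectorized alternative: stack the inputs into one (50000, 784) matrix,
    # standardize every image in a single call, then rebuild the tuples
    X = np.hstack([x for x, y in training_data]).T   # shape (50000, 784)
    X = preprocessing.scale(X, axis=1)               # per-image mean 0, var 1
    scaled_training = [(row.reshape(784, 1), y)
                       for row, (x, y) in zip(X, training_data)]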

Raafi