Mohith's blog

Detection of Ransomware

Here's a breakdown of each line of the code:
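For reference, here is the full import block these notes walk through, reconstructed verbatim from the breakdown below:

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
import os
from keras.callbacks import ModelCheckpoint
import pickle
from keras.layers import LSTM
from keras.utils.np_utils import to_categorical
from keras.layers import MaxPooling2D
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D
from keras.models import Sequential, Model
from tensorflow.keras.layers import Conv2D
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFECV
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier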

  1. import numpy as np: Imports the NumPy library and aliases it as np for ease of use.
  2. import pandas as pd: Imports the pandas library and aliases it as pd.
  3. import seaborn as sns: Imports the seaborn library and aliases it as sns.
  4. from sklearn.preprocessing import LabelEncoder: Imports the LabelEncoder class from scikit-learn's preprocessing module.
  5. import os: Imports the os module for operating system-related functions.
  6. from keras.callbacks import ModelCheckpoint: Imports the ModelCheckpoint callback from Keras for saving model weights during training.
  7. import pickle: Imports the pickle module for serializing and deserializing Python objects.
  8. from keras.layers import LSTM: Imports the LSTM class from Keras for Long Short-Term Memory networks.
  9. from keras.utils.np_utils import to_categorical: Imports the to_categorical function from Keras for one-hot encoding.
  10. from keras.layers import MaxPooling2D: Imports the MaxPooling2D layer from Keras for 2D max pooling.
  11. from keras.layers import Dense, Dropout, Activation, Flatten: Imports various layers from Keras for building neural networks.
  12. from keras.layers import Convolution2D: Imports the Convolution2D layer from Keras for 2D convolutions.
  13. from keras.models import Sequential, Model: Imports the Sequential and Model classes from Keras for defining neural network models.
  14. from tensorflow.keras.layers import Conv2D: Imports the Conv2D layer from TensorFlow for 2D convolutions.
  15. from sklearn.preprocessing import MinMaxScaler: Imports the MinMaxScaler class from scikit-learn for feature scaling.
  16. from sklearn.metrics import accuracy_score: Imports the accuracy_score function from scikit-learn for computing accuracy.
  17. from sklearn.model_selection import train_test_split: Imports the train_test_split function from scikit-learn for splitting data into training and testing sets.
  18. from sklearn.feature_selection import RFECV: Imports the RFECV class from scikit-learn for feature selection.
  19. from sklearn.metrics import precision_score: Imports the precision_score function from scikit-learn for computing precision.
  20. from sklearn.metrics import recall_score: Imports the recall_score function from scikit-learn for computing recall.
  21. from sklearn.metrics import f1_score: Imports the f1_score function from scikit-learn for computing F1 score.
  22. from sklearn.metrics import confusion_matrix: Imports the confusion_matrix function from scikit-learn for computing confusion matrices.
  23. import matplotlib.pyplot as plt: Imports the pyplot module from matplotlib for creating plots.
  24. from sklearn.tree import DecisionTreeClassifier: Imports the DecisionTreeClassifier class from scikit-learn for decision tree classification.
  25. from sklearn.ensemble import RandomForestClassifier: Imports the RandomForestClassifier class from scikit-learn for random forest classification.

scaler = MinMaxScaler(feature_range=(0, 1))

scaler: The variable that refers to the MinMaxScaler object; it is used later to access the scaler's methods and attributes.

MinMaxScaler: A class from scikit-learn's preprocessing module used for scaling features to a given range, typically between 0 and 1.

feature_range=(0, 1): Specifies the range to which the features will be scaled. For example, if the original data has values from 0 to 100, after scaling those values are mapped to the range 0 to 1.

In summary, this line creates a MinMaxScaler object named scaler that will be used to normalize the training features to the range 0 to 1, so that features on different scales do not dominate the learning process.
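To make the 0-to-100 example concrete, here is a minimal sketch with made-up numbers (demo and demo_scaler are illustrative names, not part of the project code):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

demo = np.array([[0.0], [50.0], [100.0]])  # hypothetical feature values
demo_scaler = MinMaxScaler(feature_range=(0, 1))
print(demo_scaler.fit_transform(demo))  # [[0.], [0.5], [1.]]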

#load and display dataset values
dataset = pd.read_csv("Dataset/hpc_io_data.csv")
dataset.fillna(0, inplace=True)  #replace missing values
dataset

pd.read_csv("Dataset/hpc_io_data.csv"): Uses pandas' read_csv function to read the CSV file "hpc_io_data.csv" from the "Dataset" directory into a DataFrame named dataset.

dataset.fillna(0, inplace=True): Replaces any missing values (NaN) in the DataFrame with 0. The inplace=True parameter modifies the DataFrame directly instead of creating a new copy.

The overall purpose of these lines is to load the dataset from a CSV file, handle missing values by replacing them with 0, and keep the cleaned data in the dataset variable for further analysis, preprocessing, or model training.
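A tiny self-contained illustration of fillna(0), using a made-up DataFrame rather than the HPC dataset:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 4.0]})
df.fillna(0, inplace=True)
print(df)  # the two NaN cells are now 0.0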

#find and plot counts of ransomware and benign samples in the dataset,
#where label 0 refers to benign and label 1 to ransomware
#plot labels in dataset
labels, count = np.unique(dataset['label'], return_counts=True)
labels = ['Benign', 'Ransomware']
height = count
bars = labels
y_pos = np.arange(len(bars))
plt.bar(y_pos, height)
plt.xticks(y_pos, bars)
plt.xlabel("Dataset Class Label Graph")
plt.ylabel("Count")
plt.show()

Let's break down the code line by line, with each parameter explained:

  1. labels, count = np.unique(dataset['label'], return_counts=True): This line uses NumPy's np.unique function to get unique labels from the 'label' column of the dataset along with their counts. It returns two arrays, one containing the unique labels (labels) and the other containing their respective counts (count).

  2. labels = ['Benign', 'Ransomware']: This line manually sets the labels to be used in the plot. In this case, 'Benign' corresponds to label 0 and 'Ransomware' corresponds to label 1.

  3. height = count: This sets the height of the bars in the bar graph. Each bar's height represents the count of occurrences of each label in the dataset.

  4. bars = labels: This sets the labels for the bars in the bar graph. In this case, 'Benign' and 'Ransomware' are used as labels for the bars.

  5. y_pos = np.arange(len(bars)): This line creates an array of positions for the bars along the x-axis (despite the variable's name). NumPy's arange function generates the integers from 0 up to (but not including) len(bars), which determines where each bar is placed in the plot.

  6. plt.bar(y_pos, height): This line creates the bar graph using Matplotlib's bar function. It takes y_pos as the x positions of the bars and height as their corresponding heights.

  7. plt.xticks(y_pos, bars): This sets the labels on the x-axis at the positions specified by y_pos. The labels are set to 'Benign' and 'Ransomware' as defined earlier in the labels array.

  8. plt.xlabel("Dataset Class Label Graph"): This sets the x-axis label of the plot to "Dataset Class Label Graph".

  9. plt.ylabel("Count"): This sets the label for the y-axis of the plot, representing the count of occurrences of each label.

  10. plt.show(): This line displays the plot in the output, showing the distribution of 'Benign' and 'Ransomware' labels in the dataset as a bar graph.

#dataset preprocessing such as normalization and shuffling
data = dataset.values
X = data[:, 1:data.shape[1]-1]
Y = data[:, data.shape[1]-1]
Y = Y.astype(int)

indices = np.arange(X.shape[0])
np.random.shuffle(indices)  #shuffle dataset values
X = X[indices]
Y = Y[indices]

scaler = MinMaxScaler((0, 1))
X = scaler.fit_transform(X)  #normalize or transform features
print("Normalized Features")
print(X)

Let's break down each part of the code snippet:

  1. data = dataset.values: This line extracts the values from the dataset, converting it into a NumPy array named data. Each row of data represents a data point, and each column represents a feature or attribute.

  2. X = data[:,1:data.shape[1]-1]: Here, you're selecting a subset of columns from data to create your feature matrix X. The syntax data[:,1:data.shape[1]-1] means you're taking all rows (:) and columns from index 1 up to (but not including) the last column (data.shape[1]-1). This is typically done to exclude any non-feature columns like IDs or labels.

  3. Y = data[:,data.shape[1]-1]: This line selects the last column of data, which usually contains the target variable or labels, and assigns it to the variable Y.

  4. Y = Y.astype(int): This converts the data type of the elements in Y to integers. This step is often necessary if the labels were read as strings or another data type, ensuring they are integers for modeling purposes.

  5. indices = np.arange(X.shape[0]): This line creates an array of indices from 0 to the number of rows in X. This will be used for shuffling the dataset.

  6. np.random.shuffle(indices): This shuffles the array of indices randomly. This step is crucial for randomizing the order of data points, which helps prevent the model from learning patterns based on the order of the data.

  7. X = X[indices] and Y = Y[indices]: These lines use the shuffled indices to shuffle the rows of X and Y accordingly. Now, the rows of X and Y are in a random order.

  8. scaler = MinMaxScaler((0,1)): This creates a MinMaxScaler object named scaler with a feature range specified from 0 to 1. The MinMaxScaler is used to scale (normalize) the features to a specific range, in this case, between 0 and 1.

  9. X = scaler.fit_transform(X): This line scales (normalizes) the feature matrix X using the MinMaxScaler scaler. It transforms each feature such that its values are within the specified range (0 to 1 in this case).

  10. print("Normalized Features") and print(X): These lines print out a message indicating that the features have been normalized and then print the normalized feature matrix X to the console, showing the transformed values of the features after normalization.

In summary, this code snippet performs dataset preprocessing steps such as extracting features and labels, shuffling the dataset, and normalizing the features using MinMaxScaler. These steps are common in machine learning workflows to prepare the data for training models.
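As a toy illustration of the index-based shuffling in steps 5–7 (made-up arrays, not the project data), note how the rows of X and Y stay paired:

import numpy as np

X_demo = np.array([[1], [2], [3]])
Y_demo = np.array([10, 20, 30])
idx = np.arange(X_demo.shape[0])
np.random.shuffle(idx)
print(X_demo[idx].ravel(), Y_demo[idx])  # e.g. [2 1 3] [20 10 30] -- still aligned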

#split dataset into train and test
random_seed = 42
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=random_seed)
print()
print("Dataset train & test split as 80% dataset for training and 20% for testing")
#print training and test size
print("Training Size (80%): " + str(X_train.shape[0]))
print("Testing Size (20%): " + str(X_test.shape[0]))
print()

random_seed = 42: Sets the random seed to 42. Setting a seed ensures that the split produced by train_test_split is reproducible: running the code multiple times with the same seed yields the same split.

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=random_seed): Splits the features X and labels Y into training and testing sets. The parameters:

X: The feature matrix.
Y: The target labels.
test_size=0.2: 20% of the data is held out for testing; the remaining 80% is used for training.
random_state=random_seed: Makes the split deterministic, so the same split is produced on every run. (train_test_split also shuffles the data by default, so the explicit shuffle above is an extra safeguard.)

The print statements display information about the split: the first is a general description, while the next two print the number of samples in the training set (80% of the data) and the testing set (20% of the data).

In summary, the code splits the dataset 80/20 into training (X_train, y_train) and testing (X_test, y_test) sets, with random_seed ensuring the split is reproducible.

def calculateMetrics(algorithm, predict, testY):
    p = precision_score(testY, predict, average='macro') * 100
    r = recall_score(testY, predict, average='macro') * 100
    f = f1_score(testY, predict, average='macro') * 100
    a = accuracy_score(testY, predict) * 100
    print()
    print(algorithm + ' Accuracy : ' + str(a))
    print(algorithm + ' Precision : ' + str(p))
    print(algorithm + ' Recall : ' + str(r))
    print(algorithm + ' FMeasure : ' + str(f))
    accuracy.append(a)
    precision.append(p)
    recall.append(r)
    fscore.append(f)
    conf_matrix = confusion_matrix(testY, predict)
    plt.figure(figsize=(5, 5))
    ax = sns.heatmap(conf_matrix, xticklabels=labels, yticklabels=labels, annot=True, cmap="viridis", fmt="g")
    ax.set_ylim([0, len(labels)])
    plt.title(algorithm + " Confusion matrix")
    plt.ylabel('True class')
    plt.xlabel('Predicted class')
    plt.show()

Breaking down the calculateMetrics function:

  1. p = precision_score(testY, predict, average='macro') * 100: Calculates the precision of the predictions against the true labels (testY). The average='macro' parameter computes precision for each label and takes their unweighted mean; the result is multiplied by 100 to express it as a percentage.

  2. r = recall_score(testY, predict, average='macro') * 100: Calculates the recall score in the same way, taking the unweighted mean of per-label recall and converting it to a percentage.

  3. f = f1_score(testY, predict, average='macro') * 100: Computes the F1 score, again as the unweighted mean of per-label F1 scores, expressed as a percentage.

  4. a = accuracy_score(testY, predict) * 100: Computes the accuracy of the predictions and converts it to a percentage.

  5. accuracy.append(a), precision.append(p), recall.append(r), fscore.append(f): Appends the calculated accuracy, precision, recall, and F1 score to their respective lists (see the note below).

  6. conf_matrix = confusion_matrix(testY, predict): Computes the confusion matrix from the true labels (testY) and the predicted labels (predict).

  7. plt.figure(figsize=(5, 5)): Initializes a new figure for plotting the confusion matrix.

  8. ax = sns.heatmap(conf_matrix, xticklabels=labels, yticklabels=labels, annot=True, cmap="viridis", fmt="g"): Draws the confusion matrix as a seaborn heatmap, annotating each cell with its count and using the viridis colormap.

  9. ax.set_ylim([0, len(labels)]): Sets the y-axis limits of the heatmap based on the number of unique labels.

  10. plt.title(algorithm + " Confusion matrix"), plt.ylabel('True class'), plt.xlabel('Predicted class'): Set the plot title (including the algorithm's name) and the axis labels.

  11. plt.show(): Displays the confusion matrix plot.
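One caveat: calculateMetrics appends to four lists that are never defined in the snippets shown here. Presumably the notebook initializes them once, before any model is evaluated; a minimal sketch of that assumed setup:

#assumed one-time setup (not shown in the original snippets)
accuracy = []
precision = []
recall = []
fscore = []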

Let's break down the parameters and their definitions in the given code snippet:

# now train decision tree classifier with hyper parameters
dt_cls = DecisionTreeClassifier(criterion="entropy", max_leaf_nodes=2, max_features="auto")
dt_cls.fit(X_train, y_train)
predict = dt_cls.predict(X_test)
calculateMetrics("Decision Tree", predict, y_test)
  1. DecisionTreeClassifier: This is the scikit-learn class for training decision tree classifiers (its counterpart for regression tasks is DecisionTreeRegressor).

  2. criterion="entropy": This parameter specifies the criterion used for splitting nodes in the decision tree. "Entropy" is a measure of impurity used to decide the best split.

  3. max_leaf_nodes=2: This parameter caps the number of leaf nodes in the decision tree, controlling its size. With only two leaves, the tree reduces to a single split (a decision stump); an inspection sketch follows after this list.

  4. max_features="auto": This parameter determines the number of features considered when looking for the best split. For DecisionTreeClassifier, "auto" historically meant sqrt(n_features), not all features; it is deprecated in recent scikit-learn releases, where "sqrt" should be used instead.

  5. X_train: This is the feature matrix of the training data. It contains the input features used to train the decision tree model.

  6. y_train: This is the target vector of the training data. It contains the corresponding labels or target values for the training instances.

  7. X_test: This is the feature matrix of the test data. It contains the input features used to evaluate the trained decision tree model.

  8. y_test: This is the target vector of the test data. It contains the actual labels or target values for the test instances, which will be used to evaluate the model's performance.

  9. dt_cls.fit(X_train, y_train): This line trains the decision tree classifier (dt_cls) using the training data (X_train and y_train).

  10. predict = dt_cls.predict(X_test): This line uses the trained decision tree classifier to make predictions on the test data (X_test). The predicted labels are stored in the variable predict.

  11. calculateMetrics("Decision Tree", predict, y_test): This line calls the calculateMetrics function, passing the algorithm name ("Decision Tree"), the predicted labels (predict), and the actual labels from the test data (y_test) as parameters. This function calculates and prints various evaluation metrics such as accuracy, precision, recall, F1 score, and displays the confusion matrix for the decision tree model's performance evaluation.
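Because max_leaf_nodes=2 forces a single split, it is easy to inspect what the stump actually learned. A quick sketch (assuming dt_cls has been fitted as above):

#print the single decision rule the stump learned
from sklearn.tree import export_text
print(export_text(dt_cls))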

#training Random Forest algorithm
rf = RandomForestClassifier(criterion='gini', max_features="log2", min_weight_fraction_leaf=0.2, max_depth=1.2)
rf.fit(X_train, y_train)
predict = rf.predict(X_test)
calculateMetrics("Random Forest", predict, y_test)

criterion='gini': Specifies the criterion used to measure the quality of a split in each tree of the Random Forest. 'gini' refers to Gini impurity, which measures how often a randomly chosen element would be incorrectly labeled.

max_features="log2": The maximum number of features considered when looking for the best split; "log2" means log2(n_features) features are considered at each split.

min_weight_fraction_leaf=0.2: The minimum weighted fraction of the total sample weight required at a leaf node. It helps control overfitting.

max_depth=1.2: The maximum depth of each tree in the Random Forest. Deeper trees yield more complex models that can overfit the training data. However, 1.2 is not a valid value here: max_depth must be an integer (or None), and recent scikit-learn versions raise an error for a float.
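A minimal corrected variant (using an assumed integer depth of 2; any small integer, or None for unlimited depth, would be valid):

rf = RandomForestClassifier(criterion='gini', max_features="log2", min_weight_fraction_leaf=0.2, max_depth=2)
rf.fit(X_train, y_train)
predict = rf.predict(X_test)
calculateMetrics("Random Forest", predict, y_test)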

#train DNN algorithm
y_train1 = to_categorical(y_train)
y_test1 = to_categorical(y_test)
#define DNN object
dnn_model = Sequential()
#add DNN layers
dnn_model.add(Dense(2, input_shape=(X_train.shape[1],), activation='relu'))
dnn_model.add(Dense(2, activation='relu'))
dnn_model.add(Dropout(0.3))
dnn_model.add(Dense(y_train1.shape[1], activation='softmax'))

#compile the Keras model
dnn_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
#start training model on train data and perform validation on test data
#train and load the model
if os.path.exists("model/dnn_weights.hdf5") == False:
    model_check_point = ModelCheckpoint(filepath='model/dnn_weights.hdf5', verbose=1, save_best_only=True)
    hist = dnn_model.fit(X_train, y_train1, batch_size=32, epochs=10, validation_data=(X_test, y_test1), callbacks=[model_check_point], verbose=1)
    f = open('model/dnn_history.pckl', 'wb')
    pickle.dump(hist.history, f)
    f.close()
else:
    dnn_model.load_weights("model/dnn_weights.hdf5")
#perform prediction on test data
predict = dnn_model.predict(X_test)
predict = np.argmax(predict, axis=1)
testY = np.argmax(y_test1, axis=1)
calculateMetrics("DNN", predict, testY)  #call function to calculate accuracy and other metrics

Let's break down each part with a definition of its parameters:

  1. y_train1 and y_test1: These variables are created by converting the categorical labels y_train and y_test into one-hot encoded format using to_categorical(y_train) and to_categorical(y_test). One-hot encoding converts categorical data into a binary matrix where each label/category is represented as a binary vector (a tiny example follows after this list).

  2. dnn_model: This is a variable that represents a Deep Neural Network (DNN) model. It's created as a Sequential model in Keras, meaning the layers are stacked sequentially.

  3. dnn_model.add(Dense(2, input_shape=(X_train.shape[1],), activation='relu')): Adds a dense (fully connected) layer to the DNN model. Here, 2 represents the number of neurons in this layer, input_shape=(X_train.shape[1],) defines the input shape based on the number of features in the training data, and activation='relu' sets the activation function to Rectified Linear Unit (ReLU).

  4. dnn_model.add(Dense(2, activation='relu')): Adds another dense layer with 2 neurons and ReLU activation. This is a hidden layer in the neural network.

  5. dnn_model.add(Dropout(0.3)): Adds a dropout layer with a dropout rate of 0.3. Dropout is a regularization technique that randomly sets a fraction of input units to 0 during training to prevent overfitting.

  6. dnn_model.add(Dense(y_train1.shape[1], activation='softmax')): Adds the output layer with neurons equal to the number of classes/categories in the one-hot encoded labels (y_train1). The activation function is set to softmax, which is commonly used for multi-class classification as it gives probabilities for each class.

  7. dnn_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']): Compiles the DNN model with categorical cross-entropy as the loss function (suitable for multi-class classification), the Adam optimizer for optimization, and accuracy as the metric to monitor during training.

  8. model_check_point: This is a Keras ModelCheckpoint callback that saves the best model weights during training based on validation loss or another specified criterion.

  9. Training the model:

    • Checks whether the model weights file ('model/dnn_weights.hdf5') exists. If not, it trains the model (dnn_model.fit) on the training data (X_train, y_train1) for 10 epochs with a batch size of 32, validating on the test data (X_test, y_test1), and saves only the best weights (save_best_only=True), judged by validation loss, ModelCheckpoint's default monitor.
    • If the weights file already exists, it loads the weights (dnn_model.load_weights) from the file.
  10. predict = dnn_model.predict(X_test): Performs predictions on the test data (X_test) using the trained DNN model.

  11. predict = np.argmax(predict, axis=1): Converts the predicted probabilities into class labels by taking the index of the highest probability along the specified axis (axis=1).

  12. testY = np.argmax(y_test1, axis=1): Converts the one-hot encoded true labels (y_test1) back to class labels for comparison.

  13. calculateMetrics("DNN", predict, testY): Calls the calculateMetrics function to evaluate the performance of the DNN model using predicted labels (predict) and true labels (testY).
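The tiny one-hot example promised in step 1, with made-up labels:

from keras.utils.np_utils import to_categorical

print(to_categorical([0, 1, 1, 0]))
# [[1. 0.]
#  [0. 1.]
#  [0. 1.]
#  [1. 0.]]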

#now train LSTM algorithm
X_train1 = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
X_test1 = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))

lstm_model = Sequential()  #define deep learning sequential object
#add LSTM layer with 32 units to filter the given X train input and select relevant features
lstm_model.add(LSTM(32, input_shape=(X_train1.shape[1], X_train1.shape[2])))
#add dropout layer for regularization
lstm_model.add(Dropout(0.2))
#add another layer
lstm_model.add(Dense(32, activation='relu'))
#define output layer for prediction
lstm_model.add(Dense(y_train1.shape[1], activation='softmax'))
#compile LSTM model
lstm_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
#start training model on train data and perform validation on test data
#train and load the model
if os.path.exists("model/lstm_weights.hdf5") == False:
    model_check_point = ModelCheckpoint(filepath='model/lstm_weights.hdf5', verbose=1, save_best_only=True)
    hist = lstm_model.fit(X_train1, y_train1, batch_size=32, epochs=10, validation_data=(X_test1, y_test1), callbacks=[model_check_point], verbose=1)
    f = open('model/lstm_history.pckl', 'wb')
    pickle.dump(hist.history, f)
    f.close()
else:
    lstm_model.load_weights("model/lstm_weights.hdf5")
#perform prediction on test data
predict = lstm_model.predict(X_test1)
predict = np.argmax(predict, axis=1)
testY = np.argmax(y_test1, axis=1)
calculateMetrics("LSTM", predict, testY)  #call function to calculate accuracy and other metrics

Let's break down each parameter and the operations within this code:

  1. X_train1: This variable represents the reshaped training data for the LSTM model. It reshapes the input data to have a third dimension of size 1, which is necessary for LSTM layers in Keras. The shape of X_train1 is (number of samples, time steps, features).

  2. X_train.shape[0]: This is the number of samples in the training data.

  3. X_train.shape[1]: This represents the number of time steps in the training data.

  4. X_test1: Similar to X_train1, this variable represents the reshaped test data for the LSTM model.

  5. X_test.shape[0]: Number of samples in the test data.

  6. X_test.shape[1]: Number of time steps in the test data.

  7. lstm_model: This variable represents the Sequential model for the LSTM algorithm. It's a deep learning model where layers are added sequentially.

  8. Sequential(): Creates a new sequential model in Keras.

  9. lstm_model.add(LSTM(32, input_shape=(X_train1.shape[1], X_train1.shape[2]))): Adds an LSTM layer to the model with 32 units (neurons). The input_shape parameter specifies the input shape, which is (time steps, features).

  10. lstm_model.add(Dropout(0.2)): Adds a dropout layer with a dropout rate of 0.2. Dropout helps in regularization by randomly setting a fraction of input units to 0 during training to prevent overfitting.

  11. lstm_model.add(Dense(32, activation='relu')): Adds a dense layer (fully connected layer) with 32 units and ReLU activation function.

  12. lstm_model.add(Dense(y_train1.shape[1], activation='softmax')): Adds the output layer with the number of units equal to the number of classes in the target variable (y_train1). Softmax activation is used for multi-class classification to output probabilities for each class.

  13. lstm_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']): Compiles the LSTM model with categorical cross-entropy loss (suitable for multi-class classification), Adam optimizer, and accuracy as the metric to monitor during training.

  14. model_check_point: This variable is used to create a ModelCheckpoint callback, which saves the model weights during training if certain conditions are met (e.g., improvement in validation loss).

  15. hist = lstm_model.fit(...): Trains the LSTM model on the training data (X_train1, y_train1) for a specified number of epochs (10 in this case) with batch size 32. Validation data (X_test1, y_test1) is used to validate the model's performance during training. The training history (hist) is stored for later analysis.

  16. pickle.dump(hist.history, f): Saves the training history (loss and accuracy per epoch) to a pickle file for future reference (see the plotting sketch after this list).

  17. lstm_model.predict(X_test1): Performs predictions on the test data (X_test1).

  18. predict = np.argmax(predict, axis=1): Converts the predicted probabilities into class labels by selecting the class with the highest probability.

  19. testY = np.argmax(y_test1, axis=1): Converts the true labels (y_test1) from one-hot encoded format to class labels.

  20. calculateMetrics("LSTM", predict, testY): Calls the calculateMetrics function to evaluate the performance of the LSTM model using predicted and true labels.
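Following up on step 16, here is a sketch of reloading the pickled history and plotting the training curves. It assumes the history file above exists and that the history keys are 'accuracy'/'val_accuracy' (older Keras versions used 'acc'/'val_acc'):

with open('model/lstm_history.pckl', 'rb') as f:
    history = pickle.load(f)

plt.plot(history['accuracy'], label='train accuracy')
plt.plot(history['val_accuracy'], label='validation accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()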

# Define regularization strength
dropout_rate = 0.2  # Adjust this value as needed

# Reshape the data
X_train1 = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1, 1))
X_test1 = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1, 1))

# Define the CNN model
cnn_model = Sequential()

# Add a CNN layer with 64 filters
cnn_model.add(Convolution2D(64, (1, 1), input_shape=(X_train1.shape[1], X_train1.shape[2], X_train1.shape[3]), activation='relu'))

# Add MaxPooling layer
cnn_model.add(MaxPooling2D(pool_size=(1, 1)))

# Add another CNN layer with 32 filters
cnn_model.add(Convolution2D(32, (1, 1), activation='relu'))

# Add MaxPooling layer
cnn_model.add(MaxPooling2D(pool_size=(1, 1)))

# Flatten the output
cnn_model.add(Flatten())

# Add Dropout regularization
cnn_model.add(Dropout(dropout_rate))

# Add Dense layers
cnn_model.add(Dense(units=256, activation='relu'))
cnn_model.add(Dense(units=y_train1.shape[1], activation='softmax'))

# Compile the model
cnn_model.compile(optimizer='Adagrad', loss='categorical_crossentropy', metrics=['accuracy'])

# Train and save the model
if not os.path.exists("model/cnn_weights.hdf5"):
    model_check_point = ModelCheckpoint(filepath='model/cnn_weights.hdf5', verbose=1, save_best_only=True)
    hist = cnn_model.fit(X_train1, y_train1, batch_size=40, epochs=800, validation_data=(X_test1, y_test1), callbacks=[model_check_point], verbose=1)
    with open('model/cnn_history.pckl', 'wb') as f:
        pickle.dump(hist.history, f)
else:
    cnn_model.load_weights("model/cnn_weights.hdf5")

# Perform prediction on test data
predict = cnn_model.predict(X_test1)
predict = np.argmax(predict, axis=1)
testY = np.argmax(y_test1, axis=1)
calculateMetrics("CNN2D", predict, testY)

Let's break down each line of the code and explain its purpose, along with the definitions of the parameters used:

  1. dropout_rate = 0.2: This line defines the dropout rate, which is a regularization technique used to prevent overfitting in neural networks. It randomly sets a fraction of input units to 0 during training to reduce dependency on specific features.

  2. X_train1 = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1, 1)): Reshapes the training data (X_train) into a shape suitable for the 2D CNN. The shape becomes (number of samples, number of features, 1, 1), so each feature vector is treated as a single-channel "image" with height equal to the number of features and width 1.

  3. X_test1 = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1, 1)): Reshapes the test data (X_test) in the same way as the training data.

  4. cnn_model = Sequential(): Creates a Sequential model in Keras for the CNN architecture. Layers will be added sequentially to this model.

  5. cnn_model.add(Convolution2D(64, (1, 1), input_shape=(X_train1.shape[1], X_train1.shape[2], X_train1.shape[3]), activation='relu')): Adds a 2D convolutional layer with 64 filters of size 1x1. The input shape is specified based on the reshaped data. ReLU activation is used.

  6. cnn_model.add(MaxPooling2D(pool_size=(1, 1))): Adds a max-pooling layer with a pool size of 1x1. Max-pooling normally reduces the spatial dimensions of its input, although a 1x1 pool leaves them unchanged, so here it is effectively a no-op.

  7. cnn_model.add(Convolution2D(32, (1, 1), activation='relu')): Adds another 2D convolutional layer with 32 filters of size 1x1. ReLU activation is used.

  8. cnn_model.add(MaxPooling2D(pool_size=(1, 1))): Adds another max-pooling layer with a pool size of 1x1.

  9. cnn_model.add(Flatten()): Flattens the output from the convolutional layers into a 1D array for input to the dense layers.

  10. cnn_model.add(Dropout(dropout_rate)): Adds a dropout layer with the specified dropout rate to prevent overfitting.

  11. cnn_model.add(Dense(units=256, activation='relu')): Adds a dense layer with 256 units and ReLU activation.

  12. cnn_model.add(Dense(units=y_train1.shape[1], activation='softmax')): Adds the output layer with units equal to the number of classes in the target variable. Softmax activation is used for multi-class classification to output class probabilities.

  13. cnn_model.compile(optimizer='Adagrad', loss='categorical_crossentropy', metrics=['accuracy']): Compiles the CNN model with the Adagrad optimizer, categorical cross-entropy loss (for multi-class classification), and accuracy as the metric to monitor during training.

  14. Training and saving the model: If the weights file 'model/cnn_weights.hdf5' does not exist, the model is trained (cnn_model.fit) for 800 epochs with a batch size of 40, validating on (X_test1, y_test1); the ModelCheckpoint callback saves the best weights, and the training history is pickled to 'model/cnn_history.pckl'. Otherwise, cnn_model.load_weights("model/cnn_weights.hdf5") loads the previously saved weights.

  15. predict = cnn_model.predict(X_test1): Performs predictions on the test data (X_test1).

  16. predict = np.argmax(predict, axis=1): Converts the predicted probabilities into class labels by selecting the class with the highest probability.

  17. testY = np.argmax(y_test1, axis=1): Converts the true labels (y_test1) from one-hot encoded format back to class labels.

  18. calculateMetrics("CNN2D", predict, testY): Calls the calculateMetrics function to evaluate the performance of the CNN model using the predicted and true labels. Since calculateMetrics also appends each model's scores to the shared metric lists, the models can now be compared side by side; a comparison sketch follows below.
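With all five models evaluated, a final sketch (assuming calculateMetrics was called exactly once per model, in the order shown in this post) comparing the accumulated accuracies:

#compare the accuracy recorded for each model
algorithms = ['Decision Tree', 'Random Forest', 'DNN', 'LSTM', 'CNN2D']
y_pos = np.arange(len(algorithms))
plt.bar(y_pos, accuracy)
plt.xticks(y_pos, algorithms, rotation=45)
plt.ylabel('Accuracy (%)')
plt.title('Model Comparison')
plt.tight_layout()
plt.show()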