Microsoft Malware Prediction
Are you excited about starting a machine learning project but need guidance on how to proceed?
Then this beginner-friendly project is perfect for you. It offers a detailed step-by-step guide and uses the Microsoft Malware Prediction dataset from Kaggle. To keep memory usage manageable, I extracted around 500,000 rows from the original dataset, which contains over 8 million rows and 60+ columns. Read on to learn how I executed this project.
Step 1: Downloading the Dataset
To download the dataset into a Colab notebook, you can use the opendatasets library from Jovian, which simplifies the download process. After importing the library and providing your Kaggle credentials, the dataset downloads directly into the notebook environment. You can then load it with pandas, specifying the number of rows to read.
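A minimal sketch of this step is shown below; the dataset URL and file name are placeholders, to be replaced with those of the actual Kaggle dataset:

```python
import opendatasets as od
import pandas as pd

# Prompts for your Kaggle username and API key the first time it runs,
# then downloads the dataset into the current directory.
od.download('https://www.kaggle.com/datasets/<user>/<malware-dataset>')

# Load a fixed number of rows to keep memory usage manageable.
df = pd.read_csv('./<malware-dataset>/train.csv', nrows=500_000)
print(df.shape)
```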
Step 2: Minimizing Storage
Integer columns stored as int16, int32, or int64 can be downcast to smaller integer types (as small as int8) whenever their value range allows, and float64 columns can be downcast to float32. Downcasting this way can cut the dataset's memory footprint substantially.
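Here is a minimal sketch of this downcasting, using pandas' to_numeric with the downcast option so each integer column shrinks only as far as its value range safely allows:

```python
import numpy as np
import pandas as pd

def reduce_memory(df):
    """Downcast numeric columns to smaller dtypes where values permit."""
    for col in df.select_dtypes(include=['int16', 'int32', 'int64']).columns:
        # Picks the smallest integer dtype (down to int8) that fits the data.
        df[col] = pd.to_numeric(df[col], downcast='integer')
    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = df[col].astype(np.float32)
    return df

df = reduce_memory(df)
print(f"{df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
```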
Step 3: Data Cleaning and Preprocessing
Data cleaning is crucial in machine learning to ensure the accuracy and reliability of the models. It helps remove errors, outliers, and irrelevant data, leading to more accurate predictions. By addressing biases and ensuring data consistency, data cleaning improves the fairness and effectiveness of the models. Additionally, it handles missing values and simplifies the model, preventing overfitting and enhancing generalization on unseen data.
To begin, a summary data frame is created with one row per column of the dataset, recording statistics such as the number of unique values, the percentage of missing values, and the percentage of rows taken up by the most common value. This summary guides the cleaning and makes it easy to spot columns worth removing. In this case, columns with more than 95% missing values and columns where a single value accounts for more than 90% of rows are dropped, leaving a reduced dataset of 54 columns for further analysis.
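A sketch of how such a summary table might be built and used, with the 95%/90% thresholds taken from the description above:

```python
import pandas as pd

# One row of statistics per column of the dataset.
summary = pd.DataFrame({
    'unique_values': df.nunique(),
    'pct_missing': df.isna().mean() * 100,
    'pct_most_common': df.apply(
        lambda s: s.value_counts(normalize=True, dropna=False).iloc[0] * 100
    ),
})

# Drop columns that are mostly empty or mostly a single value.
to_drop = summary[(summary['pct_missing'] > 95) |
                  (summary['pct_most_common'] > 90)].index
df = df.drop(columns=to_drop)
```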
Next, null values are handled column by column. For columns where nulls exceed 3% of the rows, the missing values are filled with the mean, median, mode, or the string 'unknown', whichever is appropriate for that column. For the remaining columns, any rows containing null values are dropped entirely.
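The column names below are hypothetical, but they illustrate the per-column filling strategy described above:

```python
# Numeric columns: fill with a central value (median is robust to outliers).
df['numeric_col'] = df['numeric_col'].fillna(df['numeric_col'].median())
# Categorical columns: fill with the most frequent value.
df['cat_col'] = df['cat_col'].fillna(df['cat_col'].mode()[0])
# Identifier-like columns: fill with an explicit 'unknown' marker.
df['id_like_col'] = df['id_like_col'].fillna('unknown')

# For columns below the 3% threshold, drop the few rows that are still null.
df = df.dropna()
```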
Data Preprocessing
In the next step, the data type of each column is examined and optimized. Columns with fewer than 20 unique values are converted to the category dtype, columns that represent identity numbers or similar identifiers are kept as objects, and the remaining columns are converted to numeric dtypes. This ensures each column is represented by the most suitable data type, optimizing the data structure and improving computational efficiency.
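A sketch of this rule, assuming 'MachineIdentifier' as the identifier column (it is the row ID in the Kaggle dataset) and leaving the target column untouched:

```python
import pandas as pd

id_cols = ['MachineIdentifier']

for col in df.columns:
    if col in id_cols:
        df[col] = df[col].astype('object')    # identifiers stay as strings
    elif col == 'HasDetections':
        continue                              # keep the target as-is
    elif df[col].nunique() < 20:
        df[col] = df[col].astype('category')  # low-cardinality -> category
    else:
        # Everything else becomes numeric; unparsable values turn into NaN.
        df[col] = pd.to_numeric(df[col], errors='coerce')
```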
Step 4: Exploratory Data Analysis
Analyzing the data in relation to the target column is a crucial step in machine learning. In this project, I performed this analysis by creating various graphs with the Python libraries Seaborn and Matplotlib.
These graphs provide visual representations of how different columns are interconnected. Visualizing the data in this way gives insight into the relationships and patterns within the dataset, which aids feature selection, highlights important variables, and builds an understanding of the overall data structure.
Some of the insights, along with the full set of charts, can be found in the project notebook; one example of this kind of plot is sketched below.
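This hypothetical count plot splits a feature by the target column ('HasDetections' is this dataset's target; 'Processor' is one illustrative feature):

```python
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 4))
# Bars per processor architecture, split by whether malware was detected.
sns.countplot(data=df, x='Processor', hue='HasDetections')
plt.title('Detections by processor architecture')
plt.show()
```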
Step 5: Feature Engineering
Feature engineering is a crucial step in machine learning projects, where raw data is transformed into a format suitable for training models. Here, transformations such as scaling numerical features and one-hot encoding categorical ones are performed. By engineering features effectively, we can extract valuable information and uncover patterns within the data, leading to better model performance and more accurate predictions.
To split the data, the train_test_split function from scikit-learn is used, which divides the dataset into training and testing sets. The input and target columns are then separated, and new data frames are created accordingly.
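A sketch of the split, assuming 'HasDetections' as the target; the 80/20 ratio and random seed are illustrative choices:

```python
from sklearn.model_selection import train_test_split

target_col = 'HasDetections'
input_cols = [c for c in df.columns if c not in (target_col, 'MachineIdentifier')]

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# .copy() avoids chained-assignment warnings when we transform these later.
train_inputs, train_targets = train_df[input_cols].copy(), train_df[target_col]
test_inputs, test_targets = test_df[input_cols].copy(), test_df[target_col]
```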
After the data is split, scaling is applied to the numerical features. This process involves transforming each numerical feature to a range of 0 to 1. Scaling helps ensure that all numerical features have a consistent scale and prevents features with larger values from dominating the model.
Furthermore, one-hot encoding is performed on categorical features. This encoding technique converts categorical variables into a numerical format. It creates a new column for each unique value in a specific column, representing the presence or absence of that value in the original column. This enables the model to interpret categorical data as numeric, facilitating further analysis and modelling.
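Both transformations together might look like the sketch below, fitting only on the training data and reusing the fitted transformers on the test set (sparse_output is named sparse in older scikit-learn versions):

```python
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

numeric_cols = train_inputs.select_dtypes(include='number').columns.tolist()
categorical_cols = train_inputs.select_dtypes(include='category').columns.tolist()

# Scale each numeric feature to the [0, 1] range.
scaler = MinMaxScaler().fit(train_inputs[numeric_cols])
train_inputs[numeric_cols] = scaler.transform(train_inputs[numeric_cols])
test_inputs[numeric_cols] = scaler.transform(test_inputs[numeric_cols])

# One new column per unique category value; unseen test categories are ignored.
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder.fit(train_inputs[categorical_cols])
encoded_cols = encoder.get_feature_names_out(categorical_cols).tolist()
train_inputs[encoded_cols] = encoder.transform(train_inputs[categorical_cols])
test_inputs[encoded_cols] = encoder.transform(test_inputs[categorical_cols])
```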
Step 6: Machine Learning Models
Baseline Models
In order to establish a baseline for comparison, two models are created. The first baseline model predicts the most frequently occurring outcome as the prediction for all instances. This provides a benchmark to evaluate the performance of more advanced models.
Additionally, a second baseline model is developed that guesses at random. Since this is a binary classification problem, the random-guess model assigns either 0 or 1 as the predicted outcome for each instance. This model helps determine whether the other models perform better than random chance.
By comparing the performance of the actual models against these baseline models, you can assess the effectiveness and improvement achieved by the more sophisticated machine learning models.
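A sketch of both baselines, reusing the split from the previous step:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Baseline 1: always predict the most frequent class in the training targets.
majority_class = train_targets.mode()[0]
majority_preds = np.full(len(test_targets), majority_class)
print('Majority baseline:', accuracy_score(test_targets, majority_preds))

# Baseline 2: guess 0 or 1 uniformly at random.
rng = np.random.default_rng(42)
random_preds = rng.integers(0, 2, size=len(test_targets))
print('Random baseline:', accuracy_score(test_targets, random_preds))
```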
Logistic Regression Model
To surpass the performance of the baseline models, logistic regression from the scikit-learn library is employed. Logistic regression is a commonly used algorithm for binary classification tasks. By fitting the logistic regression model to the training data, it learns the underlying patterns and relationships within the dataset.
After training the model, its performance is evaluated on the test set. In this case, an accuracy score of 62.28% is obtained. This score reflects the proportion of correctly predicted outcomes by the logistic regression model.
Comparing this accuracy score with the performance of the baseline models allows you to assess the improvement achieved by the logistic regression model.
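A sketch of the model, trained on the scaled numeric and one-hot-encoded columns from the previous step (the raised max_iter is an assumption to help convergence on wide encoded data, not a setting from the original notebook):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X_train = train_inputs[numeric_cols + encoded_cols]
X_test = test_inputs[numeric_cols + encoded_cols]

model = LogisticRegression(max_iter=1000)  # default is 100, often too few here
model.fit(X_train, train_targets)

test_preds = model.predict(X_test)
print('Test accuracy:', accuracy_score(test_targets, test_preds))
```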
XGBoost Model
XGBoost is a highly effective and efficient machine learning algorithm widely utilized for creating strong ensemble models. By combining multiple weak prediction models, typically decision trees, XGBoost achieves enhanced accuracy and mitigates overfitting issues through gradient boosting and regularization techniques.
Applying the XGBoost model to the dataset yielded an accuracy of 62.16%, the proportion of correctly predicted outcomes on the test set. Notably, with default settings this is slightly below the logistic regression score, which motivates the hyperparameter tuning below.
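With the same feature matrices, a default XGBoost classifier might be trained like this (n_jobs and random_state are assumptions added for speed and reproducibility):

```python
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

model = XGBClassifier(n_jobs=-1, random_state=42)
model.fit(X_train, train_targets)

test_preds = model.predict(X_test)
print('Test accuracy:', accuracy_score(test_targets, test_preds))
```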
Hyperparameter Tuning
Just like other machine learning models, XGBoost has several hyperparameters we can tune to adjust the capacity of the model and reduce overfitting.
We employed the XGBoost classifier with the following hyperparameters:
- max_depth: 5
- learning_rate: 0.1
- alpha: 10
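A sketch with these three hyperparameters set (everything else left at its default; alpha is XGBoost's L1 regularization term):

```python
from xgboost import XGBClassifier

tuned_model = XGBClassifier(max_depth=5, learning_rate=0.1, alpha=10)
tuned_model.fit(X_train, train_targets)

# score() reports plain accuracy for classifiers.
print('Train accuracy:', tuned_model.score(X_train, train_targets))
print('Validation accuracy:', tuned_model.score(X_test, test_targets))
```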
The model achieved a training accuracy of 0.65, indicating its ability to capture patterns in the training data, and a validation accuracy of 0.63. The small gap between the two suggests the tuned model generalizes reasonably well without severe overfitting.
Conclusion
In conclusion, completing a Machine Learning Project is an exciting journey that involves data exploration, preprocessing, model development, and evaluation. Throughout this project, I have gained valuable insights into the dataset, applied various techniques to clean and transform the data, and implemented powerful machine learning algorithms like XGBoost.
By following a step-by-step approach and utilizing libraries such as scikit-learn and pandas, I was able to build and evaluate models, surpassing the baseline performance and achieving an accuracy of 62.16% with the XGBoost classifier.
Undertaking this project has not only deepened my understanding of machine learning concepts but also equipped me with practical experience in tackling real-world data analysis tasks. I look forward to applying this knowledge in future projects and exploring more advanced techniques in the field of machine learning.
I hope this article has provided a valuable guide for beginners embarking on their own machine-learning projects. By following the outlined steps and incorporating your own creativity and problem-solving skills, you can embark on a rewarding journey of developing powerful machine-learning models and extracting meaningful insights from data.
Thank you for joining me on this project, and I wish you the best of luck in your own machine-learning endeavours!
Do not forget to check out this notebook for the complete project.