Fraud Detection Project

Kwnstantinos Lambrou
Sep 4, 2024

Updated: Jan 21

Project Overview

Welcome to my project on using supervised learning for fraud detection in credit card transactions. The project addresses the challenge of identifying fraudulent activity within a highly imbalanced dataset, so that suspicious transactions can be flagged and prevented.

Project Goals

Detect and Prevent Fraud: Leverage machine learning to identify and prevent potential fraudulent transactions.
Handle Imbalanced Data: Implement techniques to effectively work with highly skewed datasets where fraudulent transactions are significantly fewer than legitimate ones.
Enhance Security Measures: Provide a robust model that enhances security measures for financial transactions.

Technical Description

The project uses a dataset of credit card transactions labeled as either normal or fraudulent. For confidentiality, most of the features describing each transaction are the output of a PCA transformation; the dataset also includes the 'Time' and 'Amount' of each transaction.
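To make the layout concrete, here is a minimal sketch that builds a small synthetic table with the same general shape and runs two basic checks on it: missing values and class balance. The data is randomly generated, not the project's dataset, and the column names `V1` to `V5` and `Class`, the row counts, and the helper `make_synthetic_transactions` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def make_synthetic_transactions(n_legit=2000, n_fraud=20, n_components=5, seed=0):
    """Build a synthetic stand-in for the transactions table (not real data)."""
    rng = np.random.default_rng(seed)
    n = n_legit + n_fraud
    # 0 = legitimate, 1 = fraudulent (assumed label encoding)
    labels = np.r_[np.zeros(n_legit, dtype=int), np.ones(n_fraud, dtype=int)]

    data = {"Time": rng.uniform(0, 172800, size=n)}
    for i in range(1, n_components + 1):
        # Stand-ins for PCA components; fraud rows are shifted so the classes differ
        data[f"V{i}"] = rng.normal(loc=labels * 2.0, scale=1.0)
    data["Amount"] = rng.exponential(scale=80.0, size=n).round(2)
    data["Class"] = labels
    return pd.DataFrame(data)


df = make_synthetic_transactions()

# Total number of missing cells across the whole table
missing_total = int(df.isnull().sum().sum())

# Rows per class, which exposes the imbalance
class_counts = df["Class"].value_counts()

# Summary statistics of the amount, split by class
summary_by_class = df.groupby("Class")["Amount"].describe()

print("missing values:", missing_total)
print(class_counts)
print(summary_by_class)
```

With real data, the same three checks would run on the loaded table instead of the synthetic one.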

Data Preprocessing: Initial steps include loading the data, inspecting its structure, and checking for missing values. The dataset turns out to be highly imbalanced, with legitimate transactions far outnumbering fraudulent ones.
Exploratory Data Analysis (EDA): Count plots make the class imbalance visible. Summary statistics are generated to compare the properties of fraudulent and legitimate transactions.
Data Sampling: To address the imbalance, the legitimate class is under-sampled. This produces a subset with equal numbers of fraudulent and legitimate transactions, which is used for model training.
Model Building and Evaluation:
- Logistic Regression Model: A logistic regression model is chosen for its efficiency and interpretability. The model is trained on the balanced dataset.
- Performance Evaluation: Model performance is evaluated using metrics such as accuracy, precision, recall, and F1-score. Additionally, a confusion matrix is employed to visualize the model's performance in classifying the transactions.
Results:
- The logistic regression model achieves an accuracy of approximately 95.4% in distinguishing between fraudulent and legitimate transactions.
- The precision and recall values suggest that the model is reliable for both classes, with slightly better results on fraudulent transactions.
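
The sampling, training, and evaluation steps above can be sketched roughly as follows. This is an illustrative outline using scikit-learn on randomly generated data, not the notebook's actual code: the column names, the `undersample` helper, the 80/20 split, and all parameter values are assumptions, and the numbers it prints will not match the results reported above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split


def undersample(df, label_col="Class", seed=0):
    """Randomly keep as many rows of each class as the rarest class has."""
    n_minority = int(df[label_col].value_counts().min())
    parts = [
        group.sample(n=n_minority, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    # Shuffle so the two classes are interleaved rather than stacked
    return pd.concat(parts).sample(frac=1.0, random_state=seed).reset_index(drop=True)


# Synthetic, imbalanced stand-in data (assumed: 0 = legitimate, 1 = fraud)
rng = np.random.default_rng(42)
n_legit, n_fraud = 5000, 100
labels = np.r_[np.zeros(n_legit, dtype=int), np.ones(n_fraud, dtype=int)]
features = rng.normal(loc=labels[:, None] * 2.0, scale=1.0, size=(labels.size, 4))
df = pd.DataFrame(features, columns=[f"V{i}" for i in range(1, 5)])
df["Class"] = labels

# Balance the classes, then hold out part of the balanced subset for testing
balanced = undersample(df)
X = balanced.drop(columns="Class")
y = balanced["Class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class

print(f"accuracy: {accuracy:.3f}")
print(cm)
print(classification_report(y_test, y_pred, target_names=["legitimate", "fraud"]))
```

In this sketch the test split is taken from the balanced subset, so the metrics describe performance on balanced data. Scoring the same model on a hold-out set with the original class proportions would give a better picture of how it behaves in production.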

Conclusion

This project demonstrates the capability of machine learning models to identify fraudulent transactions within highly imbalanced datasets. It highlights the importance of choosing the right sampling techniques and metrics for evaluating model performance in real-world scenarios.
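
As a small illustration of why the choice of metric matters on imbalanced data, the sketch below scores a trivial "classifier" that labels every transaction as legitimate. The class counts are hypothetical and chosen only to show the effect: accuracy looks excellent while recall on fraud is zero.

```python
# Hypothetical class counts, for illustration only
n_legit, n_fraud = 9980, 20

y_true = [0] * n_legit + [1] * n_fraud   # 0 = legitimate, 1 = fraud
y_pred = [0] * len(y_true)               # always predict "legitimate"

# Confusion-matrix cells with fraud as the positive class
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)                              # share of fraud cases caught
precision = tp / (tp + fp) if (tp + fp) else 0.0     # undefined here, reported as 0

print(f"accuracy={accuracy:.3f} recall={recall:.3f} precision={precision:.3f}")
```

This is why precision, recall, and F1-score are reported alongside accuracy, and why the training data is balanced before fitting.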

View the Code

For a detailed look at the code and to explore the Jupyter notebook, please visit my GitHub repository.