Fraud Detection
Research Items
- Data Sets
- Relevant Papers
- Available Solutions
- Machine Learning
  - Pre-processing
  - Feature analysis
  - Modelling
  - Evaluation
- Suggested Solution
1. Data Sets
The first step is to find fraud data sets for modelling purposes. Unfortunately, fraud data sets are difficult to find publicly because of the confidential information they contain. Listed below are the data sets found, with notes on each.
Real world data set from Kaggle (ULB)
https://www.kaggle.com/mlg-ulb/creditcardfraud
The data set contains real credit card transactions labelled as fraudulent or genuine. Unfortunately, the column names are not meaningful because PCA has already been applied for dimensionality reduction, so it is very difficult to understand what each column represents. Nonetheless, it is real-world data. The data set covers transactions made by European cardholders over two days in September 2013, with 492 frauds out of 284,807 transactions. It is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
This data set has been analysed and modelled later in this document, with F1 scores of roughly 94-96% (see the Machine Learning section).
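A minimal loading sketch (assuming the creditcard.csv file and the Class label column as described on the Kaggle page) to confirm the columns and the class imbalance:

```python
import pandas as pd

# Load the Kaggle (ULB) credit card data set; the file name and the "Class"
# label column (1 = fraud, 0 = genuine) follow the Kaggle page.
df = pd.read_csv("creditcard.csv")

# The PCA-transformed features arrive as anonymised columns (V1..V28)
# alongside Time and Amount, so little domain interpretation is possible.
print(df.columns.tolist())

# Confirm the heavy class imbalance reported above (~0.172% fraud).
print(f"Transactions: {len(df)}, fraud rate: {df['Class'].mean():.5%}")
```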
Synthetic data set from Kaggle (NTNU)
https://www.kaggle.com/ntnu-testimon/paysim1
The data set contains synthetic (simulated) transaction data. The advantage of this data set is that PCA has not been applied, so all of the underlying information can be used. However, it has been scaled down to a quarter of the original data set.
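A similar sketch for PaySim; "paysim.csv" is a placeholder name for the CSV downloaded from Kaggle, and the column names (type, isFraud) are taken from the Kaggle page:

```python
import pandas as pd

# Load the PaySim synthetic data set ("paysim.csv" stands in for the
# downloaded Kaggle file).
df = pd.read_csv("paysim.csv")

# No PCA has been applied, so the raw columns are interpretable --
# e.g. the fraud rate can be broken down by transaction type.
print(df.groupby("type")["isFraud"].agg(["count", "mean"]))
```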
2. Papers:
Link | Title | Summary |
---|---|---|
https://www.aaai.org/Papers/KDD/1998/KDD98-026.pdf | Toward Scalable Learning with Non-uniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection | Handles skewed data sets, compares credit card fraud detection models, and evaluates how different sets of features affect the results, using a real credit card fraud data set provided by a large European card processing company (the data set above). The results show an average increase in savings of 13% when the proposed periodic features are included. Using a 50-50 fraud/non-fraud split for training leads to better models. von Mises distribution: https://en.wikipedia.org/wiki/Von_Mises_distribution |
https://www.aaai.org/Papers/Workshops/1997/WS-97-07/WS97-07-015.pdf | Credit Card Fraud Detection Using Meta-Learning: Issues and Initial Results | Besides a similar finding to the above (balanced training data gives better models), the paper discusses using metrics other than accuracy for model evaluation. |
http://journal.utem.edu.my/index.php/jtec/article/view/3571/2466 | Credit Card Fraud Detection Using Machine Learning As Data Mining Technique | Reports 95% accuracy using Naive Bayes based classifiers. |
https://www.jair.org/index.php/jair/article/view/10302/24590 | SMOTE: Synthetic Minority Over-sampling Technique | Using the SMOTE method described in the paper is another way to work around the class-imbalance (skewness) problem; see the sketch after this table. |
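Both rebalancing ideas from the table (a 50-50 undersampled training set, and SMOTE oversampling) can be sketched with the imbalanced-learn package; the features here are a synthetic stand-in for the real transaction data, so treat it as illustrative only:

```python
import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE              # pip install imbalanced-learn
from imblearn.under_sampling import RandomUnderSampler

# Synthetic stand-in for the transaction features, ~0.2% positive class.
X, y = make_classification(n_samples=20000, n_features=10, weights=[0.998],
                           random_state=42)

# Option 1: undersample genuine transactions down to a 50-50 split,
# in line with the balanced-training findings above.
X_us, y_us = RandomUnderSampler(sampling_strategy=1.0,
                                random_state=42).fit_resample(X, y)

# Option 2: SMOTE synthesises new minority-class samples instead of
# discarding majority-class data.
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X, y)

print("original:    ", np.bincount(y))
print("undersampled:", np.bincount(y_us))
print("SMOTE:       ", np.bincount(y_sm))
```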
3. Research on available solutions:
Airbnb https://medium.com/airbnb-engineering/architecting-a-machine-learning-system-for-risk-941abbba5a60
Ways to mitigate potential bad actors carrying out different types of attacks:
- Product changes: two-factor authentication (2FA), email verification, etc.
- Anomaly detection: scripted attacks tend to produce detectable anomalies
- Heuristics/machine learning models based on different factors
Framework requirements:
- Fast and robust
- Agile (fraud detection is a constant catch-up game)
PMML (Predictive Model Markup Language) encodes several common types of machine learning models; Airbnb serves these models through Openscoring.
They do not provide fraud detection as a service.
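As an illustration of the PMML workflow mentioned above (not Airbnb's actual tooling, which is not public), a minimal sketch using the open-source sklearn2pmml package, assuming it and a Java runtime for the JPMML converter are installed:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn2pmml import sklearn2pmml                  # pip install sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

# Toy stand-in for transaction features/labels.
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.95],
                           random_state=0)

# Wrap the estimator in a PMMLPipeline and fit as usual.
pipeline = PMMLPipeline([
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=0)),
])
pipeline.fit(X, y)

# Export to PMML (the converter shells out to the JPMML Java toolchain); the
# resulting file can then be loaded by a scoring engine such as Openscoring.
sklearn2pmml(pipeline, "fraud_model.pmml")
```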
PayPal
PayPal recently acquired Simility for fraud detection.
Simility looks at various session, device, and behavioral biometrics and builds a profile for what constitutes “normal” user login behavior; if an anomaly is spotted, it can act to prevent the action.
https://www.dropbox.com/s/ft3wu5ix15xukhc/Mobile%20Fintech%20Fraud.pdf?dl=0
Stripe
Even if a card is new to your business, there’s an 89% chance it’s been seen before on the Stripe network.
4. Machine Learning
- Supervised learning was applied to the PCA-transformed data set discussed in the Data Sets section.
- Rather than relying on a single algorithm, several ensemble machine learning algorithms were tested.
- Metrics such as precision, recall, and F1 score were used to evaluate the models and to better understand true positives, true negatives, false positives, and false negatives (a sketch of the evaluation loop follows below).
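The exact pre-processing and hyper-parameters behind the numbers below are not recorded here; the following is a minimal sketch of the evaluation loop, assuming the ULB creditcard.csv file and a 50-50 undersampled sample in line with the balanced-training findings above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Build a 50-50 fraud/non-fraud sample from the ULB data set (see Data Sets).
df = pd.read_csv("creditcard.csv")
fraud = df[df["Class"] == 1]
genuine = df[df["Class"] == 0].sample(len(fraud), random_state=42)
sample = pd.concat([fraud, genuine])

X = sample.drop(columns=["Class"])
y = sample["Class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)

# Ensemble classifiers under comparison.
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Bagging": BaggingClassifier(random_state=42),
    "Ada Boost Classifier": AdaBoostClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"Accuracy score for {name} : {accuracy_score(y_test, pred)}")
    print(f"Precision score {name} : {precision_score(y_test, pred)}")
    print(f"Recall score {name} : {recall_score(y_test, pred)}")
    print(f"F1 score {name} : {f1_score(y_test, pred)}")
```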
Results:
Model | Accuracy | Precision | Recall | F1 score
---|---|---|---|---
Random Forest | 0.9538 | 0.9800 | 0.9245 | 0.9515
Bagging | 0.9631 | 0.9868 | 0.9371 | 0.9613
AdaBoost | 0.9446 | 0.9796 | 0.9057 | 0.9412
False Positives vs False Negatives