Fraud Detection

Research Items

  • Data Sets
  • Relevant Papers
  • Available Solutions
  • Machine learning
    • Pre-processing
    • Features analysis
    • Modelling
    • Evaluation
  • Suggested Solution

1. Data Sets

The first step is to find fraud data sets for modelling purposes. Unfortunately, public fraud data sets are hard to find because of the confidential information they contain. The data sets found so far are listed below, with notes on each.

Real world data set from Kaggle (ULB)

https://www.kaggle.com/mlg-ulb/creditcardfraud

The data set contains credit card transactions labelled as fraudulent or genuine. Unfortunately, the column names carry no meaning because PCA has been applied for dimensionality reduction, so it is very difficult to understand what each column represents. Nonetheless, it is real-world data. The data set contains transactions made with credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The data set is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
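The imbalance figures above can be checked with a few lines of arithmetic. The sketch below uses only the two counts quoted in the description; the variable names are illustrative. It also shows why plain accuracy is misleading on this data set: a model that labels every transaction as genuine would be right more than 99.8% of the time while catching no fraud at all.

```python
# Counts quoted in the data set description above.
total_transactions = 284_807
fraud_transactions = 492

fraud_rate = fraud_transactions / total_transactions
print(f"Fraud rate: {fraud_rate:.5%}")

# Accuracy of a trivial model that predicts "genuine" for every transaction.
baseline_accuracy = (total_transactions - fraud_transactions) / total_transactions
print(f"Accuracy of always predicting genuine: {baseline_accuracy:.3%}")

# Recall of that same model on the fraud class: it never flags a fraud.
baseline_recall = 0 / fraud_transactions
print(f"Fraud recall of that model: {baseline_recall:.0%}")
```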

This data set is analysed and modelled later in this document, with F1 scores of roughly 94% to 96% (see the Machine Learning section).

Synthetic data set from Kaggle (NTNU)

https://www.kaggle.com/ntnu-testimon/paysim1

The data set contains synthetic (generated) transaction data. The advantage of this data set is that PCA has not been applied beforehand, so all of the original features remain available. However, it is scaled down to one quarter of the original data set.

2. Papers:

Link: https://www.aaai.org/Papers/KDD/1998/KDD98-026.pdf
Title: Toward Scalable Learning with Non-uniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection
Summary: Covers handling skewed data sets, compares credit card fraud detection models, and evaluates how different sets of features affect the results, using a real credit card fraud data set provided by a large European card processing company (the data set above). The results show an average increase in savings of 13% when the proposed periodic features are included in the methods. Using a 50-50 split of fraud and non-fraud transactions leads to better models. von Mises distribution: https://en.wikipedia.org/wiki/Von_Mises_distribution

Link: https://www.aaai.org/Papers/Workshops/1997/WS-97-07/WS97-07-015.pdf
Title: Credit Card Fraud Detection Using Meta-Learning: Issues and Initial Results
Summary: In addition to the same finding as above (balanced training data works better), the paper discusses using metrics other than accuracy for model evaluation.

Link: http://journal.utem.edu.my/index.php/jtec/article/view/3571/2466
Title: Credit Card Fraud Detection Using Machine Learning As Data Mining Technique
Summary: Reports 95% accuracy with classifiers derived from Naive Bayes.

Link: https://www.jair.org/index.php/jair/article/view/10302/24590
Title: SMOTE: Synthetic Minority Over-sampling Technique
Summary: The SMOTE method described in the paper is another way to work around the class skew problem.
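The two rebalancing ideas from the papers above (a 50-50 training split and SMOTE-style oversampling) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the papers' implementations: the function names, the toy two-feature data, and the choice of k = 3 neighbours are all hypothetical, and a real project would more likely use an existing library implementation.

```python
import math
import random

def undersample_to_balance(rows, labels, seed=0):
    """Randomly drop majority-class rows until both classes have equal counts."""
    rng = random.Random(seed)
    fraud = [i for i, y in enumerate(labels) if y == 1]
    genuine = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = sorted((fraud, genuine), key=len)
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [rows[i] for i in kept], [labels[i] for i in kept]

def smote_like_oversample(minority_rows, n_new, k=3, seed=0):
    """Create n_new synthetic rows, each lying on the segment between a
    minority row and one of its k nearest minority neighbours (the core
    SMOTE idea). Needs at least two minority rows."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority_rows)
        neighbours = sorted(
            (r for r in minority_rows if r is not base),
            key=lambda r: math.dist(base, r),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic

# Toy data (hypothetical): 95 genuine and 5 fraudulent two-feature rows.
data_rng = random.Random(42)
genuine = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(95)]
fraud = [[data_rng.gauss(3, 1), data_rng.gauss(3, 1)] for _ in range(5)]
rows = genuine + fraud
labels = [0] * 95 + [1] * 5

# Option 1: shrink the majority class to get a 50-50 split.
bal_rows, bal_labels = undersample_to_balance(rows, labels)
print(len(bal_labels), "rows after undersampling,", sum(bal_labels), "of them fraud")

# Option 2: grow the minority class with synthetic fraud rows.
new_fraud = smote_like_oversample(fraud, n_new=90)
print(len(fraud) + len(new_fraud), "fraud rows after oversampling")
```

Undersampling discards most of the genuine transactions, while oversampling keeps them all at the cost of training on synthetic frauds. In either case the rebalancing should be applied to the training split only, so that evaluation still reflects the real class ratio.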

3. Research on available solutions:

Airbnb https://medium.com/airbnb-engineering/architecting-a-machine-learning-system-for-risk-941abbba5a60

Ways to keep potential bad actors from carrying out different types of attacks:

  • Product changes: 2FA, email verification, etc.
  • Anomaly detection: scripted attacks tend to produce anomalies that can be detected
  • Heuristics or machine learning models based on different factors
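As a rough illustration of the anomaly detection item, the sketch below flags hours whose event count is far above a historical baseline, which is the kind of spike a scripted attack might produce. It is not taken from the Airbnb article: the function name, the signup counts, and the threshold of three standard deviations are all assumptions made for the example.

```python
from statistics import mean, stdev

def flag_spikes(history, recent, threshold=3.0):
    """Return the indices in `recent` whose value lies more than `threshold`
    standard deviations above the mean of `history`."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return []  # a flat history gives no usable scale for a z-score
    return [i for i, count in enumerate(recent) if (count - mu) / sigma > threshold]

# Hypothetical signups per hour during normal operation.
history = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]
# Hypothetical counts for the most recent hours; the third one is a burst.
recent = [14, 13, 95, 15]

print(flag_spikes(history, recent))
```

A fixed z-score threshold is only a starting point. Traffic with strong daily or weekly patterns would need a baseline that accounts for them.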

Framework

  • Fast and robust
  • Agile (fraud detection is a constant game of catch-up)

PMML (Predictive Model Markup Language): encodes several common types of machine learning models
Openscoring: a service for scoring models encoded in PMML

Airbnb does not provide fraud detection as a service.

PayPal

https://venturebeat.com/2018/06/21/paypal-to-acquire-machine-learning-powered-fraud-detection-startup-simility/

PayPal recently acquired Simility for fraud detection.

Simility looks at various session, device, and behavioural biometrics and builds a profile of what constitutes “normal” user login behaviour; if an anomaly is spotted, it can act to prevent the action.

https://www.dropbox.com/s/ft3wu5ix15xukhc/Mobile%20Fintech%20Fraud.pdf?dl=0

Stripe

https://stripe.com/us/radar

Even if a card is new to your business, there’s an 89% chance it’s been seen before on the Stripe network.

4. Machine Learning

  • Supervised learning was applied to the PCA data set discussed in the data sets section.
  • Different ensemble machine learning algorithms were tested, rather than using one particular algorithm for modelling.
  • Metrics such as precision, recall, and F1 score were used to evaluate the models and to better understand the balance of true positives, true negatives, false positives, and false negatives.
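The relationship between these metrics and the four confusion-matrix counts can be sketched as follows. The counts used here are an assumption: they were chosen because they reproduce the Random Forest scores reported in the results, but the notebook's actual confusion matrix is not shown in this document. The function name is also illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Of the transactions flagged as fraud, the share that really were fraud.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Of the actual frauds, the share that were flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts consistent with the reported Random Forest scores
# (an assumption, not taken from the notebook).
metrics = classification_metrics(tp=147, fp=3, fn=12, tn=163)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under these assumed counts, 12 frauds are missed (false negatives) while only 3 genuine transactions are wrongly flagged (false positives), which is why recall is lower than precision.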

Link to complete notebook

Results:

Model           Accuracy  Precision  Recall  F1
Random Forest   0.9538    0.9800     0.9245  0.9515
Bagging         0.9631    0.9868     0.9371  0.9613
AdaBoost        0.9446    0.9796     0.9057  0.9412

False Positives vs False Negatives