Synthetic data based on the Pareto principle

Nowadays, the phrase "data is the new oil" sounds a little clichéd, but it remains true. The demand for data has grown significantly as people and businesses have realized the value of leveraging data for informed decision making. Often, though, it is not possible to find data that fits our specific needs. As an ML teacher, I have heard students say that it is genuinely challenging to find data to their specifications for the projects they want to pursue (in this case, for learning data science or for school assignments). In other cases, where confidentiality is crucial, open data is rarely an option (anonymized data exists, but that is a difficult avenue of its own). Synthetic data can help solve these issues. According to the McGraw-Hill Dictionary of Scientific and Technical Terms, synthetic data is "any production data applicable to a given situation that are not obtained by direct measurement". Synthetic, however, does not mean random: the patterns in synthetic data should mimic real-world patterns. What follows is a preliminary exercise in generating such data.

Pareto

In this post, we create a synthetic dataset of users and user transactions that demonstrates the Pareto principle. Compulsory fields include basic demographic information, email address, phone number, item number, and price.

Imports

import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
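
For reproducible runs, it helps to seed both random-number generators up front. This is an optional addition to the original notebook:

# optional: seed both the stdlib and numpy RNGs so results can be reproduced
random.seed(42)
np.random.seed(42)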

Brute Force Solution

Brute-force assumptions: every ticket is the same size, and each customer purchases only once.

Create a naive list of users

This will be modified to look more realistic soon. In a real-world case, this can be thought of as a user/client table with an incremental id.

# variables
final_dataframe_size = 1000 # the 1000 customers who have bought something
# Generate candidate user ids
users = list(range(1000, 10000))
users[:10]
[1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009]
# shuffle the original list
random.shuffle(users)
users[:10]
[8815, 2924, 6563, 3568, 3604, 6821, 1466, 9813, 3153, 6637]

Sample 1,000 random users who have transacted. This reflects the assumption that not every user will have transacted.

tran_users = random.sample(users, final_dataframe_size)
print (len(tran_users))
print (tran_users[:10])
1000
[6113, 5352, 4711, 2600, 5643, 3009, 7765, 7301, 1399, 2336]
pareto_tran_users = random.sample(tran_users, int(.2 * final_dataframe_size))
len(pareto_tran_users)
200
df = pd.DataFrame({'user': tran_users})
df.head()

user
0 6113
1 5352
2 4711
3 2600
4 5643

If the sum of all transactions is 100,000:

pareto users: 0.8 × 100,000 = 80,000 divided among 200 users, so each gets 400

non-pareto users: 0.2 × 100,000 = 20,000 divided among 800 users, so each gets 25
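
The same arithmetic as a quick check in code (an addition for illustration; the figures match the fixed amounts used below):

total = 100000
print(0.8 * total / 200)   # 400.0 per pareto user
print(0.2 * total / 800)   # 25.0 per non-pareto user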

def assign_amount(val):
    # pareto users each get the large fixed amount; everyone else the small one
    if val in pareto_tran_users:
        return 400
    return 25
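
Membership tests against a Python list are O(n) per lookup. It hardly matters at this size, but a set makes each lookup O(1); a minimal variant of the same function:

pareto_set = set(pareto_tran_users)  # O(1) membership checks

def assign_amount(val):
    return 400 if val in pareto_set else 25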
     
df['transaction'] = df.user.apply(assign_amount)
# validate pareto
df.sort_values('transaction', ascending=False).head(int(.2 * final_dataframe_size)).sum()/df.sum()
user           0.20033
transaction    0.80000
dtype: float64

Here we see 20 percent of the customers generating 80 percent of the revenue.
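
Since matplotlib is already imported, a quick cumulative-revenue curve makes the split visible. This plot is an addition to the original notebook, a minimal sketch:

# cumulative share of revenue against cumulative share of customers
shares = df.sort_values('transaction', ascending=False)['transaction'].cumsum() / df['transaction'].sum()
plt.plot(np.linspace(0, 1, len(shares)), shares.values)
plt.axvline(0.2, linestyle='--')  # 20% of customers...
plt.axhline(0.8, linestyle='--')  # ...produce 80% of revenue
plt.xlabel('cumulative share of customers')
plt.ylabel('cumulative share of revenue')
plt.show()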

Improvement 1

Make the sales figures more dynamic (instead of the fixed 25 and 400).

def sum_to_x(n, x):
    # split x into n non-negative random pieces that sum exactly to x
    values = [0.0, x] + list(np.random.uniform(low=0.0, high=x, size=n-1))
    values.sort()
    return [values[i+1] - values[i] for i in range(n)]
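
As a quick sanity check (not in the original notebook), the pieces should be non-negative and sum to x up to floating-point error:

parts = sum_to_x(5, 100.0)
print(parts)
print(sum(parts))  # ~100.0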
df2 = df.copy()
# use .copy() so the assignments below do not trigger SettingWithCopyWarning
df2_p = df2[df2.user.isin(pareto_tran_users)].copy()
df2_p_n = df2[~df2.user.isin(pareto_tran_users)].copy()
df2_p['transaction'] = sum_to_x(200, 80000)
df2_p_n['transaction'] = sum_to_x(800, 20000)
# DataFrame.append is deprecated in recent pandas; pd.concat does the same job
df2 = pd.concat([df2_p, df2_p_n])
df2

user transaction
4 5643 271.252382
16 1403 64.145236
20 2671 150.894671
27 1881 3.295457
29 4574 129.600777
... ... ...
991 7792 58.155219
992 4682 18.239724
994 5986 4.711870
996 8145 0.914484
998 4102 20.064127

1000 rows × 2 columns

df2.sum()
user           5553121.0
transaction     100000.0
dtype: float64
# validate pareto
df2.sort_values('transaction', ascending=False).head(200).sum().apply(lambda x: '%.3f' % x)
user           1136745.000
transaction      82287.864
dtype: object

The top 200 now capture roughly 82% rather than exactly 80%: because the per-user amounts are random, the 200 largest values can only match or exceed the pareto group's exact 80,000, so the share drifts slightly above 0.8.

Load some data

To make the data more realistic, we can add more features. In this case, Mockaroo was used to generate them.
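
If Mockaroo is not at hand, a minimal stand-in users table can be built directly in pandas. The values below are placeholders invented for illustration, not the data used in this post:

ids = range(1, 1001)
df_users_local = pd.DataFrame({
    'id': list(ids),
    'first_name': ['first%d' % i for i in ids],   # placeholder names
    'last_name': ['last%d' % i for i in ids],
    'email': ['user%d@example.com' % i for i in ids],
    'gender': random.choices(['Female', 'Male'], k=1000),
})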

df_users = pd.read_csv('synthetic_users.csv')
df_users.head()

id first_name last_name email gender
0 1 Pierette Deners pdeners0@1und1.de Female
1 2 Gage Opdenorth gopdenorth1@hugedomains.com Male
2 3 Raquel Gabriel rgabriel2@studiopress.com Female
3 4 Matti Sullly msullly3@soundcloud.com Female
4 5 Zita Doggart zdoggart4@ftc.gov Female

The client table seems complete. Let us create an item table whose prices follow a normal distribution.

df_items = pd.read_csv('synthetic_items_catalog.csv')
df_items.drop_duplicates(subset=['product'], keep='first', inplace=True)
df_items['id'] = df_items['id'] + 1000  # offset item ids; users already occupy 1-1000
df_items

id product
0 1001 Bread Ww Cluster
1 1002 Glove - Cutting
2 1003 Ice Cream - Super Sandwich
3 1004 Wine - Chateau Bonnet
4 1005 Ecolab - Ster Bac
... ... ...
992 1993 Table Cloth 53x53 White
993 1994 Durian Fruit
995 1996 Chinese Foods - Plain Fried Rice
996 1997 Wine - White, Ej Gallo
998 1999 Sugar - Fine

836 rows × 2 columns

prices = np.random.normal(50, 20, len(df_items))  # mean 50, std 20
df_items['prices'] = prices
df_items['prices'] = np.where(df_items['prices'] < 0, 2.0, df_items['prices'])  # clamp negative draws to 2.0
df_items['prices'] = df_items['prices'].round()
df_items.describe()

id prices
count 836.000000 836.000000
mean 1469.366029 50.340909
std 285.036373 20.400653
min 1001.000000 2.000000
25% 1220.500000 36.000000
50% 1460.500000 50.000000
75% 1710.250000 64.000000
max 1999.000000 120.000000
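
Clamping negative draws to a flat 2.0 works, but it distorts the left tail of the distribution. A distribution that is positive by construction, such as a lognormal, needs no clamp at all; a sketch of that alternative (not what this notebook uses):

# median price ~50; sigma controls the spread of the right tail
prices_alt = np.random.lognormal(mean=np.log(50), sigma=0.4, size=len(df_items))
# prices_alt is positive everywhere, so no clamping step is required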

Now we have a users table and a product table; the product prices are (clamped) normally distributed.

A quick look at what np.random.normal returns:

np.random.normal(0, 1, 10)
array([ 0.7365823 ,  0.18626006, -0.10850225, -0.8307533 ,  0.18490917,
        0.51846682, -1.20012347,  0.31953487, -0.12546108,  0.75146225])
## User sampling stays uniform (the weighted version is left commented out);
## item popularity weights are drawn from a normal distribution.
# weights_users = np.random.normal(1000, 1000, len(df_users.id))
weights_items = np.random.normal(1000, 1000, len(df_items.id))
weights_items = np.clip(weights_items, 1, None)  # random.choices needs non-negative weights; normal draws can dip below zero

# pick 268 of the 1000 users as the transacting population, then draw 10000 transactions
tran_user = random.choices(random.sample(list(df_users.id), 268), k=10000)
tran_items = random.choices(list(df_items.id), weights=weights_items, k=10000)
df = pd.DataFrame.from_dict({'user_id': tran_user, 'item_id': tran_items})\
.set_index('item_id')\
.join(df_items.set_index('id'))


df_ = df.groupby('user_id')[['prices']].sum()
df_.describe()

prices
count 268.000000
mean 1889.985075
std 343.707196
min 1161.000000
25% 1649.750000
50% 1896.500000
75% 2114.500000
max 2860.000000
# validate pareto
# regardless of the transaction table, we have our 200 top users
df_.reset_index(inplace=True)
(df_.sort_values('prices', ascending=False).head(200).sum())/df_.sum()
user_id    0.732714
prices     0.803493
dtype: float64

The top 200 spenders generate about 80% of the revenue. Only 268 users transact at all, so these 200 are roughly 75% of the active users; measured against the full base of 1,000 users, though, they are still the top 20%.

merged = df.reset_index()\
.set_index('user_id')\
.join(df_users.set_index('id'), how='outer')\
.reset_index()\
.rename(columns={'index':'product_id', 'level_0': 'user_id'})
merged

user_id product_id product prices first_name last_name email gender
0 1 1013.0 Shrimp - Black Tiger 13/15 78.0 Pierette Deners pdeners0@1und1.de Female
1 1 1025.0 Longos - Lasagna Veg 48.0 Pierette Deners pdeners0@1und1.de Female
2 1 1032.0 Wine - Pinot Grigio Collavini 54.0 Pierette Deners pdeners0@1und1.de Female
3 1 1067.0 Shiro Miso 85.0 Pierette Deners pdeners0@1und1.de Female
4 1 1073.0 Pea - Snow 69.0 Pierette Deners pdeners0@1und1.de Female
... ... ... ... ... ... ... ... ...
10727 1000 1866.0 Figs 56.0 Hillery Dickings hdickingsrr@gizmodo.com Male
10728 1000 1902.0 Wine - Harrow Estates, Vidal 62.0 Hillery Dickings hdickingsrr@gizmodo.com Male
10729 1000 1924.0 Pancetta 25.0 Hillery Dickings hdickingsrr@gizmodo.com Male
10730 1000 1933.0 Vodka - Smirnoff 23.0 Hillery Dickings hdickingsrr@gizmodo.com Male
10731 1000 1994.0 Durian Fruit 51.0 Hillery Dickings hdickingsrr@gizmodo.com Male

10732 rows × 8 columns

merged.groupby('user_id')[['prices']].sum().sort_values('prices', ascending=False).head(200).sum()/merged['prices'].sum()
prices    0.803493
dtype: float64
merged.to_csv('merged.csv', index=False)
df_users.to_csv('users.csv', index=False)
df_items.to_csv('items.csv', index=False)

df_users

id first_name last_name email gender
0 1 Pierette Deners pdeners0@1und1.de Female
1 2 Gage Opdenorth gopdenorth1@hugedomains.com Male
2 3 Raquel Gabriel rgabriel2@studiopress.com Female
3 4 Matti Sullly msullly3@soundcloud.com Female
4 5 Zita Doggart zdoggart4@ftc.gov Female
... ... ... ... ... ...
995 996 Web Beauly wbeaulyrn@drupal.org Male
996 997 Joachim Silber jsilberro@cam.ac.uk Male
997 998 Kearney Huntly khuntlyrp@hud.gov Male
998 999 Robb Eads readsrq@sfgate.com Male
999 1000 Hillery Dickings hdickingsrr@gizmodo.com Male

1000 rows × 5 columns

df_items

id product prices
0 1001 Bread Ww Cluster 55.0
1 1002 Glove - Cutting 34.0
2 1003 Ice Cream - Super Sandwich 26.0
3 1004 Wine - Chateau Bonnet 52.0
4 1005 Ecolab - Ster Bac 46.0
... ... ... ...
992 1993 Table Cloth 53x53 White 3.0
993 1994 Durian Fruit 51.0
995 1996 Chinese Foods - Plain Fried Rice 23.0
996 1997 Wine - White, Ej Gallo 57.0
998 1999 Sugar - Fine 84.0

836 rows × 3 columns