Synthetic Data Based on the Pareto Principle
Nowadays, the phrase “data is the new oil” sounds a little clichéd, but it is true: the demand for data has grown significantly as people and businesses have realized the possibilities of leveraging data for informed decision making. Often, though, it is not possible to get data tailored to our specific needs. As an ML teacher, I have heard my students say that it is genuinely challenging to find data to their specifications for many of the projects they want to pursue (in this case, for learning data science or for school projects). In other cases, where confidentiality is crucial, the chance of finding open data is minimal (anonymized data exists, but that is another difficult avenue). Synthetic data can help solve these issues. According to the McGraw-Hill Dictionary of Scientific and Technical Terms, synthetic data is “any production data applicable to a given situation that are not obtained by direct measurement”. However, synthetic does not imply random: the patterns in synthetic data should mimic real-world patterns. What follows is a preliminary exercise in doing just that.
The task: create a synthetic dataset of users and user transactions that demonstrates the Pareto principle. Compulsory fields include basic demographic information, email address, phone number, item number, and price.
Imports
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
Brute Force Solution
Brute-force assumptions: every ticket is the same size, and each customer purchases only once.
Create a naive list for users
This will be made to look more realistic shortly. In a real-world case, this can be thought of as a user/client table with an incremental ID.
# variables
final_dataframe_size = 1000  # the 1,000 customers who have bought something

# Generate candidate user IDs
users = list(range(1000, 10000))
users[:10]
[1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009]
# shuffle the original list
random.shuffle(users)
users[:10]
[8815, 2924, 6563, 3568, 3604, 6821, 1466, 9813, 3153, 6637]
Sample 1,000 random users who have transacted; we assume that not every user in the base table will have transacted.
tran_users = random.sample(users, final_dataframe_size)
print(len(tran_users))
print(tran_users[:10])
1000
[6113, 5352, 4711, 2600, 5643, 3009, 7765, 7301, 1399, 2336]
# pick 20% of the transacting users as the "Pareto" users
pareto_tran_users = random.sample(tran_users, int(.2 * final_dataframe_size))
len(pareto_tran_users)
200
df = pd.DataFrame({'user': tran_users})
df.head()
 | user |
---|---|
0 | 6113 |
1 | 5352 |
2 | 4711 |
3 | 2600 |
4 | 5643 |
If the sum of all transactions is 100,000, the split works out as follows (sanity-checked in the snippet below):

- Pareto users: 0.8 × 100,000 = 80,000, divided among 200 users → 400 each
- non-Pareto users: 0.2 × 100,000 = 20,000, divided among 800 users → 25 each
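A quick sanity check of this arithmetic (the variable names here are just for illustration):

total_revenue = 100_000
per_pareto = 0.8 * total_revenue / 200      # 400.0 per Pareto user
per_non_pareto = 0.2 * total_revenue / 800  # 25.0 per non-Pareto user
print(per_pareto, per_non_pareto)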
pareto_set = set(pareto_tran_users)  # set gives O(1) membership checks

def assign_amount(val):
    if val in pareto_set:
        return 400  # Pareto users get the large fixed ticket
    return 25       # everyone else gets the small one
df['transaction'] = df.user.apply(assign_amount)
# validate pareto
df.sort_values('transaction', ascending=False).head(int(.2 * final_dataframe_size)).sum()/df.sum()
user 0.20033
transaction 0.80000
dtype: float64
Here we see the top 20 percent of users generating 80 percent of the revenue (the user row is just a ratio of summed IDs and can be ignored).
Improvement 1
Make the sales figures more dynamic (instead of the fixed 25 and 400).
def sum_to_x(n, x):
    """Return n non-negative random values that sum to x."""
    values = [0.0, x] + list(np.random.uniform(low=0.0, high=x, size=n - 1))
    values.sort()
    return [values[i + 1] - values[i] for i in range(n)]
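The idea behind sum_to_x is that the n − 1 uniform cut points, together with the endpoints 0 and x, partition [0, x] into n segments whose widths are non-negative and sum to x. A quick check:

parts = sum_to_x(5, 100)
print(parts)       # five non-negative floats
print(sum(parts))  # ~100.0, up to floating-point error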
df2 = df.copy()
# .copy() on the slices avoids pandas' SettingWithCopyWarning
df2_p = df2[df2.user.isin(pareto_tran_users)].copy()
df2_p_n = df2[~df2.user.isin(pareto_tran_users)].copy()

# spread 80% of revenue randomly over the 200 Pareto users, 20% over the other 800
df2_p['transaction'] = sum_to_x(200, 80000)
df2_p_n['transaction'] = sum_to_x(800, 20000)
df2 = pd.concat([df2_p, df2_p_n])  # DataFrame.append was removed in pandas 2.0
df2
 | user | transaction |
---|---|---|
4 | 5643 | 271.252382 |
16 | 1403 | 64.145236 |
20 | 2671 | 150.894671 |
27 | 1881 | 3.295457 |
29 | 4574 | 129.600777 |
... | ... | ... |
991 | 7792 | 58.155219 |
992 | 4682 | 18.239724 |
994 | 5986 | 4.711870 |
996 | 8145 | 0.914484 |
998 | 4102 | 20.064127 |
1000 rows × 2 columns
df2.sum()
user 5553121.0
transaction 100000.0
dtype: float64
# validate pareto
df2.sort_values('transaction', ascending=False).head(200).sum().apply(lambda x: '%.3f' % x)
user 1136745.000
transaction 82287.864
dtype: object

Because the per-user amounts are now random, the top 200 users capture roughly 82 percent of revenue rather than exactly 80.
Load some data
To make the data more realistic, we can add more features. In this case, Mockaroo was used to generate the user features.
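If you would rather generate these fields in Python than with an external service, the third-party Faker package can produce comparable columns (including the phone number the task asks for). A minimal sketch, assuming faker is installed (pip install Faker); df_users_alt is a hypothetical name, not used below:

from faker import Faker

fake = Faker()
n = 1000
df_users_alt = pd.DataFrame({
    'id': range(1, n + 1),
    'first_name': [fake.first_name() for _ in range(n)],
    'last_name': [fake.last_name() for _ in range(n)],
    'email': [fake.email() for _ in range(n)],
    'phone': [fake.phone_number() for _ in range(n)],
    'gender': [random.choice(['Female', 'Male']) for _ in range(n)],
})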
df_users = pd.read_csv('synthetic_users.csv')
df_users.head()
 | id | first_name | last_name | email | gender |
---|---|---|---|---|---|
0 | 1 | Pierette | Deners | pdeners0@1und1.de | Female |
1 | 2 | Gage | Opdenorth | gopdenorth1@hugedomains.com | Male |
2 | 3 | Raquel | Gabriel | rgabriel2@studiopress.com | Female |
3 | 4 | Matti | Sullly | msullly3@soundcloud.com | Female |
4 | 5 | Zita | Doggart | zdoggart4@ftc.gov | Female |
The client table now seems complete. Let us create an item table whose prices follow a normal distribution.
df_items = pd.read_csv('synthetic_items_catalog.csv')
# keep one row per distinct product name
df_items.drop_duplicates(subset=['product'], keep='first', inplace=True)
# offset item IDs by 1000 so they don't collide with user IDs
df_items['id'] = df_items['id'] + 1000
df_items
 | id | product |
---|---|---|
0 | 1001 | Bread Ww Cluster |
1 | 1002 | Glove - Cutting |
2 | 1003 | Ice Cream - Super Sandwich |
3 | 1004 | Wine - Chateau Bonnet |
4 | 1005 | Ecolab - Ster Bac |
... | ... | ... |
992 | 1993 | Table Cloth 53x53 White |
993 | 1994 | Durian Fruit |
995 | 1996 | Chinese Foods - Plain Fried Rice |
996 | 1997 | Wine - White, Ej Gallo |
998 | 1999 | Sugar - Fine |
836 rows × 2 columns
# draw prices from a normal distribution (mean 50, std 20), one per item
prices = np.random.normal(50, 20, len(df_items))
df_items['prices'] = prices
# normal draws can be negative; clamp those to a floor price of 2.0
df_items['prices'] = np.where(df_items['prices'] < 0, 2.0, df_items['prices'])
df_items['prices'] = df_items['prices'].round()
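Since matplotlib is already imported but unused so far, a quick histogram is a cheap way to eyeball the price distribution after the negative draws have been clamped:

plt.hist(df_items['prices'], bins=30)
plt.xlabel('price')
plt.ylabel('number of items')
plt.title('Item prices: clamped normal(50, 20)')
plt.show()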
df_items.describe()
 | id | prices |
---|---|---|
count | 836.000000 | 836.000000 |
mean | 1469.366029 | 50.340909 |
std | 285.036373 | 20.400653 |
min | 1001.000000 | 2.000000 |
25% | 1220.500000 | 36.000000 |
50% | 1460.500000 | 50.000000 |
75% | 1710.250000 | 64.000000 |
max | 1999.000000 | 120.000000 |
Now we have both a users table and an items table. The item prices follow a (clamped) normal distribution.
# quick sanity check of what np.random.normal returns
np.random.normal(0, 1, 10)
array([ 0.7365823 , 0.18626006, -0.10850225, -0.8307533 , 0.18490917,
0.51846682, -1.20012347, 0.31953487, -0.12546108, 0.75146225])
## Could not figure out how to sample users with normally distributed
## weights, so the user draw is uniform instead
# weights_users = np.random.normal(1000, 1000, len(df_users.id))

# clip at 1: random.choices requires non-negative weights
weights_items = np.clip(np.random.normal(1000, 1000, len(df_items.id)), 1, None)

# 10,000 transactions from a random subset of 268 users;
# item popularity follows the clipped normal weights
tran_user = random.choices(random.sample(list(df_users.id), 268), k=10000)
tran_items = random.choices(list(df_items.id), weights=weights_items, k=10000)
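For what it is worth, the commented-out user weights can work the same way as the item weights once the draws are clipped to stay non-negative. A minimal sketch (tran_user_alt is a hypothetical name, not used below):

weights_users = np.clip(np.random.normal(1000, 1000, len(df_users.id)), 1, None)
tran_user_alt = random.choices(list(df_users.id), weights=weights_users, k=10000)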
# attach product name and price to every transaction
df = pd.DataFrame.from_dict({'user_id': tran_user, 'item_id': tran_items})\
    .set_index('item_id')\
    .join(df_items.set_index('id'))
# total spend per user
df_ = df.groupby('user_id')[['prices']].sum()
df_.describe()
 | prices |
---|---|
count | 268.000000 |
mean | 1889.985075 |
std | 343.707196 |
min | 1161.000000 |
25% | 1649.750000 |
50% | 1896.500000 |
75% | 2114.500000 |
max | 2860.000000 |
# validate pareto
# regardless of the transaction table, we check our top 200 users
df_.reset_index(inplace=True)
(df_.sort_values('prices', ascending=False).head(200).sum())/df_.sum()
user_id 0.732714
prices 0.803493
dtype: float64
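A cumulative revenue curve makes this concentration easy to see; at 200 users the curve should cross roughly 0.8. A small sketch using the matplotlib import from the top of the notebook:

ranked = df_.sort_values('prices', ascending=False)['prices']
cum_share = ranked.cumsum() / ranked.sum()
plt.plot(np.arange(1, len(cum_share) + 1), cum_share.values)
plt.axvline(200, linestyle='--', color='gray')  # top 200 users...
plt.axhline(0.8, linestyle='--', color='gray')  # ...~80% of revenue
plt.xlabel('number of top users')
plt.ylabel('cumulative share of revenue')
plt.show()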
# join transactions back onto user demographics for the final table
merged = df.reset_index()\
    .set_index('user_id')\
    .join(df_users.set_index('id'), how='outer')\
    .reset_index()\
    .rename(columns={'index': 'product_id', 'level_0': 'user_id'})
merged
 | user_id | product_id | product | prices | first_name | last_name | email | gender |
---|---|---|---|---|---|---|---|---|
0 | 1 | 1013.0 | Shrimp - Black Tiger 13/15 | 78.0 | Pierette | Deners | pdeners0@1und1.de | Female |
1 | 1 | 1025.0 | Longos - Lasagna Veg | 48.0 | Pierette | Deners | pdeners0@1und1.de | Female |
2 | 1 | 1032.0 | Wine - Pinot Grigio Collavini | 54.0 | Pierette | Deners | pdeners0@1und1.de | Female |
3 | 1 | 1067.0 | Shiro Miso | 85.0 | Pierette | Deners | pdeners0@1und1.de | Female |
4 | 1 | 1073.0 | Pea - Snow | 69.0 | Pierette | Deners | pdeners0@1und1.de | Female |
... | ... | ... | ... | ... | ... | ... | ... | ... |
10727 | 1000 | 1866.0 | Figs | 56.0 | Hillery | Dickings | hdickingsrr@gizmodo.com | Male |
10728 | 1000 | 1902.0 | Wine - Harrow Estates, Vidal | 62.0 | Hillery | Dickings | hdickingsrr@gizmodo.com | Male |
10729 | 1000 | 1924.0 | Pancetta | 25.0 | Hillery | Dickings | hdickingsrr@gizmodo.com | Male |
10730 | 1000 | 1933.0 | Vodka - Smirnoff | 23.0 | Hillery | Dickings | hdickingsrr@gizmodo.com | Male |
10731 | 1000 | 1994.0 | Durian Fruit | 51.0 | Hillery | Dickings | hdickingsrr@gizmodo.com | Male |
10732 rows × 8 columns
merged.groupby('user_id')[['prices']].sum().sort_values('prices', ascending=False).head(200).sum()/merged['prices'].sum()
prices 0.803493
dtype: float64
# persist the final tables
merged.to_csv('merged.csv', index=False)
df_users.to_csv('users.csv', index=False)
df_items.to_csv('items.csv', index=False)
df_users
 | id | first_name | last_name | email | gender |
---|---|---|---|---|---|
0 | 1 | Pierette | Deners | pdeners0@1und1.de | Female |
1 | 2 | Gage | Opdenorth | gopdenorth1@hugedomains.com | Male |
2 | 3 | Raquel | Gabriel | rgabriel2@studiopress.com | Female |
3 | 4 | Matti | Sullly | msullly3@soundcloud.com | Female |
4 | 5 | Zita | Doggart | zdoggart4@ftc.gov | Female |
... | ... | ... | ... | ... | ... |
995 | 996 | Web | Beauly | wbeaulyrn@drupal.org | Male |
996 | 997 | Joachim | Silber | jsilberro@cam.ac.uk | Male |
997 | 998 | Kearney | Huntly | khuntlyrp@hud.gov | Male |
998 | 999 | Robb | Eads | readsrq@sfgate.com | Male |
999 | 1000 | Hillery | Dickings | hdickingsrr@gizmodo.com | Male |
1000 rows × 5 columns
df_items
 | id | product | prices |
---|---|---|---|
0 | 1001 | Bread Ww Cluster | 55.0 |
1 | 1002 | Glove - Cutting | 34.0 |
2 | 1003 | Ice Cream - Super Sandwich | 26.0 |
3 | 1004 | Wine - Chateau Bonnet | 52.0 |
4 | 1005 | Ecolab - Ster Bac | 46.0 |
... | ... | ... | ... |
992 | 1993 | Table Cloth 53x53 White | 3.0 |
993 | 1994 | Durian Fruit | 51.0 |
995 | 1996 | Chinese Foods - Plain Fried Rice | 23.0 |
996 | 1997 | Wine - White, Ej Gallo | 57.0 |
998 | 1999 | Sugar - Fine | 84.0 |
836 rows × 3 columns