Analyzing Starbucks Offers Data Set

Problem Statement

Data Files

Approach

Let's Get to Know the Data We Have

import pandas as pd

# Read the three JSON files, one record per line
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)

portfolio.head()
profile.head()
transcript.head()

Let's Explore and Make It Prettier

from sklearn.preprocessing import MultiLabelBinarizer

# Rename the portfolio id column so it can be merged with transcript later
portfolio.rename(columns={"id": "offer_id"}, inplace=True)

# One-hot encode the list-valued 'channels' column
binarizer_obj = MultiLabelBinarizer()
binarizer_obj.fit(portfolio['channels'])
offer_channels_df = pd.DataFrame(binarizer_obj.transform(portfolio['channels']),
                                 columns=binarizer_obj.classes_)

# Age 118 is a placeholder for customers with missing demographic data
age_outlier_df = profile[profile.age == 118]
age_outlier_df.shape
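Since age 118 marks customers who did not share their demographics, one common cleanup (a sketch, assuming NumPy/pandas and the `age` column above; the helper name is hypothetical) is to convert the placeholder to NaN so it does not skew summary statistics:

```python
import numpy as np
import pandas as pd

def mask_placeholder_age(df, placeholder=118):
    """Return a copy of df with the placeholder age set to NaN."""
    cleaned = df.copy()
    cleaned.loc[cleaned['age'] == placeholder, 'age'] = np.nan
    return cleaned

# Toy frame standing in for the real profile DataFrame
demo = pd.DataFrame({'age': [25, 118, 40, 118]})
cleaned = mask_placeholder_age(demo)
print(cleaned['age'].isna().sum())  # 2
```

With NaN in place, `cleaned['age'].mean()` ignores the missing entries automatically instead of being dragged upward by 118s.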
profile.groupby('gender')['age'].count().plot(kind="bar", title="Gender Distribution");
import matplotlib.pyplot as plt
import seaborn as sns

# Customers split by gender, from the profile DataFrame
male_customers = profile[profile.gender == 'M']
female_customers = profile[profile.gender == 'F']

current_palette = sns.color_palette()
sns.set(font_scale=1.5)
sns.set_style('white')

fig, ax = plt.subplots(figsize=(15, 6), nrows=1, ncols=2,
                       sharex=True, sharey=True)

plt.sca(ax[0])
sns.distplot(male_customers['income'] * 1E-3, color=current_palette[1])
plt.xlabel('Income [000]')
plt.ylabel('P(Income)')
plt.title('Male Customer Income')

plt.sca(ax[1])
sns.distplot(female_customers['income'] * 1E-3, color=current_palette[0])
plt.xlabel('Income [000]')
plt.ylabel('P(Income)')
plt.title('Female Customer Income')
plt.tight_layout()

The Unification Process

Models

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

random_state = 42  # assumed seed, set once for reproducibility

# Split features/target, holding out 20% for testing
X = dataset.copy().drop(columns=['offer_successful'])
y = dataset['offer_successful']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=random_state)

# Baseline: support vector classifier
svc_model = SVC(random_state=random_state)
svc_model.fit(X_train, y_train)
y_pred = svc_model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True);
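A confusion-matrix heatmap shows where a classifier errs, but single scores make the three models easy to compare side by side. A minimal sketch of the idea with scikit-learn's metrics, using toy labels in place of the real `y_test` / `y_pred` arrays:

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels standing in for y_test / y_pred from the cells above
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_hat  = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_hat)  # fraction of correct predictions
f1 = f1_score(y_true, y_hat)         # harmonic mean of precision and recall
print(f"accuracy={acc:.3f}  f1={f1:.3f}")  # accuracy=0.750  f1=0.750
```

F1 is the more informative of the two here, since a model that always predicts "offer successful" could still score a deceptively high accuracy on an imbalanced target.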
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(random_state=random_state, solver='liblinear')
lr_model.fit(X_train, y_train)
y_lr_pred = lr_model.predict(X_test)
cm = confusion_matrix(y_test, y_lr_pred)
sns.heatmap(cm, annot=True);
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf_clf = RandomForestClassifier(random_state=random_state)

# Number of trees in the forest
n_estimators = [100, 200, 300]
# Maximum depth of each tree
max_depth = [int(x) for x in range(5, 11)]
# Minimum number of samples required to split a node
min_samples_split = [2, 4, 6]
# Minimum number of samples required at each leaf node
min_samples_leaf = [2, 4]

# Parameter grid, searched exhaustively by GridSearchCV
random_grid = {'n_estimators': n_estimators,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

randf = GridSearchCV(estimator=rf_clf,
                     param_grid=random_grid,
                     cv=3,
                     verbose=2,
                     n_jobs=3)
randf.fit(X_train, y_train)

# Predict with the best estimator found by the search
y_rf_opt_pred = randf.best_estimator_.predict(X_test)
rf_cm1 = confusion_matrix(y_test, y_rf_opt_pred)
sns.heatmap(rf_cm1, annot=True);
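A tuned random forest also exposes `feature_importances_`, which can reveal which customer or offer attributes drive the predictions. A self-contained sketch on synthetic data (the column names here are purely illustrative, not the project's actual engineered features):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the unified dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=4,
                                     n_informative=2, random_state=42)
cols = ['income', 'age', 'difficulty', 'duration']  # hypothetical names
forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(pd.DataFrame(X_demo, columns=cols), y_demo)

# Importances sum to 1.0; higher means the feature split more impurity away
importances = pd.Series(forest.feature_importances_, index=cols)
print(importances.sort_values(ascending=False))
```

In the real project one would call this on `randf.best_estimator_` with `X_train.columns` as the index.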

Results

Challenges Faced

Conclusion

Manager Strategic Cloud Services (Oracle Cloud ERP, EPM), Integration Specialist, Big Data, Data Science & Python Enthusiast
