Please use the attached sample and Project reports.

**How to Prepare the Portfolio**

The portfolio report must be typewritten and should be a minimum of 3 complete pages in length. A project report that is two and a half pages (2.5 pages) is **not acceptable**. Large margins that have been increased to meet the length requirement are also **not acceptable**. If your report is not submitted according to the stated specifications, you will be asked to re-write it.

Do not submit the original project; the report is meant to capture the project highlights and significant points of the original project.

You will write a report on the project that includes:

- Introduction
- Explanation of the solution
- Description of the results
- Description of your contributions to the project
- Explanation of what new skills, techniques, or knowledge you acquired from the project. If it was a group project, also include a list of the team members who worked on it.
- A reference section with at least 4 references and citations to those references inside the text. Use the __IEEE Referencing Style Sheet__ for citation guidelines.


Individual Contribution Report

Pradeep Peddnade

Id: 1220962574


Reflection:

My overall role in the team was data analyst. I was responsible for combining theory and practice to produce and communicate data insights that enabled my team to make informed inferences about the data. Using skills such as data analytics and statistical modeling, I mined and gathered the data. Once the data was ready, I performed exploratory analysis of the native-country, race, education, and work class variables of the dataset.

My second responsibility as data analyst was to apply statistical tools to interpret the mined data, paying specific attention to the trends and patterns that could support predictive analytics, so that the group could make informed decisions and predictions.

I also worked on data cleansing for the group. This involved managing the data through a procedure that ensures it is properly formatted and that irrelevant data points are removed.

Lessons Learned:

The advice I would share with others regarding research design is to keep the design straightforward and aimed at answering the research question. An appropriate research design helps the group answer the research question effectively. I would also tell the team to consider, at the time of data collection, how the data from each source will be analyzed and turned into something the team can use. To apply these lessons, the team should make sure the data is analyzed and structured appropriately: cleanse the data, and remove or normalize outliers.

As a group, we can conclude that the research was an honest effort and that its lessons apply beyond the project. Our data analytics skills ensured that the analyzed data was collected from primary sources, which protected the group from the bias of previously conducted research. In today's data world there is an almost unlimited amount of data, so choosing the right variables to answer the research questions, using correlation and other techniques, is very important.
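As a rough illustration of choosing variables by correlation, the sketch below ranks numeric columns by the absolute value of their Pearson correlation with a binary target. The toy DataFrame, its values, the `high-income` column, and the helper name `rank_by_correlation` are assumptions made for this example; they are not the project's data or code.

```python
import pandas as pd

def rank_by_correlation(frame: pd.DataFrame, target: str) -> pd.Series:
    """Absolute Pearson correlation of each numeric column with `target`, highest first."""
    numeric = frame.select_dtypes("number")
    corr = numeric.corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False)

# Toy data with assumed values; "high-income" is 1 for >50K and 0 otherwise.
toy = pd.DataFrame({
    "education-num": [9, 13, 10, 14, 7, 16],
    "hours-per-week": [40, 40, 35, 60, 40, 50],
    "high-income": [0, 1, 0, 1, 0, 1],
})
ranking = rank_by_correlation(toy, "high-income")
print(ranking)
```

A ranking like this is only a first screen: correlation captures linear association, so a low value does not by itself prove a variable is irrelevant.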

Assessment:

An additional skill that I learned from the course and the project work is choosing the visualization type and the variables from the data set, which is very important in data analysis. Through this skill, I was able to conceptualize, analyze, and interpret large data sets that require data modeling and management. It was also through the group that I developed my communication skills, since the data analyst role needed a good communicator who could interpret and explain the various inferences to the group.

Because group members were in different time zones, scheduling a time to meet was strenuous. Everyone on the team was accommodating.

Future Application:

In my current role, I will analyze cluster metrics and logs to monitor the health of the different services using Elasticsearch, Kibana, and Grafana. The topics I learned in this course will be very useful: I can apply them in building a metrics-based Kibana dashboard for management to see the usage and cost incurred by each service running in the cluster. I will also use statistical methods to pick the fields of interest among the thousands of available fields.


2/26/22, 9:04 PM CSE578Project

localhost:8888/nbconvert/html/CSE578Project.ipynb?download=false 1/13

In [71]:
import warnings
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from statsmodels.graphics.mosaicplot import mosaic

%matplotlib inline

# Fields are separated by ", "; a multi-character separator needs the Python parser engine.
df = pd.read_csv("data/adult.data", header=None, sep=", ", engine="python")
df.columns = ["age", "workclass", "fnlwgt", "education", "education-num",
              "marital-status", "occupation", "relationship", "race", "sex",
              "capital-gain", "capital-loss", "hours-per-week", "native-country", "class"]

# Drop rows where a categorical column holds the missing-value marker "?".
for column in ["workclass", "education", "marital-status", "occupation",
               "relationship", "race", "sex", "native-country"]:
    df = df[df[column] != "?"]

# Split by income class for the per-class plots that follow.
below = df[df["class"] == "<=50K"]
above = df[df["class"] == ">50K"]

In [61]:
above_50k = Counter(above["native-country"])
below_50k = Counter(below["native-country"])
print("native-country")

# One pie chart per income class.
fig, axes = plt.subplots(ncols=1, nrows=2, figsize=(5, 10))
axes[0].pie(above_50k.values(), labels=above_50k.keys(), autopct="%1.0f%%")
axes[0].set_title(">50K")
axes[1].pie(below_50k.values(), labels=below_50k.keys(), autopct="%1.0f%%")
axes[1].set_title("<=50K")
plt.show()

native-country

In [62]:
above_50k = Counter(above["race"])
below_50k = Counter(below["race"])
print("race")

fig, axes = plt.subplots(ncols=1, nrows=2, figsize=(5, 10))
axes[0].pie(above_50k.values(), labels=above_50k.keys(), autopct="%1.0f%%")
axes[0].set_title(">50K")
axes[1].pie(below_50k.values(), labels=below_50k.keys(), autopct="%1.0f%%")
axes[1].set_title("<=50K")
plt.show()

race

In [63]:
above_50k = Counter(above["education"])
below_50k = Counter(below["education"])
print("education")

fig, axes = plt.subplots(ncols=1, nrows=2, figsize=(5, 10))
axes[0].pie(above_50k.values(), labels=above_50k.keys(), autopct="%1.0f%%")
axes[0].set_title(">50K")
axes[1].pie(below_50k.values(), labels=below_50k.keys(), autopct="%1.0f%%")
axes[1].set_title("<=50K")
plt.show()

education

In [64]:
above_50k = Counter(above["workclass"])
below_50k = Counter(below["workclass"])
print("workclass")

fig, axes = plt.subplots(ncols=1, nrows=2, figsize=(5, 10))
axes[0].pie(above_50k.values(), labels=above_50k.keys(), autopct="%1.0f%%")
axes[0].set_title(">50K")
axes[1].pie(below_50k.values(), labels=below_50k.keys(), autopct="%1.0f%%")
axes[1].set_title("<=50K")
plt.show()

workclass

In [65]:
fig, axes = plt.subplots(ncols=2, nrows=3, figsize=(8, 8))
fig.subplots_adjust(hspace=.5)

# Each row is one (x, y) variable pair; the left column is <=50K, the right column is >50K.
pairs = [("capital-gain", "age"), ("age", "hours-per-week"), ("hours-per-week", "capital-gain")]
groups = [(below, "<=50K"), (above, ">50K")]

for row, (x_col, y_col) in enumerate(pairs):
    for col, (data, title) in enumerate(groups):
        ax = axes[row, col]
        ax.scatter(data[x_col], data[y_col])
        ax.set_title(title)
        ax.set_xlabel(x_col)
        ax.set_ylabel(y_col)

plt.show()

In [50]:
fig, axes = plt.subplots(ncols=1, nrows=1, figsize=(15, 10))
fig.subplots_adjust(hspace=.5)
mosaic(df, ["occupation", "class"], ax=axes, axes_label=False)
plt.show()

In [51]:
fig, axes = plt.subplots(ncols=1, nrows=1, figsize=(15, 10))
fig.subplots_adjust(hspace=.5)
mosaic(df, ["marital-status", "class"], ax=axes, axes_label=False)
plt.show()

In [54]:
fig, axes = plt.subplots(ncols=1, nrows=1, figsize=(15, 12))
fig.subplots_adjust(hspace=.5)
mosaic(df, ["education-num", "class"], ax=axes, axes_label=False)
plt.show()

In [90]:
train = df

# Drop the columns that are not used for modeling.
train = train.drop("capital-loss", axis=1)
train = train.drop("native-country", axis=1)
train = train.drop("fnlwgt", axis=1)
train = train.drop("education", axis=1)

# Each helper below maps one categorical value to an integer code.
def get_occupation(x):
    if x in ["Exec-managerial", "Prof-specialty", "Protective-serv"]:
        return 1
    elif x in ["Sales", "Transport-moving", "Tech-support", "Craft-repair"]:
        return 2
    else:
        return 3

def get_relationship(x):
    if x == "Own-child":
        return 6
    elif x == "Other-relative":
        return 5
    elif x == "Unmarried":
        return 4
    elif x == "Not-in-family":
        return 3
    elif x == "Husband":
        return 2
    else:
        return 1

def get_race(x):
    if x == "Other":
        return 5
    elif x == "Amer-Indian-Eskimo":
        return 4
    elif x == "Black":
        return 3
    elif x == "White":
        return 2
    else:
        return 1

def get_sex(x):
    if x == "Male":
        return 2
    else:
        return 1

def get_class(x):
    if x == ">50K":
        return 1
    else:
        return 0

def get_workclass(x):
    if x == "Without-pay":
        return 7
    elif x == "Private":
        return 6
    elif x == "State-gov":
        return 5
    elif x == "Self-emp-not-inc":
        return 4
    elif x == "Local-gov":
        return 3
    elif x == "Federal-gov":
        return 2
    else:
        return 1

def get_marital_status(x):
    if x == "Never-married":
        return 7
    elif x == "Separated":
        return 6
    elif x == "Married-spouse-absent":
        return 5
    elif x == "Widowed":
        return 4
    elif x == "Divorced":
        return 3
    elif x == "Married-civ-spouse":
        return 2
    else:
        return 1

train["workclass"] = train["workclass"].apply(get_workclass)
train["marital-status"] = train["marital-status"].apply(get_marital_status)
train["occupation"] = train["occupation"].apply(get_occupation)
train["relationship"] = train["relationship"].apply(get_relationship)
train["race"] = train["race"].apply(get_race)
train["sex"] = train["sex"].apply(get_sex)
train["class"] = train["class"].apply(get_class)

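Out[90]:

| | age | workclass | education-num | marital-status | occupation | relationship | race | sex | capital-gain | hours-per-week | cla |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 5 | 13 | 7 | 3 | 3 | 2 | 2 | 2174 | 40 | |
| 1 | 50 | 4 | 13 | 2 | 1 | 2 | 2 | 2 | 0 | 13 | |
| 2 | 38 | 6 | 9 | 3 | 3 | 3 | 2 | 2 | 0 | 40 | |
| 3 | 53 | 6 | 7 | 2 | 3 | 2 | 3 | 2 | 0 | 40 | |
| 4 | 28 | 6 | 13 | 2 | 1 | 1 | 3 | 1 | 0 | 40 | |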

In [96]:
test = pd.read_csv("data/adult.test", header=None, sep=", ", engine="python")

# The last column of train is the encoded class label; the rest are features.
feature = train.iloc[:, :-1]
labels = train.iloc[:, -1]
feature_matrix1 = feature.values
labels1 = labels.values

train_data, test_data, train_labels, test_labels = train_test_split(
    feature_matrix1, labels1, test_size=0.2, random_state=42
)

# Note: each split is scaled with its own MinMaxScaler, so the held-out split
# is rescaled by its own minimum and maximum rather than the training ones.
transformed_train_data = MinMaxScaler().fit_transform(train_data)
transformed_test_data = MinMaxScaler().fit_transform(test_data)

In [114]:
mod = LogisticRegression().fit(transformed_train_data, train_labels)
test_predict = mod.predict(transformed_test_data)

acc = accuracy_score(test_labels, test_predict)
f1 = f1_score(test_labels, test_predict)
prec = precision_score(test_labels, test_predict)
rec = recall_score(test_labels, test_predict)

In [115]:
# Columns: accuracy, F1, precision, recall, model name.
print("%.4f\t%.4f\t%.4f\t%.4f\t%s" % (acc, f1, prec, rec, "Logistic Regression"))

0.8409 0.6404 0.7500 0.5588 Logistic Regression
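One caveat about the scaling step in the notebook: calling `fit_transform` separately on the held-out split rescales it with its own minimum and maximum. A common alternative, sketched here on synthetic arrays, fits the scaler on the training split only and reuses it for the test split. The array values are made up for illustration, and the metrics reported above were not produced this way.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-ins for the train/test feature matrices (assumed values).
train_data = np.array([[20.0, 0.0], [40.0, 5000.0], [60.0, 10000.0]])
test_data = np.array([[30.0, 2500.0], [70.0, 0.0]])

scaler = MinMaxScaler().fit(train_data)          # learn min/max from training rows only
transformed_train = scaler.transform(train_data)
transformed_test = scaler.transform(test_data)   # reuse the training min/max

print(transformed_test)
```

With this approach a test value outside the training range maps outside [0, 1] (the age of 70 here becomes 1.25), which is expected and usually harmless for logistic regression.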

Course Title Portfolio

Name

Email

Abstract—This document This document This document

This document This document This document This document

This document This document This document This document

This document This document This document This document

This document This document This document This document

This document This document This document This document

This document This document This document This document

This document This document This document This document

This document.

Keywords—mean, standard deviation, variance, probability

density function, classifier

I. INTRODUCTION

This document This document This document This

document This document This document This document This

document This document This document This document This

document This document This document This document This

document. [1].

This project practiced the use of density estimation

through several calculations via the Naïve Bayes Classifier.

The data for each equation was used to find the probability of

the mean for. Without using a built-in function, the first

feature, the mean, could be calculated using the equation in

Fig. 1. The second feature, the standard deviation, could be

calculated using the equation in Fig. 2. Utilizing the training

set for digit 0, the mean of the pixel brightness values was

determined by calling ‘numpy.mean()digit 0 or digit 1. The

test images were then classified based on the previous

calculations and the accuracy of the computations were

determined.

The project consisted of 4 tasks:

A. Extract features from the original training set

There were two features that needed to be extracted from

the original training set for each image. The first feature was

the average pixel brightness values within an image array.

The second was the standard deviation of all pixel

brightness values within an image array.
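The feature extraction in task A can be sketched as follows, assuming the images arrive as a NumPy array of shape (number of images, height, width). The function name `extract_features` and the tiny 2x2 example images are hypothetical and only serve the illustration.

```python
import numpy as np

def extract_features(images: np.ndarray) -> np.ndarray:
    """Map (n_images, height, width) to (n_images, 2):
    column 0 is the mean brightness, column 1 the standard deviation."""
    flat = images.reshape(len(images), -1)   # one row of pixels per image
    return np.column_stack([flat.mean(axis=1), flat.std(axis=1)])

# Two tiny 2x2 "images" with assumed pixel values, for illustration only.
images = np.array([[[0, 0], [0, 0]],
                   [[0, 2], [2, 4]]], dtype=float)
features = extract_features(images)
print(features)
```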

B. Calculate the parameters for the two-class Naïve Bayes

Classifiers

Using the features extracted from task A, multiple

calculations needed to be performed. For the training set

involving digit 0, the mean of all the average brightness

values was calculated. The variance was then calculated for

the same feature, regarding digit 0. Next, the mean of the

standard deviations involving digit 0 had to be computed. In

addition, the variance for the same feature was determined.

These four calculations had to then be repeated using the

training set for digit 1.
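A minimal sketch of the four per-digit parameters from task B, assuming the features of one digit are stored as rows of (mean brightness, standard deviation). The helper name `class_parameters` and the sample values are hypothetical. `numpy.var` is used with its default population form (dividing by n); the text does not say whether the population or the sample variance was intended.

```python
import numpy as np

def class_parameters(features: np.ndarray) -> dict:
    """Mean and variance of each feature column for one digit.
    `features` has shape (n_images, 2): mean brightness, brightness std."""
    return {"mean": features.mean(axis=0), "var": features.var(axis=0)}

# Assumed feature rows for a handful of digit-0 training images.
features_digit0 = np.array([[44.0, 87.0], [46.0, 89.0], [42.0, 85.0], [48.0, 91.0]])
params0 = class_parameters(features_digit0)
print(params0)
```

The same call on the digit-1 feature rows would give the other four parameters.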

C. Classify all unknown labels of incoming data

Using the parameters obtained in task B, every image in

each testing sample had to be compared with the

corresponding training set for that particular digit, 0 or 1.

The probability of that image being a 0 or 1 needed to be

determined so it can then be classified.
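The classification step in task C can be illustrated with a small Gaussian Naïve Bayes sketch: for each digit, multiply the prior by the Gaussian likelihood of each feature and pick the digit with the larger product. The function names and the parameter values below are assumptions made for this example, not results from the project.

```python
import math

def gaussian_pdf(x: float, mean: float, var: float) -> float:
    """Density of a normal distribution with the given mean and variance at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def classify(feature_row, params_by_digit, prior=0.5):
    """Return the digit whose prior times per-feature likelihoods is largest."""
    scores = {}
    for digit, (means, variances) in params_by_digit.items():
        score = prior
        for x, m, v in zip(feature_row, means, variances):
            score *= gaussian_pdf(x, m, v)
        scores[digit] = score
    return max(scores, key=scores.get)

# Hypothetical parameters: (means, variances) of the two features per digit.
params = {0: ((45.0, 88.0), (5.0, 5.0)), 1: ((19.0, 61.0), (4.0, 6.0))}
print(classify((44.0, 87.0), params))
print(classify((20.0, 60.0), params))
```

For images with many features, summing log-likelihoods instead of multiplying densities avoids numerical underflow; with only two features the direct product is adequate.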

D. Calculate the accuracy of the classifications

Using the predicted classifications from task C, the

accuracy of the predictions needed to be calculated for both

digit 0 and digit 1, respectively.
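The accuracy for one test sample is the share of its images that received the correct label. A minimal sketch, with a hypothetical helper name and made-up predictions:

```python
def accuracy(predicted, true_digit):
    """Fraction of predictions equal to the known digit of a test sample."""
    if not predicted:
        raise ValueError("no predictions given")
    return sum(1 for p in predicted if p == true_digit) / len(predicted)

# Assumed predictions for a test sample that contains only digit-0 images.
predicted_for_digit0 = [0, 0, 1, 0, 0, 0, 1, 0]
acc_digit0 = accuracy(predicted_for_digit0, true_digit=0)
print(acc_digit0)
```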

Each equation was used to find the probability of the

mean for. Without using a built-in function, the first feature,

the mean, could be calculated using the equation in Fig. 1.

The second feature, the standard deviation, could be

calculated using the equation in Fig. 2. Utilizing the training

set for digit 0, the mean of the pixel brightness values was

determined by calling ‘numpy.mean()of the data. These

features helped formulate the probability density function

when determining the classification.

II. DESCRIPTION OF SOLUTION

This project required a series of computations in order to

successfully equation was used to find the probability of the

mean for. Without using a built-in function, the first feature,

the mean, could be calculated using the equation in Fig. 1.

The second feature, the standard deviation, could be

calculated using the equation in Fig. 2. Utilizing the training

set for digit 0, the mean of the pixel brightness values was

determined by calling ‘numpy.mean(). Once acquiring the

data, the appropriate calculations could be made.

A. Finding the mean and standard deviation

The data was provided in the form of NumPy arrays,

which made it useful for performing routine mathematical

operations equation was used to find the probability of the

mean for. Without using a built-in function, the first feature,

the mean, could be calculated using the equation in Fig. 1.

The second feature, the standard deviation, could be

calculated using the equation in Fig. 2. Utilizing the training

set for digit 0, the mean of the pixel brightness values was

determined by calling ‘numpy.mean()by calling

‘numpy.std()’, another useful NumPy function. These

extracted features from the training set for digit 0 also had to

be evaluated from the training set for digit 1. Once all the

features for each image were obtained from both training

sets, the next task could be completed.

Equ. 1. Mean formula
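For reference, and assuming the standard definition is the one intended by Equ. 1, the mean of the $n$ pixel brightness values $x_1, \dots, x_n$ of one image is:

```latex
\mu = \frac{1}{n} \sum_{i=1}^{n} x_i
```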

B. Determining the parameters for the Naïve Bayes

Classifiers

To equation was used to find the probability of the mean

for. Without using a built-in function, the first feature, the

mean, could be calculated using the equation in Fig. 1. The

second feature, the standard deviation, could be calculated

using the equation in Fig. 2. Utilizing the training set for

digit 0, the mean of the pixel brightness values was

determined by calling ‘numpy.mean() and the array of the

standard deviations created for digit 1.

Equ. 2. Variance formula
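Assuming Equ. 2 is the usual population variance, which is also what `numpy.var()` and `numpy.std()` compute by default, it can be written with the mean $\mu$ of the $n$ values $x_1, \dots, x_n$ as below. The sample form would divide by $n - 1$ instead; the text does not state which one was used.

```latex
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2, \qquad \sigma = \sqrt{\sigma^2}
```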

This equation was used to find the probability of the

mean for. Without using a built-in function, the first feature,

the mean, could be calculated using the equation in Fig. 1.

The second feature, the standard deviation, could be

calculated using the equation in Fig. 2. Utilizing the training

set for digit 0, the mean of the pixel brightness values was

determined by calling ‘numpy.mean()’ for each image in the

set. In addition, the standard deviation of the pixel

brightness values was calculated for each image by calling

‘numpy.std()’, another useful NumPy function. These

extracted features from the training. This was multiplied by

the prior probability, which is 0.5 in this case because the

value is either a 0 or a 1.
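As a sketch of the calculation described here, assuming the Gaussian form of the probability density function: for a digit $d$ with per-feature parameters $\mu_{d,k}$ and $\sigma_{d,k}^2$, where $x_1$ is the mean brightness and $x_2$ the standard deviation of an image, the density of one feature and the score compared between the two digits are:

```latex
p(x_k \mid \mu_{d,k}, \sigma_{d,k}^2)
  = \frac{1}{\sqrt{2\pi\sigma_{d,k}^2}}
    \exp\!\left( -\frac{(x_k - \mu_{d,k})^2}{2\sigma_{d,k}^2} \right)

P(d \mid x_1, x_2) \;\propto\; P(d)\,
  p(x_1 \mid \mu_{d,1}, \sigma_{d,1}^2)\,
  p(x_2 \mid \mu_{d,2}, \sigma_{d,2}^2),
  \qquad P(d) = 0.5
```

The product of the two densities relies on the Naïve Bayes assumption that the features are independent given the digit.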

This ]. Without using a built-in function, the first feature,

the mean, could be calculated using the equation in Fig. 1.

The second feature, the standard deviation, could be

calculated using the equation in Fig. 2. Utilizing the training

set for digit 0, the mean of the pixel brightness values was

determined by calling ‘numpy.mean()’ for each image in the

set. In addition, the standard deviation of the pixel

brightness values was calculated for each image by calling

‘numpy.std()’, another useful NumPy function. These

extracted features from the training.

This entire procedure had to be conducted once again but

utilizing the test sample for digit 1 instead. This meant

finding the mean and standard deviation of each image, using

the probability density function to calculate the probability of

the mean and probability of the standard deviation for digit 0,

and calculating the probability that the image is classified as

digit 0. The same operations had to be performed again, but

for the training set for digit 1. The probability of the image

being classified as digit 0 had to be compared to the

probability of the image being classified as digit 1. Again,

the larger of the two values suggested which digit to classify

as the label.

One aspect of machine learning that I understood better

after completion of the project was Gaussian distribution.

This normalized distribution style displays a bell-shape of

data in which the peak of the bell is where the mean of the

data is located [4]. A bimodal distribution is one that

displays two bell-shaped distributions on the same graph.

After calculating the features for both digit 0 and digit 1, the

probability density function gave statistical odds of that

particular image being classified under a specific bell-

shaped curve. An example of a bimodal distribution can be

seen in Fig. 7 below.

C. Determining the accuracy of the label

The mean for. Without using a built-in function, the first

feature, the mean, could be calculated using the equation in

Fig. 1. The second feature, the standard deviation, could be

calculated using the equation in Fig. 2. Utilizing the training

set for digit 0, the mean of the pixel brightness values was

determined by calling ‘numpy.mean()’ for each image in the

set. In addition, the standard deviation of the pixel

brightness values was calculated for each image by calling

‘numpy.std()’, another useful NumPy function. These

extracted features from the by the total number of images in

the test sample for digit 1.

III. RESULTS

mean for. Without using a built-in function, the first

feature, the mean, could be calculated using the equation in

Fig. 1. The second feature, the standard deviation, could be

calculated using the equation in Fig. 2. Utilizing the training

set for digit 0, the mean of the pixel brightness values was

determined by calling ‘numpy.mean()’ for each image in the

set. In addition, the standard deviation of the pixel

brightness values was calculated for each image by calling

‘numpy.std()’, another useful NumPy function. These

extracted features from the also higher.

TABLE I. TRAINING SET FOR DIGIT 0

TTTTTTTT 000000

XXXXX 000000

When comparing the test images, the higher values of the means and the standard deviations were typically labeled as digit 0 and the lower ones as digit 1. However, this was not always the case; otherwise the calculated accuracy would have been 100%.

The e. After classifying all the images in the test sample

for digit 0, the total amount predicted as digit 0 was 899.

This meant that the accuracy of classification was 0000%,

which can be represented in Fig. 5.

Fig. 1. Accuracy of classification for digit 0

The total amount of images in the test sample for digit 1

0000. After classifying all the images in the test sample for

digit 1, the total amount predicted as digit 00000. This

meant that the accuracy of classification was 00000%,

which can be represented in Fig. 6.

IV. LESSONS LEARNED

The procedures practiced in this project required skill in

the Python programming language, as well as understanding

concepts of statistics. It required plenty of practice to

implement statistical equations, such as finding the mean,

the standard deviation, and the variance. My foundational

knowledge of mathematical operations helped me gain an

initial understanding of how to set up classification

problems. My lack of understanding of the Python language

made it difficult to succeed initially. Proper syntax and

built-in functions had to be learned first before continuing

with solving the classification issue. For example, I had very

little understanding of NumPy prior to this project. I learned

that it was extremely beneficial for producing results of

mathematical operations. One of the biggest challenges for

me was creating and navigating through NumPy arrays

rather than a Python array. Looking back, it was a simple

issue that I solved after understanding how they were

uniquely formed. Once I had a grasp on the language and

built-in functions, I was able to create the probability

density function in the code and then apply classification

towards each image.

One aspect of machine learning that I understood better

after completion of the project was Gaussian distribution.

This normalized distribution style displays a bell-shape of

data in which the peak of the bell is where the mean of the

data is located [4]. A bimodal distribution is one that

displays two bell-shaped distributions on the same graph.

After calculating the features for both digit 0 and digit 1, the

probability density function gave statistical odds of that

particular image being classified under a specific bell-

shaped curve. An example of a bimodal distribution can be

seen in Fig. 7 below.


Fig. 2. Bimodal distribution example [5]

Upon completion of the project, I was able to realize that

completion of the project was Gaussian distribution. This

normalized distribution style displays a bell-shape of data in

which the peak of the bell is where the mean of the data is

located [4]. A bimodal distribution is one that displays two

bell-shaped distributions on the same graph. After

calculating the features for both digit 0 and digit 1, the

probability density function gave statistical odds of that

particular image being classified under a specific bell-

shaped curve. An example of a bimodal distribution can be

seen in Fig. 7 below.


[Figure: Accuracy for Digit 0. Legend: predicted as digit 0, predicted as digit 1]

V. REFERENCES

[1] N. Kumar, "Naïve Bayes Classifiers," GeeksforGeeks, May 15, 2020. Accessed on: Oct. 15, 2021. [Online]. Available: https://www.geeksforgeeks.org/naive-bayes-classifiers/

[2] J. Brownlee, "How to Develop a CNN for MNIST Handwritten Digit Classification," Aug. 24, 2020. Accessed on: Oct. 15, 2021. [Online]. Available: https://machinelearningmastery.com/how-to-develop-a-convolutional-neural-network-from-scratch-for-mnist-handwritten-digit-classification/

[3] "What is NumPy," June 22, 2021. Accessed on: Oct. 15, 2021. [Online]. Available: https://numpy.org/doc/stable/user/whatisnumpy.html

[4] J. Chen, "Normal Distribution," Investopedia, Sept. 27, 2021. Accessed on: Oct. 15, 2021. [Online]. Available: https://www.investopedia.com/terms/n/normaldistribution.asp

[5] "Bimodal Distribution," Velaction, n.d. Accessed on: Oct. 15, 2021. [Online]. Available: https://www.velaction.com/bimodal-distribution/


CSE 578: Data Visualization

Systems Documentation Report

Members of Team 44: Pradeep Peddnade, Jieqiong Zhou, Tian Liang, Sukhwan Yun

1. Roles and responsibilities

Product owners: XYZ corporation

Stakeholders: UVW College

Data analysis team members:

• Pradeep Peddnade: exploratory analysis for native-country, race, education and work class of the dataset; machine learning model training and testing of these variables.

• Jieqiong Zhou: progress report; exploratory analysis for sex and marital-status of the dataset.

• Tian Liang: systems documentation report; exploratory analysis for occupation, capital-loss, weight and working hours per week of the dataset; insight analysis for 2 variables.

• Sukhwan Yun: executive report; data exploration and data analysis of age, education-num, capital-gain and relationship of the dataset.

2. Team goals and a business objective

Our understanding of the project is that we are to assist UVW College in its effort to boost enrollment. The college believes it should target individuals based on their annual income. It drew a line at $50,000 and would like us to classify individuals into two categories: annual salary above and below $50,000.

We will use US Census Bureau data to establish the correlation between annual income and an individual’s other attributes, such as capital gain, capital loss, education, work class, marital status, etc. We will start with an exploratory analysis to determine which parameters are important and which ones are irrelevant. Then we will select the most relevant data for in-depth visualization and machine learning. Eventually, we will be able to predict whether an individual’s annual salary is above or below $50,000 based on this person’s other attributes.

3. Assumptions

UVW College assumes people within a certain salary range are more likely to enroll in their degree

program. Therefore, they need to know if a person’s annual salary is above or below $50,000.

UVW College assumes the US census data can be used to indicate the likelihood of a person’s annual

income based on other status and data such as age, gender, education status, marital status,

occupation, etc.

It is assumed that the data from the United States Census Bureau is accurate. The data used for this

study is representative of the individuals to be included in this data analysis.

4. User Stories

User Story #1: To increase enrollment, a staff member of the UVW marketing team would like to know the relationship between occupation and income.

User Story #2: An associate in the UVW marketing group would like to understand the capital loss of people in the data.

User Story #3: A marketing analyst suggested that work hours per week could be a factor affecting people’s income and would like data to back this hypothesis.

User Story #4: The director of marketing would like to know whether final weight has anything to do with the income of the people interviewed in the census data.

User Story #5: A senior staff member in the marketing department is interested in how education-num is correlated with income.

User Story #6: The marketing group just had a meeting on how to increase enrollment. One of the action items is to understand whether an individual’s marital status is related to that person’s salary.

User Story #7: An intern in the marketing group suggested studying the relationship between capital gain and the income of individuals in this data.

User Story #8: The director of marketing asked the team members to analyze the relationship between work class and an individual’s annual salary.

5. Visualizations

Figure 1. Percentage of people with a salary > 50K for each occupation.

To visualize occupation data, we chose a bar chart because it is well suited to categorical data. Figure 1 shows the percentage of people with a salary > 50K for each occupation. Certain occupations, such as executive and managerial positions, professional specialty, protective services, and tech support, have a higher percentage (>30%) of individuals with an annual salary above $50,000, while others, such as private household services, other services, and handlers-cleaners, have a lower percentage (<10%).
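The per-occupation percentage plotted in Figure 1 can be computed with a single grouped mean. The following is a minimal sketch on a small made-up table (the rows are illustrative, not census data). The helper name `share_above_50k` is hypothetical; the column names `occupation` and `class` follow the appendix.

```python
import pandas as pd

def share_above_50k(df, column, target="class", positive=">50K"):
    """Return the percentage of rows labelled `positive` for each value of `column`."""
    is_above = df[target].eq(positive)               # True where the salary label is >50K
    pct = is_above.groupby(df[column]).mean() * 100  # mean of booleans = share of True
    return pct.sort_values(ascending=False)

# Illustrative toy rows, not taken from the census data.
toy = pd.DataFrame({
    "occupation": ["Sales", "Sales", "Tech-support", "Tech-support", "Sales", "Tech-support"],
    "class": [">50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K"],
})
result = share_above_50k(toy, "occupation")
print(result)
```

The same helper could be reused for any other categorical column, such as `workclass` or `marital-status`.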

Figure 2. Education number of individuals with salary above (left) and below (right) $50,000.

To visualize education number data, we chose a box plot because the education number values are widely spread and a box plot summarizes such a distribution well. Figure 2 shows the education number of individuals with salary above and below $50,000. In the group with an annual salary above $50,000, the median education number is higher than in the group with an annual salary below $50,000. In addition, the upper quartile of the above-$50,000 group is higher than that of the below-$50,000 group. These two differences suggest that individuals with more education are more likely to be in the group with an annual salary above $50,000.
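The median and quartiles that a box plot draws can also be read off numerically. The sketch below assumes the column names from the appendix and uses a handful of invented rows, so the numbers illustrate the comparison rather than reproduce Figure 2.

```python
import pandas as pd

# Illustrative toy rows, not taken from the census data.
toy = pd.DataFrame({
    "education-num": [9, 10, 13, 14, 9, 7, 10, 16],
    "class": ["<=50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K", ">50K", ">50K"],
})

# 25th, 50th and 75th percentiles of education-num within each income class.
summary = (
    toy.groupby("class")["education-num"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(summary)
```

Comparing the `0.5` and `0.75` columns between the two rows gives the same median and upper-quartile comparison that the text makes from the figure.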

Figure 3. Percentage of individuals with annual salary above $50,000.

To visualize capital gain data, we chose a scatter plot because it is suitable for showing relationships in continuous data. Figure 3 shows the percentage of individuals with an annual salary above $50,000. To make the graph clearer, we grouped the data: each point in Figure 3 represents a group of individuals within a capital gain range of $2,000. For example, the leftmost point represents individuals with a capital gain between $0 and $2,000.

Based on Figure 3, for individuals with a capital gain below $10,000, the percentage of individuals with an annual salary above $50,000 increases as capital gain increases. This means salary and capital gain are positively correlated in this data range.

However, for individuals with a capital gain above $10,000, the correlation between salary and capital gain is very weak, if there is any. The data points jump between 0% and 100%, and no pattern or correlation can be found. This is due to the scarcity of capital gain data points in the range above $10,000. As the histogram of capital-gain in Figure 4 shows, the majority of capital gain data points are below $10,000. The data points above $10,000 are too scarce to support a statistically meaningful conclusion.

In summary, annual salary is positively correlated with capital gain for individuals with a capital gain below $10,000. For individuals with a capital gain above $10,000, it is hard to draw a conclusion because the data are too sparse.
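The grouping into $2,000 ranges can be sketched as follows. The sketch uses the same round-up rule as the appendix (`np.ceil`), and it also reports the number of individuals in each bin so that sparsely populated bins are easy to spot. The helper name `binned_share` and the toy rows are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def binned_share(df, column, width, target="class", positive=">50K"):
    """Group `column` into bins of `width`; report share above 50K and bin size."""
    # Each value maps to the upper edge of its bin; a value of exactly 0 stays in bin 0.
    upper_edge = np.ceil(df[column] / width) * width
    is_above = df[target].eq(positive)
    grouped = is_above.groupby(upper_edge)
    return pd.DataFrame({
        "pct_above_50k": grouped.mean() * 100,
        "n": grouped.size(),
    })

# Illustrative toy rows, not taken from the census data.
toy = pd.DataFrame({
    "capital-gain": [500, 1500, 1800, 3000, 3500, 15000],
    "class": ["<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"],
})
table = binned_share(toy, "capital-gain", width=2000)
print(table)
```

A bin with a very small `n` can only take a few distinct percentage values (with `n = 1`, only 0% or 100%), which is one way to see why points in sparse ranges jump around.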

Figure 4. Histogram of capital-gain.

Figure 5. Histogram of capital-loss.

To explore capital loss data, we chose a histogram to get an idea of whether it is correlated with salary. Figure 5 shows the histogram of capital-loss. The vast majority of data points fall in the first bin on the left side of the figure, meaning that most individuals report little or no capital loss. This suggests that capital-loss is unlikely to be strongly correlated with annual salary.
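A histogram of a single variable shows its distribution, so a direct comparison between the two salary groups can complement it. One possible check, sketched here on invented rows with the appendix column names, is the share of individuals in each salary group who report any capital loss.

```python
import pandas as pd

# Illustrative toy rows, not taken from the census data.
toy = pd.DataFrame({
    "capital-loss": [0, 0, 0, 1900, 0, 0, 1500, 0],
    "class": ["<=50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K", ">50K"],
})

# Share of individuals in each income class reporting a capital loss above zero.
has_loss = toy["capital-loss"].gt(0)
share_with_loss = has_loss.groupby(toy["class"]).mean() * 100
print(share_with_loss)
```

If the two shares are close on the real data, that would support leaving capital-loss out of the model; a large gap would suggest taking a closer look.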

Figure 6. Percentage of individuals with annual salary above $50,000 as a function of work hours-per-week.

To visualize hours per week data, we chose a scatter plot because it gives the reader a good idea of relationships in continuous data. Figure 6 shows the percentage of individuals with annual salary above $50,000 as a function of work hours-per-week. For clarity, we processed the data in the same way as for Figure 3. Each point in Figure 6 represents a group of people who work within a certain range of hours per week. For example, the leftmost point represents people who work between 20 and 25 hours per week.

We can see a very weak correlation between salary and hours-per-week. Generally, individuals who work fewer than 40 hours per week are less likely to earn more than $50,000, while people who work more than 40 hours a week are much more likely to make more than $50,000 annually.
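The strength of this relationship can be put into a single number by encoding the salary class as 0/1 and computing a Pearson correlation with hours-per-week (for a binary variable this is the point-biserial correlation). The sketch below uses invented rows, so the resulting value is illustrative only.

```python
import pandas as pd

# Illustrative toy rows, not taken from the census data.
toy = pd.DataFrame({
    "hours-per-week": [20, 35, 40, 40, 45, 50, 60, 38],
    "class": ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K", ">50K", "<=50K"],
})

# Encode the class as 0/1, then correlate it with the numeric column.
above = toy["class"].eq(">50K").astype(int)
r = toy["hours-per-week"].corr(above)
print(round(r, 3))
```

A value near 0 would indicate a weak linear relationship, and a value near 1 a strong positive one.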

Figure 7. Percentage of individuals with annual salary above $50,000 as a function of final weight.

We chose a scatter plot to show the salary data as a function of final weight because final weight is a continuous variable. Figure 7 shows the percentage of individuals with annual salary above $50,000 as a function of final weight. We processed the data in the same way as for Figure 6. Based on the scatter plot, there is no apparent correlation between salary and final weight.

Figure 8. Percentage of individuals in each work class with annual salary above $50,000 (top) and below

$50,000 (bottom).

To visualize work class data, we chose pie charts to see the different classes in each salary group, because a pie chart is good at showing the composition of a data set. Figure 8 shows the percentage of individuals in each work class with annual salary above $50,000 (top) and below $50,000 (bottom). It is clear that individuals who work in the private sector are less likely to make more than $50,000 a year, while self-employed individuals with an incorporated business are more likely to make more than $50,000 a year.
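A pie chart per salary group shows the composition of that group, while the likelihood of earning more than $50,000 within a work class is a row-wise share. The two can be told apart with a normalized cross-tabulation, sketched here on invented rows using the appendix column names.

```python
import pandas as pd

# Illustrative toy rows, not taken from the census data.
toy = pd.DataFrame({
    "workclass": ["Private", "Private", "Private", "Private", "Self-emp-inc", "Self-emp-inc"],
    "class": ["<=50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K"],
})

# Composition of each salary group (what one pie chart shows): each column sums to 1.
composition = pd.crosstab(toy["workclass"], toy["class"], normalize="columns")

# Share of each salary group within a work class: each row sums to 1.
within_class = pd.crosstab(toy["workclass"], toy["class"], normalize="index")

print(composition)
print(within_class)
```

In this toy table, Private makes up half of the >50K group, yet only a quarter of Private workers earn more than $50,000, which shows why the two views answer different questions.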

6. Questions

Question 1: There is a total of 14 parameters. Some of them are relevant to the annual salary of an

individual, while some of them are not. We need to determine which parameters to use for in-depth

analysis and machine learning.

Solution to question 1: We started with a thorough discussion during one of our team meetings at the beginning of this project. After the discussion, we decided to start with an exploratory analysis of each parameter. After the initial analysis, we picked the 4 to 8 parameters most relevant to an individual’s annual salary for the next step.

Question 2: Which machine learning method should we use for this study?

Solution to question 2: There are a total of 14 parameters, some numerical and the others categorical. The outcome is binary: either above or below $50,000. Logistic regression is well suited to this type of outcome. Therefore, we chose logistic regression and found how certain parameters are related to people’s annual salary.
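A logistic regression on a mix of numerical and categorical parameters can be sketched as below. This is not the team’s exact setup: the appendix maps categories to integer codes by hand, whereas this sketch swaps in one-hot encoding through a scikit-learn pipeline. The rows are invented for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Illustrative toy rows, not taken from the census data.
toy = pd.DataFrame({
    "education-num": [9, 10, 13, 14, 9, 7, 13, 16, 10, 12],
    "workclass": ["Private", "Private", "Self-emp-inc", "Federal-gov", "Private",
                  "Private", "Self-emp-inc", "Federal-gov", "Private", "Private"],
    "class": ["<=50K", "<=50K", ">50K", ">50K", "<=50K",
              "<=50K", ">50K", ">50K", "<=50K", ">50K"],
})
X = toy[["education-num", "workclass"]]
y = toy["class"].eq(">50K").astype(int)  # 1 = above 50K, 0 = otherwise

# Scale the numerical column and one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", MinMaxScaler(), ["education-num"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
])
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
model.fit(X, y)

# Predicted probability of the ">50K" class for each row.
proba = model.predict_proba(X)[:, 1]
print(proba.round(2))
```

One-hot encoding avoids implying an order among categories such as work class, which integer codes would otherwise introduce.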

7. Not doing

In the machine learning analysis, we chose a logistic regression model. This model is good enough to provide information about the relationship between the parameters we chose and an individual’s annual salary. In the future, more models can be included in the analysis, and several other metrics could be used to compare the models and their accuracy.
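A comparison of several models on several metrics might look like the sketch below. It uses synthetic data from `make_classification` as a stand-in for the encoded census features, and the choice of k-nearest neighbors as the second model is an assumption based on the import in the appendix.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data; the real analysis would use the encoded census features.
X, y = make_classification(n_samples=400, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then reuse it for the test split.
scaler = MinMaxScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

candidates = {
    "Logistic Regression": LogisticRegression(),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
}
scores = {}
for name, estimator in candidates.items():
    pred = estimator.fit(X_train_s, y_train).predict(X_test_s)
    scores[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
    }
    print(name, {k: round(v, 4) for k, v in scores[name].items()})
```

Reporting precision and recall next to accuracy is useful when the two salary classes are unevenly sized, since accuracy alone can hide poor performance on the smaller class.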

In this study, we didn’t include parameters such as age, education, relationship, race, sex and native

country. These parameters can be included in further studies in the future.

8. Appendix

Code:

import pandas as pd
import numpy as np
from collections import Counter
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
import warnings

# Jupyter/IPython magic: render figures inline in the notebook.
%matplotlib inline

# Load the census data. The file has no header row and fields are separated by ", ";
# a multi-character separator requires the Python parsing engine.
df = pd.read_csv("/content/adult.data", header=None, sep=", ", engine="python")
df.columns = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country", "class",
]

# Drop rows with a missing value ("?") in any categorical column.
df = df[df["workclass"] != "?"]
df = df[df["education"] != "?"]
df = df[df["marital-status"] != "?"]
df = df[df["occupation"] != "?"]
df = df[df["relationship"] != "?"]
df = df[df["race"] != "?"]
df = df[df["sex"] != "?"]
df = df[df["native-country"] != "?"]

# Split the rows by salary class.
below = df[df["class"] == "<=50K"]
above = df[df["class"] == ">50K"]

# Box plot of education-num for individuals with a salary <= 50K.
figg = plt.figure()
axx = figg.gca()
below.boxplot(column="education-num", ax=axx)
axx.set_title("Boxplot with a salary <= 50K for each education-num")
plt.show()

# Count individuals per work class in each salary group and draw one pie chart per group.
above_50k = Counter(above["workclass"])
below_50k = Counter(below["workclass"])
print("workclass")
fig, axes = plt.subplots(ncols=1, nrows=2, figsize=(5, 10))
# Convert the dict views to lists before passing them to matplotlib.
axes[0].pie(list(above_50k.values()), labels=list(above_50k.keys()), autopct="%1.0f%%")
axes[0].set_title(">50K")
axes[1].pie(list(below_50k.values()), labels=list(below_50k.keys()), autopct="%1.0f%%")
axes[1].set_title("<=50K")
plt.show()

# Mosaic plot of marital status against salary class.
fig, axes = plt.subplots(ncols=1, nrows=1, figsize=(15, 10))
fig.subplots_adjust(hspace=.5)
mosaic(df, ["marital-status", "class"], ax=axes, axes_label=False)
plt.show()

# Percentage of people with a salary > 50K for each occupation.
occupation_list = df.groupby("occupation")
occupations = occupation_list.groups.keys()
occupation_salary = []
for occupation in occupations:
    occupation_member = df[df["occupation"] == occupation]
    # The label column is named "class"; its values carry no leading space
    # because the file is read with sep=", ".
    above_total = sum(occupation_member["class"] == ">50K")
    below_total = sum(occupation_member["class"] == "<=50K")
    occupation_salary.append([occupation, 100 * above_total / (below_total + above_total)])
occupation_salary_df = pd.DataFrame(occupation_salary, columns=["occupation", "per of >50K"])
plt.barh(occupation_salary_df["occupation"], occupation_salary_df["per of >50K"])
plt.ylabel("Occupation")
plt.xlabel("Percentage of people with a salary > 50K (%)")
plt.title("Percentage of people with a salary > 50K for each occupation", fontdict={"fontsize": 20})
plt.show()

# Build the modeling table by dropping the columns not used by the model.
train = df
train = train.drop("capital-loss", axis=1)
train = train.drop("native-country", axis=1)
train = train.drop("fnlwgt", axis=1)
train = train.drop("education", axis=1)

# Map each categorical value to an integer code.
def get_occupation(x):
    if x in ["Exec-managerial", "Prof-specialty", "Protective-serv"]:
        return 1
    elif x in ["Sales", "Transport-moving", "Tech-support", "Craft-repair"]:
        return 2
    else:
        return 3

def get_relationship(x):
    if x == "Own-child":
        return 6
    elif x == "Other-relative":
        return 5
    elif x == "Unmarried":
        return 4
    elif x == "Not-in-family":
        return 3
    elif x == "Husband":
        return 2
    else:
        return 1

def get_race(x):
    if x == "Other":
        return 5
    elif x == "Amer-Indian-Eskimo":
        return 4
    elif x == "Black":
        return 3
    elif x == "White":
        return 2
    else:
        return 1

def get_sex(x):
    if x == "Male":
        return 2
    else:
        return 1

def get_class(x):
    if x == ">50K":
        return 1
    else:
        return 0

def get_workclass(x):
    if x == "Without-pay":
        return 7
    elif x == "Private":
        return 6
    elif x == "State-gov":
        return 5
    elif x == "Self-emp-not-inc":
        return 4
    elif x == "Local-gov":
        return 3
    elif x == "Federal-gov":
        return 2
    else:
        return 1

def get_marital_status(x):
    if x == "Never-married":
        return 7
    elif x == "Separated":
        return 6
    elif x == "Married-spouse-absent":
        return 5
    elif x == "Widowed":
        return 4
    elif x == "Divorced":
        return 3
    elif x == "Married-civ-spouse":
        return 2
    else:
        return 1

# Apply the integer encodings to the categorical columns.
train["workclass"] = train["workclass"].apply(get_workclass)
train["marital-status"] = train["marital-status"].apply(get_marital_status)
train["occupation"] = train["occupation"].apply(get_occupation)
train["relationship"] = train["relationship"].apply(get_relationship)
train["race"] = train["race"].apply(get_race)
train["sex"] = train["sex"].apply(get_sex)
train["class"] = train["class"].apply(get_class)

# Note: this frame is not used below; evaluation uses the held-out split.
test = pd.read_csv("/content/adult.data", header=None, sep=", ", engine="python")

# Features are all columns except the last; the label is the last column ("class").
feature = train.iloc[:, :-1]
labels = train.iloc[:, -1]
feature_matrix1 = feature.values
labels1 = labels.values
train_data, test_data, train_labels, test_labels = train_test_split(
    feature_matrix1, labels1, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only and reuse it on the test split,
# so both splits are scaled with the same minimum and maximum.
scaler = MinMaxScaler().fit(train_data)
transformed_train_data = scaler.transform(train_data)
transformed_test_data = scaler.transform(test_data)

mod = LogisticRegression().fit(transformed_train_data, train_labels)
test_predict = mod.predict(transformed_test_data)
acc = accuracy_score(test_labels, test_predict)
f1 = f1_score(test_labels, test_predict)
prec = precision_score(test_labels, test_predict)
rec = recall_score(test_labels, test_predict)
print("%.4f\t%.4f\t%.4f\t%.4f\t%s" % (acc, f1, prec, rec, "Logistic Regression"))

# Bin capital-gain into ranges of 2,000 (each value maps to the upper edge of its bin).
factorC = 2000
df["capitalGainBin"] = df["capital-gain"] / factorC
df["capitalGainBin"] = df["capitalGainBin"].apply(np.ceil)
df["capitalGainBin"] = df["capitalGainBin"] * factorC
capitalGainBin_list = df.groupby("capitalGainBin")
capitalGainBins = capitalGainBin_list.groups.keys()
capitalGainBin_salary = []
for capitalGainBin in capitalGainBins:
    capitalGainBin_member = df[df["capitalGainBin"] == capitalGainBin]
    # The label column is named "class"; its values carry no leading space.
    above_total = sum(capitalGainBin_member["class"] == ">50K")
    below_total = sum(capitalGainBin_member["class"] == "<=50K")
    capitalGainBin_salary.append([capitalGainBin, 100 * above_total / (below_total + above_total)])
capitalGainBin_salary_df = pd.DataFrame(capitalGainBin_salary, columns=["capital-gain", "per of >50K"])
plt.scatter(capitalGainBin_salary_df["capital-gain"], capitalGainBin_salary_df["per of >50K"])
plt.xlabel("Capital-gain")
plt.ylabel("Percentage of people with a salary > 50K (%)")
plt.title("Percentage of people with a salary > 50K for different capital gains", fontdict={"fontsize": 20})
plt.show()

# Histogram of capital-loss.
plt.hist(df["capital-loss"])
plt.xlabel("Capital-loss")
plt.ylabel("Count")
plt.title("Distribution of capital-loss", fontdict={"fontsize": 20})
plt.show()

# Bin final weight into ranges of 100,000 (each value maps to the upper edge of its bin).
factorA = 100000
df["wgtBin"] = df["fnlwgt"] / factorA
df["wgtBin"] = df["wgtBin"].apply(np.ceil)
df["wgtBin"] = df["wgtBin"] * factorA
plt.hist(df["wgtBin"])
plt.xlabel("Final weight")
plt.ylabel("Count")
plt.title("Distribution of final weight", fontdict={"fontsize": 20})
plt.show()

wgtBin_list = df.groupby("wgtBin")
wgtBins = wgtBin_list.groups.keys()
wgtBins_salary = []
for wgtBin in wgtBins:
    wgtBin_member = df[df["wgtBin"] == wgtBin]
    # The label column is named "class"; its values carry no leading space.
    above_total = sum(wgtBin_member["class"] == ">50K")
    below_total = sum(wgtBin_member["class"] == "<=50K")
    wgtBins_salary.append([wgtBin, 100 * above_total / (below_total + above_total)])
wgtBins_salary_df = pd.DataFrame(wgtBins_salary, columns=["fnlwgt", "per of >50K"])
plt.scatter(wgtBins_salary_df["fnlwgt"], wgtBins_salary_df["per of >50K"])
plt.xlabel("Final weight")
plt.ylabel("Percentage of people with a salary > 50K (%)")
plt.title("Percentage of people with a salary > 50K for different final weight", fontdict={"fontsize": 20})
plt.show()

# Histogram of hours-per-week.
plt.hist(df["hours-per-week"])
plt.xlabel("Hours-per-week")
plt.ylabel("Count")
plt.title("Distribution of hours-per-week", fontdict={"fontsize": 20})
plt.show()

# Bin hours-per-week into ranges of 10 (each value maps to the upper edge of its bin).
factorB = 10
df["hours_per_weekBin"] = df["hours-per-week"] / factorB
df["hours_per_weekBin"] = df["hours_per_weekBin"].apply(np.ceil)
df["hours_per_weekBin"] = df["hours_per_weekBin"] * factorB
hours_per_weekBin_list = df.groupby("hours_per_weekBin")
hours_per_weekBins = hours_per_weekBin_list.groups.keys()
hours_per_weekBin_salary = []
for hours_per_weekBin in hours_per_weekBins:
    hours_per_weekBin_member = df[df["hours_per_weekBin"] == hours_per_weekBin]
    # The label column is named "class"; its values carry no leading space.
    above_total = sum(hours_per_weekBin_member["class"] == ">50K")
    below_total = sum(hours_per_weekBin_member["class"] == "<=50K")
    hours_per_weekBin_salary.append([hours_per_weekBin, 100 * above_total / (below_total + above_total)])
hours_per_weekBin_salary_df = pd.DataFrame(hours_per_weekBin_salary, columns=["hours-per-week", "per of >50K"])
plt.scatter(hours_per_weekBin_salary_df["hours-per-week"], hours_per_weekBin_salary_df["per of >50K"])
plt.xlabel("hours-per-week")
plt.ylabel("Percentage of people with a salary > 50K (%)")
plt.title("Percentage of people with a salary > 50K for different hours-per-week", fontdict={"fontsize": 20})
plt.show()