Vaishnavii Paramashivam

710 S Lytle Street · Chicago, IL 60607 · (312)-934-8621 · vaishnavii.pp@gmail.com

I am a Business Analytics graduate student at the University of Illinois - Chicago, graduating in May 2021, and am seeking full-time Data Scientist, Data Analyst, and related positions.

Currently, I work as a Data Science Intern at Cloudbakers, Chicago, and as a Data Scientist at the UIC College of Dentistry, Chicago. Prior to joining UIC, I worked as a Data Analyst at Verizon and Robert Bosch for 3+ years. Over the course of my professional career, I have developed strong and varied expertise in SQL, data warehousing, ETL pipelining, machine learning, predictive analytics, business workflow development, reporting, consulting, and Agile software development across the IT and automotive domains.

I am passionate about empowering businesses with real-time insights in the emerging machine learning space and supporting analytical decisions by implementing cost- and time-efficient solutions. My 4+ years of professional experience in the data field have made me proficient in Python, R, SQL, Power BI, Tableau, statistics, machine learning, deep learning, predictive analytics, big data, cloud computing, and visualization.



Education

University of Illinois - Chicago

Master of Science in Business Analytics

    Courses:
  • Data Mining
  • Statistical Methods and Models for Business Analytics (Machine Learning)
  • Advanced Database Management Systems
  • Business Data Visualization
  • Statistics for Management
  • Time Series and Forecasting
  • Analytics for Big Data
  • Machine Learning Deployment
  • Analytics Strategy & Practice
  • Healthcare Analytics
  • Operations Management
  • Supply Chain Management

August 2019 - May 2021 (Expected)

Anna University

Bachelor of Engineering in Computer Science and Engineering

    Courses:
  • Data Structures & Algorithms
  • Database Management Systems
  • Engineering Mathematics
  • Computer Networks
  • Internet Programming
  • Artificial Intelligence
  • Software Engineering
  • Optimization Techniques

August 2012 - May 2016

Professional Experience

Data Science Intern

Cloudbakers, Chicago

    Tools Used: LSTM, Azure Data Factory, Azure Databricks, Google BigQuery

  • Developing a data science application that predicts real-time health indicators for Git repositories and sends periodic alerts.
  • Designing an automated data pipeline using Azure Data Factory that extracts large training datasets from Google BigQuery, pre-processes the data, and trains an LSTM predictive model on continuous time-series data using Azure Databricks.
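As an illustration of the kind of pre-processing such a pipeline performs before training (the actual Cloudbakers code is not shown here, and the sample signal is purely hypothetical), a continuous time series has to be sliced into supervised windows before an LSTM can learn from it — a minimal numpy sketch:

```python
import numpy as np

def make_windows(series, window=24, horizon=1):
    """Slice a continuous time series into (input window, target) pairs,
    the supervised format an LSTM trains on."""
    X, y = [], []
    for i in range(len(series) - window - horizon + 1):
        X.append(series[i:i + window])
        y.append(series[i + window + horizon - 1])
    # LSTMs expect input of shape (samples, timesteps, features)
    return np.asarray(X)[..., np.newaxis], np.asarray(y)

# Hypothetical hourly repository metric (e.g. commit counts), smoothed
signal = np.sin(np.linspace(0, 20, 500))
X, y = make_windows(signal, window=24)
print(X.shape, y.shape)  # (476, 24, 1) (476,)
```

The resulting arrays can be fed directly to a Keras LSTM layer, whose input shape convention matches the `(samples, timesteps, features)` layout produced here.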

Jan 2021 - Present

Data Scientist

University of Illinois College of Dentistry, Chicago

    Tools Used: Python, NLP, Tableau, Time Series Forecasting

  • Automated auditing of support tickets by extracting and analyzing large volumes of unstructured data with NLP, topic modelling, and text mining in Python, increasing operational efficiency by 34% and reducing inventory costs by 28%.
  • Developed a time series forecasting model in R (RMSE 1.302) to forecast weekly, monthly, quarterly, and yearly demand for network drive usage, reducing server operating costs by 12%.
  • Designing & implementing data models, interactive dashboards, and visualizations for ad-hoc reporting in Tableau.

June 2020 - Present

Data Analyst

Robert Bosch Engineering and Business Solutions, India

    Tools Used: SQL, ETL Pipelines, SSIS, Python, Machine Learning, Statistical Analysis, Data Visualizations

  • Led a cross-functional team of 3 to enhance data retrieval, manipulation & storage; aided in training and mentoring new hires.
  • Collaborated with clients, gathered requirements, and developed workflows and ETL pipelines using SSIS, importing data from Oracle and Teradata
  • Managed existing data marts and fact & dimension tables; developed queries to fetch results based on user interaction with an in-house application
  • Delivered multiple reports in Power BI to internal stakeholders and senior management by conducting data mining, data wrangling & exploratory data analysis using Python; helped identify process compliance risks and track ongoing defects, saving ~$3M in revenue for RBEI
  • Project Management: Conducted Scrum meetings, used Agile SDLC methods, and performed code reviews, delivery planning & estimation.

Nov 2017 - June 2019

Data Analyst

Verizon Data Services, India

    Tools Used: SQL, PL/SQL, XML, Clarity PPM, Gel Scripting, Java, Jenkins, Power BI, Jaspersoft, JIRA

  • Created process workflows for internal business customers using PL/SQL code, complex SQL queries involving triggers, functions, joins & procedures, and integrated them with the Clarity project management tool using XML & gel script
  • Improved the efficiency of large-scale business processes in Clarity by 21% by optimizing SQL queries to enable faster retrieval of records
  • Designed periodic dashboards, ad-hoc custom reports using SQL, Excel & Power BI to monitor KPIs saving 45 man-hours per week
  • Created an automated CI/CD pipeline tool ‘DevOps Suite’ using Java & Jenkins, which helped in reducing overall lead time by 35%.

July 2016 - Nov 2017

Volunteer Experience

Ed-Support Volunteer

Make A Difference, India



  • Devised and implemented Maths and Science lesson plans for Class 10 students in shelter homes
  • Mentored children aged 10-13 to help develop foundational skills such as functional numeracy, literacy, and self-confidence
  • Raised INR 3,00,000 for women's and children's education in India
  • Organized rallies and child-awareness events

Nov 2015 - Dec 2016

Skills

Programming/Scripting

  • Python
  • R
  • SQL
  • C
  • Java
  • HTML
  • CSS
  • JavaScript
  • PHP

Data Visualization

  • Tableau
  • Power BI
  • Jaspersoft

Machine Learning

Supervised Techniques

  • Regression - Linear/Logistic
  • Decision Trees
  • Bagging - Random Forest
  • Boosting - Gradient Boosting/AdaBoost
  • Naive Bayes
  • K - Nearest Neighbours
  • Support Vector Machines (SVM)
  • Neural Networks

Unsupervised Techniques

  • Clustering - K-Means, Hierarchical
  • Association Rules
  • Principal Component Analysis

Statistical Techniques

  • Hypothesis Testing
  • ANOVA
  • Chi Square
  • t - test
  • Descriptive Statistics
  • Exploratory Data Analysis
  • Confidence Intervals

Big Data

  • Hadoop
  • HIVE
  • MapReduce
  • Spark
  • Pig
  • MongoDB

Database Tools

  • MS - Access
  • Microsoft SQL Server Management Studio
  • Oracle
  • MySQL Workbench

Software Testing

  • Selenium
  • A/B Testing
  • Regression Testing
  • JUnit Testing

Other Tools

  • AWS
  • Git
  • Clarity
  • Mathworks Polyspace
  • Jenkins CI/CD
  • Docker
  • JIRA
  • MS - Excel

Packages

  • R - dplyr, tidyverse, tidymodels, glmnet, caret, randomForest, ggplot2, survival
  • Python - numpy, pandas, geopandas, pyplot, pymysql, sklearn, seaborn, tensorflow, Keras, MLlib, SparkML, PyTorch

Projects

ETL Process – Data Analysis on Live Music Events in UK

    Technology/Concepts Used: Python, PostgreSQL, Tableau, PySpark, Data Analysis, Web Scraping, Selenium

    Extraction
  • Roughly 1M records of artist information, stream counts, and event and venue details were extracted from Spotify, MusicBrainz, and Songkick
    Transformation
  • The collected data was cleaned using the NLTK package, then aggregated and transformed into the desired data model
    Load
  • The CSV files were loaded into the stagingDB in PostgreSQL to draw actionable insights
    Analysis
  • Visually analyzed the final loaded data in Tableau to derive inferences on artists' popularity based on event popularity, ideal venues to host, and date- and hour-wise event and stream counts.

Retrieval Based Customer Interaction Chatbot in Python using Tensorflow, Keras, Pickle

    Technology/Concepts Used: Recurrent Neural Network(LSTM), Natural Language Processing, Stochastic Gradient Descent, Tensorflow, Keras, Pickle, Tkinter, Python

    Dataset:
  • The data file is a JSON file containing pre-defined patterns to be matched and responses to be returned
    Data Transformation:
  • The text data was tokenized and lemmatized using the NLTK package
    Input & Output:
  • The input to the bot was the pattern and the output was the class to which the input pattern belonged; the text was converted to numbers for machine readability
    Data Model:
  • Developed a deep neural network with 3 layers using the Keras Sequential API; accuracy improved to 100% on training up to 200 epochs
    Chat Bot GUI:
  • The trained model received input from the user and predicted the bot's response in a Graphical User Interface implemented using the Tkinter library
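The pattern-to-class step of a retrieval-based bot can be sketched without the full Keras stack. The following toy classifier is purely illustrative (the `intents` dictionary and helper names are hypothetical, not from the original project); it shows how patterns become numeric bag-of-words vectors that map to an intent class:

```python
# Minimal bag-of-words intent matching: patterns become 0/1 vectors,
# and the input is assigned the intent with the greatest word overlap.
intents = {                      # hypothetical stand-in for the JSON file
    "greeting": ["hi there", "hello friend"],
    "goodbye":  ["bye now", "see you later"],
}

vocab = sorted({w for pats in intents.values() for p in pats for w in p.split()})

def to_bow(text):
    """Convert text to a 0/1 bag-of-words vector over the vocabulary."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocab]

def classify(text):
    """Return the intent whose patterns share the most words with the input."""
    bow = to_bow(text)
    def overlap(tag):
        return max(sum(a & b for a, b in zip(bow, to_bow(p)))
                   for p in intents[tag])
    return max(intents, key=overlap)

print(classify("hello there"))  # greeting
```

In the real project this hard match is replaced by a trained neural network, which generalizes beyond exact word overlap.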

Identification of Medical Misinformation on Cannabis Use in Twitter using Python

    Technology/Concepts Used: Python, R, Tableau, NLTK, Sentiment Analysis, Google AutoML

    Data Extraction:
  • Tweets related to cannabis were collected using the ‘GetOldTweets3’ API based on a keyword-search method
    Data Transformation:
  • The collected tweets were cleaned & transformed using the NLTK package
  • Tweets making medical claims were filtered using a bag-of-words approach
    Data Model:
  • Developed an application module using Google AutoML that classified the medical claims as Proven Facts or Medical Misinformation
  • Built a classifier to identify Twitter bots, performed a detailed study of how these bots make false medical claims, promotional claims, etc., and created graphs visualizing these factors in Tableau
    Data Analysis:
  • 99% of the collected tweets with medical claims were identified as false claims about cannabis use
  • The Twitter bots formed networks and made unsubstantiated health claims in an attempt to advertise cannabis products

Real-time accident risk prediction on traffic and studying accident hot-spot locations in the US using Python and Tableau

    Technology/Concepts Used: Python, R, Tableau, Statistical Analysis, Decision Trees, KNN, Multinomial Logit Regression, SVM

    Data Collection:
  • Real-time traffic data was collected from traffic APIs & data providers streaming live traffic
    Data Model:
  • Built an “Accident Risk Prediction” model using machine learning algorithms such as Decision Trees, KNN, Multinomial Logit Regression, and SVM to predict the severity of traffic accidents in the US; SVM outperformed the baseline KNN model, improving the F1-score from 67% to 80% at 100% precision
    Data Analysis:
  • Identified real-time hot-spot locations of accidents & the key factors contributing to the most severe accidents, and visualized these factors in Tableau

Text Mining of Online Physician Reviews to determine crucial aspects of Consultation

    Technology/Concepts Used: Topic Modelling, Natural Language Processing, Text Analytics, Sentiment Analysis, Google AutoML, Web scraping, Statistical Analysis, Python, R, Tableau, Selenium

    Data Extraction:
  • Using Selenium web drivers and Beautiful Soup, reviews for all obstetrics and gynaecology physicians in Illinois were extracted
    Data Transformation:
  • The Natural Language Toolkit (NLTK) was used to split the reviews into phrases for each doctor; sentiment analysis was performed on each phrase to identify its sentiment
  • Using a semi-supervised Latent Dirichlet Allocation approach and a bag-of-words approach, the topic of each phrase was identified
  • Built a series of machine learning models to identify and determine the crucial aspects of consultation based on user reviews on websites like Healthgrades and RateMDs
    Data Analysis:
  • The extracted data was transformed using the above approaches into a dataset used for statistical analysis
  • A Linear Regression model was developed with physician demographics and various review characteristics against the overall ratings to determine the key predictors of a physician's star rating
  • Among Medical Expertise, Bedside Manners, Consultation Cost, Office Environment, Office Staff, Ease of Scheduling, and Waiting Time, Bedside Manners contributed most to physicians' overall ratings

Recommendation System for Amazon Fine Foods using PySpark, Python

    Technology/Concepts Used: Big Data, Collaborative Filtering, Matrix Factorization, Machine Learning, Python, PySpark, Tableau

    Data:
  • The dataset consisted of about 500k user reviews of Amazon Fine Foods spanning 10 years
    Data Model:
  • Big data algorithms such as distance-based user collaborative filtering and Alternating Least Squares (ALS) matrix factorization were used to identify similar products based on users' review ratings
  • For users with no prior history of review ratings, a popularity-based recommendation system was used
    Model Evaluation:
  • Root Mean Square Error was used as the evaluation metric; the Alternating Least Squares matrix factorization method performed better, at an RMSE of 1.204
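The ALS idea can be sketched in a few lines of numpy — a toy version with illustrative dimensions and random data (the real project used Spark's implementation), alternately solving a ridge-regularized least-squares problem for the user factors and then the item factors:

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.integers(1, 6, size=(20, 15)).astype(float)  # toy user-item ratings
k, lam = 4, 0.1                                      # latent dims, ridge term
U = rng.normal(size=(R.shape[0], k))                 # user factors
V = rng.normal(size=(R.shape[1], k))                 # item factors

def solve(F, R, lam):
    """With factor matrix F fixed, solve the ridge-regularized
    least-squares problem for the other factor matrix."""
    A = F.T @ F + lam * np.eye(F.shape[1])
    return np.linalg.solve(A, F.T @ R).T

for _ in range(20):            # alternate: fix V, solve U; fix U, solve V
    U = solve(V, R.T, lam)
    V = solve(U, R, lam)

rmse = np.sqrt(np.mean((R - U @ V.T) ** 2))
print(round(rmse, 3))
```

Predicted ratings are the entries of `U @ V.T`; recommendations come from the highest-scoring unrated items per user.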

End-to-end ETL Project to determine correlation between swear words used & movie rating using Python

    Technology/Concepts Used: Python, MySql, Tableau, NLTK, Beautiful Soup

    Extraction
  • The movie script, movie rating, metascore, etc. were collected from the IMSDB website and the OMDB API
    Transformation
  • The collected text was cleaned using the NLTK package, then aggregated, filtered, and imported into a MySQL database
    Load
  • The data from the MySQL database was loaded into Tableau Desktop to draw actionable insights
    Analysis
  • The analysis showed that movies with the most swear words had decent IMDB ratings (6-8); however, the average number of bad words audiences tolerated ranged between 50-100. On average, Indian movies had the lowest bad-word counts and the highest IMDB ratings

Evaluating the impact of ‘COVID-19’ worldwide and in the US using PowerBI

  • Dataset: The worldwide COVID-19 dataset was taken from the WHO, and the US COVID-19 dataset was aggregated from CDC data
  • Created visual dashboards in Power BI providing a visual overview of the spread of COVID-19 worldwide and in the US
  • Evaluated the trend of increase in total confirmed cases and total deaths by country, by region, and in the states and counties of the US

Evaluating the on-time performance of ‘Envoy Air’ using Tableau

  • Dataset: The dataset was extracted from the US Bureau of Transportation Statistics and contains over 1.3 million records of real-time delay/on-time performance for different airlines
  • Created interactive visual dashboards and stories in Tableau providing a visual overview of Envoy Air’s key performance features
  • Evaluated the on-time performance of ‘Envoy Air’ against every other commercial airline operating in the US using the real-time airline data published by the Bureau of Transportation Statistics (BTS)
  • Suggested profitable routes for ‘Envoy Air’ using network graphs and statistical analysis

Prognosis of Prostate Cancer by identifying Survival Rate using Survival Analysis and Cox Regression

    Technology/Concepts Used: Survival Analysis, Cox Regression, Logistic Regression, Chi-Square test, t-test, Python, R, Microsoft Excel

  • The objective of the project was to determine which patients would survive 7 years from the time of their initial Prostate Cancer diagnosis, considering various other factors
  • The dataset contained records of 6000 patients diagnosed with Prostate Cancer; it was cleaned, and Chi-Square tests, t-tests, and correlation tests were performed to gain initial insights into the independent and dependent variables
  • The key parameters that strongly determine 7-year survival of patients diagnosed with Prostate Cancer were identified using Logistic Regression
  • Created a “Survival Rate Prediction” model using Survival Analysis, identified the risk factors causing Prostate Cancer, and investigated in detail the effect of these risk factors on the survival rate using Cox Proportional-Hazards Regression.
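The survival-curve idea underlying this kind of model can be illustrated with a minimal Kaplan-Meier estimator in plain numpy (toy data, one observation per time point; the actual project used dedicated survival-analysis packages):

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier survival curve: times are follow-up durations,
    events are 1 if the event (death) occurred, 0 if censored."""
    order = np.argsort(times)
    times, events = times[order], events[order]
    at_risk = len(times)
    surv, curve = 1.0, []
    for t, e in zip(times, events):
        if e:                       # event observed at time t
            surv *= (at_risk - 1) / at_risk
        curve.append((t, surv))     # censored cases only shrink the risk set
        at_risk -= 1
    return curve

# Toy data: years until death (event=1) or loss to follow-up (event=0)
times = np.array([2.0, 3.5, 4.0, 5.0, 7.5, 8.0])
events = np.array([1,   0,   1,   1,   0,   1])
curve = kaplan_meier(times, events)
print(curve)
```

Cox regression then models how covariates (age, stage, etc.) scale the hazard underlying this curve.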

Screening tool to identify patients at risk for Chronic Kidney Disease using Python, R

    Technology/Concepts Used: Logistic Regression, Python, R, Microsoft Excel

  • The objective was to use statistical analysis to identify the key parameters of Chronic Kidney Disease so that the model could be used as a screening tool to identify a patient's risk of CKD
  • A dataset with patient demographics and clinical parameters for around 8000 patients was cleaned to address null values and outliers
  • The issue of class imbalance was addressed by a stochastic sampling method
  • Logistic Regression was then used to identify key variables that contribute to the development of Chronic Kidney Disease
  • Since recall of the model was more important than accuracy and precision, the receiver operating characteristic (ROC) curve was used to pick the right probability threshold
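Recall-driven threshold selection can be sketched as follows — a toy example (the scores and labels are illustrative, not project data) that scans candidate cutoffs and keeps the highest one meeting a target recall, since false negatives are the costly error in screening:

```python
import numpy as np

# Hypothetical predicted probabilities and true CKD labels
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.2, 0.1])
labels = np.array([1,   1,   0,   1,   1,   0,    0,   0,   0])

def pick_threshold(scores, labels, min_recall=0.95):
    """Scan candidate thresholds from high to low and return the first
    (highest) one whose recall meets the target."""
    positives = labels.sum()
    for t in sorted(scores, reverse=True):
        pred = scores >= t
        recall = (pred & (labels == 1)).sum() / positives
        if recall >= min_recall:
            return t
    return scores.min()

t = pick_threshold(scores, labels, min_recall=1.0)
print(t)  # 0.4: the highest cutoff that still catches every positive
```

Each candidate threshold corresponds to one point on the ROC curve; requiring a minimum recall fixes the operating point on that curve.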

"Propensity to Respond" Tool to Predict the Likelihood of Customer Behavior at VMWare using R

    Technology/Concepts Used: Machine Learning, Synthetic Minority Oversampling Technique (SMOTE), Python, R, Microsoft Excel

  • Dataset – The target class of the dataset was imbalanced, and the Synthetic Minority Oversampling Technique (SMOTE) was used to balance it
  • Data Transformation – The L1-regularized Logistic Regression (LASSO) algorithm was used for variable reduction, reducing the number of variables from 707 to 90
  • Data Model – Machine learning models such as Random Forest, Gradient Boosting (XGBoost), Ridge Regression, and model stacking were used to develop the “Propensity to Respond” model predicting the likelihood of consumer behavior at VMware
  • Model Evaluation – The Micro-Average F-Score was used as the evaluation metric, on which XGBoost scored 100%; XGBoost was therefore used to create the “Propensity to Respond” model
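SMOTE's core idea — generating synthetic minority samples by interpolating between a minority point and one of its nearest minority-class neighbours — can be sketched in plain numpy (a toy illustration with made-up 2-D data, not the production implementation, which used a library):

```python
import numpy as np

def smote(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by interpolating between
    a randomly chosen sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from sample i to all minority samples
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]       # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()                        # interpolation factor in [0, 1)
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(out)

X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.2]])
synthetic = smote(X_min, n_new=6)
print(synthetic.shape)  # (6, 2)
```

Because each synthetic point lies on a segment between two real minority points, the oversampled class fills in its own region rather than duplicating rows.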


Awards & Certifications

  • Azure Data Engineer - DP 200
  • Tableau Certified Desktop Specialist
  • Tableau Data Scientist issued by Tableau
  • Certified AWS Associate Cloud Developer - Udemy
  • Gold Badge on SQL Programming - Hackerrank
  • Certification in Machine Learning - Coursera
  • ISTQB Certified Agile Tester Certification
  • ISTQB Certified Foundation Level Tester Certification
  • 'Best Performing Employee' by Verizon Data Services for the fourth quarter of 2016.
  • 'Distinction Award' by Anna University.
  • Certified German Language Professional.

Interests

  • Machine Learning
  • Deep Learning
  • Predictive Analytics
  • Natural Language Processing
  • ETL Frameworks
  • Data Visualization

Apart from being a data woman, I enjoy spending my free time experimenting with food: trying different cuisines, baking, and cooking.

I also follow a number of fantasy and medieval drama movies and television shows. I am an avid animal lover and have volunteered at several animal shelters, and I spend a large amount of my free time exploring the latest technology advancements.