Vaishnavii Paramashivam

710 S Lytle Street · Chicago, IL 60607 · (312)-934-8621 · vaishnavii.pp@gmail.com

I am a Business Analytics graduate student at the University of Illinois - Chicago, graduating in May 2021, and am seeking full-time Data Scientist, Data Analyst, and related positions.

Currently, I work as a Data Science Intern at Cloudbakers, Chicago, and as a Data Scientist at the UIC College of Dentistry, Chicago. Prior to joining UIC, I worked as a Data Analyst at Verizon and Robert Bosch for 3+ years. Over the course of my professional career, I have developed strong and varied expertise in SQL, data warehousing, ETL pipelining, machine learning, predictive analytics, business workflow development, reporting, consulting, and Agile software development across the IT and automotive domains.

I am passionate about empowering businesses with real-time insights in the emerging machine learning space and supporting analytical decisions by implementing cost- and time-efficient solutions. My 4+ years of professional experience in the data field have made me proficient in Python, R, SQL, Power BI, Tableau, statistics, machine learning, deep learning, predictive analytics, big data, cloud computing, and visualization.



Education

University of Illinois - Chicago

Master of Science in Business Analytics

    Courses:
  • Data Mining
  • Statistical Methods and Models for Business Analytics (Machine Learning)
  • Advanced Database Management Systems
  • Business Data Visualization
  • Statistics for Management
  • Time Series and Forecasting
  • Analytics for Big Data
  • Machine Learning Deployment
  • Analytics Strategy & Practice
  • Healthcare Analytics
  • Operations Management
  • Supply Chain Management

August 2019 - May 2021 (Expected)

Anna University

Bachelor of Engineering in Computer Science and Engineering

    Courses:
  • Data Structures & Algorithms
  • Database Management Systems
  • Engineering Mathematics
  • Computer Networks
  • Internet Programming
  • Artificial Intelligence
  • Software Engineering
  • Optimization Techniques

August 2012 - May 2016

Professional Experience

Data Science Intern

Cloudbakers, Chicago

    Tools Used: LSTM, Azure Data Factory, Azure Databricks, Google BigQuery

  • Developing a data science application that predicts real-time health indicators for Git repositories and sends periodic alerts.
  • Designing an automated data pipeline using Azure Data Factory that extracts large training datasets from Google BigQuery, pre-processes the data, and trains an LSTM predictive model on continuous time-series data using Azure Databricks.
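As an illustration of the kind of pre-processing such a pipeline performs before training (the actual Cloudbakers code is not shown here, and the sample signal is purely hypothetical), a continuous time series has to be sliced into supervised windows before an LSTM can learn from it — a minimal numpy sketch:

```python
import numpy as np

def make_windows(series, window=24, horizon=1):
    """Slice a continuous time series into (input window, target) pairs,
    the supervised format an LSTM trains on."""
    X, y = [], []
    for i in range(len(series) - window - horizon + 1):
        X.append(series[i:i + window])
        y.append(series[i + window + horizon - 1])
    # LSTMs expect input of shape (samples, timesteps, features)
    return np.asarray(X)[..., np.newaxis], np.asarray(y)

# Hypothetical hourly repository metric (e.g. commit counts), smoothed
signal = np.sin(np.linspace(0, 20, 500))
X, y = make_windows(signal, window=24)
print(X.shape, y.shape)  # (476, 24, 1) (476,)
```

The resulting arrays can be fed directly to a Keras LSTM layer, whose input shape convention matches the `(samples, timesteps, features)` layout produced here.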

Jan 2021 - Present

Data Scientist

University of Illinois College of Dentistry, Chicago

    Tools Used: Python, NLP, Tableau, Time Series Forecasting

  • Automated auditing of support tickets by extracting and analyzing large volumes of unstructured data with NLP, topic modelling, and text mining in Python, increasing operational efficiency by 34% and reducing inventory costs by 28%.
  • Developed a time series forecasting model in R (RMSE 1.302) to forecast weekly, monthly, quarterly, and yearly demand for network drive usage, reducing server operating costs by 12%.
  • Designing & implementing data models, interactive dashboards, and visualizations for ad-hoc reporting in Tableau.

June 2020 - Present

Data Analyst

Robert Bosch Engineering and Business Solutions, India

    Tools Used: SQL, ETL Pipelines, SSIS, Python, Machine Learning, Statistical Analysis, Data Visualizations

  • Led a cross-functional team of 3 to enhance data retrieval, manipulation & storage; aided in training and mentoring new hires.
  • Collaborated with clients, gathered requirements, and developed workflows and ETL pipelines using SSIS, importing data from Oracle and Teradata
  • Managed existing data marts and fact & dimension tables; developed queries to fetch results based on user interaction with an in-house application
  • Delivered multiple reports in Power BI to internal stakeholders and senior management by conducting data mining, data wrangling & exploratory data analysis using Python; helped identify process compliance risks and track ongoing defects, saving ~$3M in revenue for RBEI
  • Project Management: Conducted Scrum meetings, used Agile SDLC methods, and performed code reviews, delivery planning & estimation.

Nov 2017 - June 2019

Data Analyst

Verizon Data Services, India

    Tools Used: SQL, PL/SQL, XML, Clarity PPM, Gel Scripting, Java, Jenkins, Power BI, Jaspersoft, JIRA

  • Created process workflows for internal business customers using PL/SQL code, complex SQL queries involving triggers, functions, joins & procedures, and integrated them with the Clarity project management tool using XML & gel script
  • Improved the efficiency of large-scale business processes in Clarity by 21% by optimizing SQL queries to enable faster retrieval of records
  • Designed periodic dashboards, ad-hoc custom reports using SQL, Excel & Power BI to monitor KPIs saving 45 man-hours per week
  • Created an automated CI/CD pipeline tool ‘DevOps Suite’ using Java & Jenkins, which helped in reducing overall lead time by 35%.

July 2016 - Nov 2017

Volunteer Experience

Ed-Support Volunteer

Make A Difference, India



  • Devised and implemented Maths and Science lesson plans for Class 10 students in shelter homes
  • Mentored children aged 10-13 to help develop foundational skills such as functional numeracy, literacy, and self-confidence
  • Raised INR 3,00,000 for women's and children's education in India
  • Organized rallies and child-awareness events

Nov 2015 - Dec 2016

Skills

Programming/Scripting

  • Python
  • R
  • SQL
  • C
  • Java
  • HTML
  • CSS
  • JavaScript
  • PHP

Data Visualization

  • Tableau
  • Power BI
  • Jaspersoft

Machine Learning

Supervised Techniques

  • Regression - Linear/Logistic
  • Decision Trees
  • Bagging - Random Forest
  • Boosting - Gradient Boosting/AdaBoost
  • Naive Bayes
  • K - Nearest Neighbours
  • Support Vector Machines (SVM)
  • Neural Networks

Unsupervised Techniques

  • Clustering - K-Means, Hierarchical
  • Association Rules
  • Principal Component Analysis

Statistical Techniques

  • Hypothesis Testing
  • ANOVA
  • Chi Square
  • t - test
  • Descriptive Statistics
  • Exploratory Data Analysis
  • Confidence Intervals

Big Data

  • Hadoop
  • HIVE
  • MapReduce
  • Spark
  • Pig
  • MongoDB

Database Tools

  • MS - Access
  • Microsoft SQL Server Management Studio
  • Oracle
  • MySQL Workbench

Software Testing

  • Selenium
  • A/B Testing
  • Regression Testing
  • JUnit Testing

Other Tools

  • AWS
  • Git
  • Clarity
  • Mathworks Polyspace
  • Jenkins CI/CD
  • Docker
  • JIRA
  • MS - Excel

Packages

  • R - dplyr, tidyverse, tidymodels, glmnet, caret, randomForest, ggplot2, survival
  • Python - numpy, pandas, geopandas, pyplot, pymysql, sklearn, seaborn, tensorflow, Keras, MLlib, SparkML, PyTorch

Projects

ETL Process – Data Analysis on Live Music Events in UK

    Technology/Concepts Used: Python, PostgreSQL, Tableau, PySpark, Data Analysis, Web Scraping, Selenium

    Extraction
  • Roughly 1M records of artist information, stream counts, and event and venue details were extracted from Spotify, MusicBrainz, and Songkick
    Transformation
  • The collected data was cleaned using the NLTK package, then aggregated and transformed into the desired data model
    Load
  • The CSV files were loaded into the stagingDB in PostgreSQL to draw actionable insights
    Analysis
  • Visually analyzed the final loaded data in Tableau to derive inferences on artists' popularity based on event popularity, ideal venues to host, and date- and hour-wise event and stream counts.

Retrieval Based Customer Interaction Chatbot in Python using Tensorflow, Keras, Pickle

    Technology/Concepts Used: Recurrent Neural Network(LSTM), Natural Language Processing, Stochastic Gradient Descent, Tensorflow, Keras, Pickle, Tkinter, Python

    Dataset:
  • The data file is a JSON file containing pre-defined patterns to be matched and responses to be returned
    Data Transformation:
  • The text data was tokenized and lemmatized using the NLTK package
    Input & Output:
  • The input to the bot was the pattern and the output was the class to which the input pattern belonged; the text was converted to numbers for machine readability
    Data Model:
  • Developed a deep neural network with 3 layers using the Keras Sequential API; accuracy improved to 100% on training up to 200 epochs
    Chat Bot GUI:
  • The trained model received input from the user and predicted the bot's response in a Graphical User Interface implemented using the Tkinter library
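The pattern-to-class step of a retrieval-based bot can be sketched without the full Keras stack. The following toy classifier is purely illustrative (the `intents` dictionary and helper names are hypothetical, not from the original project); it shows how patterns become numeric bag-of-words vectors that map to an intent class:

```python
# Minimal bag-of-words intent matching: patterns become 0/1 vectors,
# and the input is assigned the intent with the greatest word overlap.
intents = {                      # hypothetical stand-in for the JSON file
    "greeting": ["hi there", "hello friend"],
    "goodbye":  ["bye now", "see you later"],
}

vocab = sorted({w for pats in intents.values() for p in pats for w in p.split()})

def to_bow(text):
    """Convert text to a 0/1 bag-of-words vector over the vocabulary."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocab]

def classify(text):
    """Return the intent whose patterns share the most words with the input."""
    bow = to_bow(text)
    def overlap(tag):
        return max(sum(a & b for a, b in zip(bow, to_bow(p)))
                   for p in intents[tag])
    return max(intents, key=overlap)

print(classify("hello there"))  # greeting
```

In the real project this hard match is replaced by a trained neural network, which generalizes beyond exact word overlap.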

Identification of Medical Misinformation on Cannabis Use in Twitter using Python

    Technology/Concepts Used: Python, R, Tableau, NLTK, Sentiment Analysis, Google AutoML

    Data Extraction:
  • Tweets related to cannabis were collected using the ‘GetOldTweets3’ API based on a keyword-search method
    Data Transformation:
  • The collected tweets were cleaned & transformed using the NLTK package
  • Tweets making medical claims were filtered using a bag-of-words approach
    Data Model:
  • Developed an application module using Google AutoML that classified the medical claims as Proven Facts or Medical Misinformation
  • Built a classifier to identify Twitter bots, performed a detailed study of how these bots make false medical claims, promotional claims, etc., and created graphs visualizing these factors in Tableau
    Data Analysis:
  • 99% of the collected tweets with medical claims were identified as false claims about cannabis use
  • The Twitter bots formed networks and made unsubstantiated health claims in an attempt to advertise cannabis products

Real-time accident risk prediction on traffic and studying accident hot-spot locations in the US using Python and Tableau

    Technology/Concepts Used: Python, R, Tableau, Statistical Analysis, Decision Trees, KNN, Multinomial Logit Regression, SVM

    Data Collection:
  • Real-time traffic data was collected from traffic APIs & data providers streaming live traffic
    Data Model:
  • Built an “Accident Risk Prediction” model using machine learning algorithms such as Decision Trees, KNN, Multinomial Logit Regression, and SVM to predict the severity of traffic accidents in the US; SVM outperformed the baseline KNN model, improving the F1-score from 67% to 80% at 100% precision
    Data Analysis:
  • Identified real-time hot-spot locations of accidents & the key factors contributing to the most severe accidents, and visualized these factors in Tableau

Text Mining of Online Physician Reviews to determine crucial aspects of Consultation

    Technology/Concepts Used: Topic Modelling, Natural Language Processing, Text Analytics, Sentiment Analysis, Google AutoML, Web scraping, Statistical Analysis, Python, R, Tableau, Selenium

    Data Extraction:
  • Using Selenium web drivers and Beautiful Soup, reviews for all obstetrics and gynaecology physicians in Illinois were extracted
    Data Transformation:
  • The Natural Language Toolkit (NLTK) was used to split the reviews into phrases for each doctor; sentiment analysis was performed on each phrase to identify its sentiment
  • Using a semi-supervised Latent Dirichlet Allocation approach and a bag-of-words approach, the topic of each phrase was identified
  • Built a series of machine learning models to identify and determine the crucial aspects of consultation based on user reviews on websites like Healthgrades and RateMDs
    Data Analysis:
  • The extracted data was transformed using the above approaches into a dataset used for statistical analysis
  • A Linear Regression model was developed with physician demographics and various review characteristics against the overall ratings to determine the key predictors of a physician's star rating
  • Among Medical Expertise, Bedside Manners, Consultation Cost, Office Environment, Office Staff, Ease of Scheduling, and Waiting Time, Bedside Manners contributed most to physicians' overall ratings

Recommendation System for Amazon Fine Foods using PySpark, Python

    Technology/Concepts Used: Big Data, Collaborative Filtering, Matrix Factorization, Machine Learning, Python, PySpark, Tableau

    Data:
  • The dataset consisted of about 500k user reviews of Amazon Fine Foods spanning 10 years
    Data Model:
  • Big data algorithms such as distance-based user collaborative filtering and Alternating Least Squares (ALS) matrix factorization were used to identify similar products based on users' review ratings
  • For users with no prior history of review ratings, a popularity-based recommendation system was used
    Model Evaluation:
  • Root Mean Square Error was used as the evaluation metric; the Alternating Least Squares matrix factorization method performed better, at an RMSE of 1.204
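The ALS idea can be sketched in a few lines of numpy — a toy version with illustrative dimensions and random data (the real project used Spark's implementation), alternately solving a ridge-regularized least-squares problem for the user factors and then the item factors:

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.integers(1, 6, size=(20, 15)).astype(float)  # toy user-item ratings
k, lam = 4, 0.1                                      # latent dims, ridge term
U = rng.normal(size=(R.shape[0], k))                 # user factors
V = rng.normal(size=(R.shape[1], k))                 # item factors

def solve(F, R, lam):
    """With factor matrix F fixed, solve the ridge-regularized
    least-squares problem for the other factor matrix."""
    A = F.T @ F + lam * np.eye(F.shape[1])
    return np.linalg.solve(A, F.T @ R).T

for _ in range(20):            # alternate: fix V, solve U; fix U, solve V
    U = solve(V, R.T, lam)
    V = solve(U, R, lam)

rmse = np.sqrt(np.mean((R - U @ V.T) ** 2))
print(round(rmse, 3))
```

Predicted ratings are the entries of `U @ V.T`; recommendations come from the highest-scoring unrated items per user.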

End-to-end ETL Project to determine correlation between swear words used & movie rating using Python

    Technology/Concepts Used: Python, MySql, Tableau, NLTK, Beautiful Soup

    Extraction
  • The movie script, movie rating, metascore, etc. were collected from the IMSDB website and the OMDB API
    Transformation
  • The collected text was cleaned using the NLTK package, then aggregated, filtered, and imported into a MySQL database
    Load
  • The data from the MySQL database was loaded into Tableau Desktop to draw actionable insights
    Analysis
  • The analysis showed that movies with the most swear words had decent IMDB ratings (6-8); however, the average number of bad words audiences tolerated ranged between 50-100. On average, Indian movies had the lowest bad-word counts and the highest IMDB ratings

Evaluating the impact of ‘COVID-19’ worldwide and in the US using PowerBI

  • Dataset: The worldwide COVID-19 dataset was taken from the WHO, and the US COVID-19 dataset was aggregated from CDC data
  • Created visual dashboards in Power BI providing a visual overview of the spread of COVID-19 worldwide and in the US
  • Evaluated the trend of increase in total confirmed cases and total deaths by country, by region, and in the states and counties of the US

Evaluating the on-time performance of ‘Envoy Air’ using Tableau

  • Dataset: The dataset was extracted from the US Bureau of Transportation Statistics and contains over 1.3 million records of real-time delay/on-time performance for different airlines
  • Created interactive visual dashboards and stories in Tableau providing a visual overview of Envoy Air’s key performance features
  • Evaluated the on-time performance of ‘Envoy Air’ against every other commercial airline operating in the US using the real-time airline data published by the Bureau of Transportation Statistics (BTS)
  • Suggested profitable routes for ‘Envoy Air’ using network graphs and statistical analysis

Prognosis of Prostate Cancer by identifying Survival Rate using Survival Analysis and Cox Regression

    Technology/Concepts Used: Survival Analysis, Cox Regression, Logistic Regression, Chi-Square test, t-test, Python, R, Microsoft Excel

  • The objective of the project was to determine which patients would survive 7 years from the time of their initial Prostate Cancer diagnosis, considering various other factors
  • The dataset contained records of 6000 patients diagnosed with Prostate Cancer; it was cleaned, and Chi-Square tests, t-tests, and correlation tests were performed to gain initial insights into the independent and dependent variables
  • The key parameters that strongly determine 7-year survival of patients diagnosed with Prostate Cancer were identified using Logistic Regression
  • Created a “Survival Rate Prediction” model using Survival Analysis, identified the risk factors causing Prostate Cancer, and investigated in detail the effect of these risk factors on the survival rate using Cox Proportional-Hazards Regression.
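The survival-curve idea underlying this kind of model can be illustrated with a minimal Kaplan-Meier estimator in plain numpy (toy data, one observation per time point; the actual project used dedicated survival-analysis packages):

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier survival curve: times are follow-up durations,
    events are 1 if the event (death) occurred, 0 if censored."""
    order = np.argsort(times)
    times, events = times[order], events[order]
    at_risk = len(times)
    surv, curve = 1.0, []
    for t, e in zip(times, events):
        if e:                       # event observed at time t
            surv *= (at_risk - 1) / at_risk
        curve.append((t, surv))     # censored cases only shrink the risk set
        at_risk -= 1
    return curve

# Toy data: years until death (event=1) or loss to follow-up (event=0)
times = np.array([2.0, 3.5, 4.0, 5.0, 7.5, 8.0])
events = np.array([1,   0,   1,   1,   0,   1])
curve = kaplan_meier(times, events)
print(curve)
```

Cox regression then models how covariates (age, stage, etc.) scale the hazard underlying this curve.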

Screening tool to identify patients at risk for Chronic Kidney Disease using Python, R

    Technology/Concepts Used: Logistic Regression, Python, R, Microsoft Excel

  • The objective was to use statistical analysis to identify the key parameters of Chronic Kidney Disease so that the model could be used as a screening tool to identify a patient's risk of CKD
  • A dataset with patient demographics and clinical parameters for around 8000 patients was cleaned to address null values and outliers
  • The issue of class imbalance was addressed by a stochastic sampling method
  • Logistic Regression was then used to identify key variables that contribute to the development of Chronic Kidney Disease
  • Since recall of the model was more important than accuracy and precision, the receiver operating characteristic (ROC) curve was used to pick the right probability threshold
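Recall-driven threshold selection can be sketched as follows — a toy example (the scores and labels are illustrative, not project data) that scans candidate cutoffs and keeps the highest one meeting a target recall, since false negatives are the costly error in screening:

```python
import numpy as np

# Hypothetical predicted probabilities and true CKD labels
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.2, 0.1])
labels = np.array([1,   1,   0,   1,   1,   0,    0,   0,   0])

def pick_threshold(scores, labels, min_recall=0.95):
    """Scan candidate thresholds from high to low and return the first
    (highest) one whose recall meets the target."""
    positives = labels.sum()
    for t in sorted(scores, reverse=True):
        pred = scores >= t
        recall = (pred & (labels == 1)).sum() / positives
        if recall >= min_recall:
            return t
    return scores.min()

t = pick_threshold(scores, labels, min_recall=1.0)
print(t)  # 0.4: the highest cutoff that still catches every positive
```

Each candidate threshold corresponds to one point on the ROC curve; requiring a minimum recall fixes the operating point on that curve.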

"Propensity to Respond" Tool to Predict the Likelihood of Customer Behavior at VMWare using R

    Technology/Concepts Used: Machine Learning, Synthetic Minority Oversampling Technique (SMOTE), Python, R, Microsoft Excel

  • Dataset – The target class of the dataset was imbalanced, and the Synthetic Minority Oversampling Technique (SMOTE) was used to balance it
  • Data Transformation – The L1-regularized Logistic Regression (LASSO) algorithm was used for variable reduction, reducing the number of variables from 707 to 90
  • Data Model – Machine learning models such as Random Forest, Gradient Boosting (XGBoost), Ridge Regression, and model stacking were used to develop the “Propensity to Respond” model predicting the likelihood of consumer behavior at VMware
  • Model Evaluation – The Micro-Average F-Score was used as the evaluation metric, on which XGBoost scored 100%; XGBoost was therefore used to create the “Propensity to Respond” model
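SMOTE's core idea — generating synthetic minority samples by interpolating between a minority point and one of its nearest minority-class neighbours — can be sketched in plain numpy (a toy illustration with made-up 2-D data, not the production implementation, which used a library):

```python
import numpy as np

def smote(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by interpolating between
    a randomly chosen sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from sample i to all minority samples
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]       # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()                        # interpolation factor in [0, 1)
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(out)

X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.2]])
synthetic = smote(X_min, n_new=6)
print(synthetic.shape)  # (6, 2)
```

Because each synthetic point lies on a segment between two real minority points, the oversampled class fills in its own region rather than duplicating rows.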


Awards & Certifications

  • Azure Data Engineer - DP 200
  • Tableau Certified Desktop Specialist
  • Tableau Data Scientist issued by Tableau
  • Certified AWS Associate Cloud Developer - Udemy
  • Gold Badge on SQL Programming - Hackerrank
  • Certification in Machine Learning - Coursera
  • ISTQB Certified Agile Tester Certification
  • ISTQB Certified Foundation Level Tester Certification
  • 'Best Performing Employee' by Verizon Data Services for the fourth quarter of 2016.
  • 'Distinction Award' by Anna University.
  • Certified German Language Professional.

Interests

  • Machine Learning
  • Deep Learning
  • Predictive Analytics
  • Natural Language Processing
  • ETL Frameworks
  • Data Visualization

Apart from being a data woman, I enjoy spending my free time experimenting with food: trying different cuisines, baking, and cooking.

I also follow a number of fantasy and medieval drama movies and television shows. I am an avid animal lover and have volunteered at several animal shelters, and I spend a large amount of my free time exploring the latest technology advancements.