Data Engineer · Boston, MA

Hetal Vaghela

Building pipelines that power healthcare & financial decisions at scale. 4+ years engineering real-time streaming, cloud migrations, and ML-powered data platforms.

// live metrics
500K+
Events / sec
9,000+
Store locations
$4.2M
Fraud prevented
47
SSIS packages migrated
// pipeline flow
Source → Kafka / Auto Loader
Bronze → Silver (DLT / Spark)
Gold → dbt → BI · ML · APIs
Apache Kafka · Apache Spark · AWS Glue · Snowflake · Databricks · dbt · Delta Lake · Python · TensorFlow · XGBoost · HIPAA · FHIR R4 · Unity Catalog · Control-M
01 — About

Building data systems that actually work

I'm a Data Engineer who thrives at the intersection of complex data problems and elegant engineering. Currently at CVS Health, I architect real-time streaming platforms and healthcare data lakehouse systems processing millions of events daily across 9,000+ pharmacy locations.

My work spans the full data engineering stack — from ingestion and streaming through transformation, governance, and ML deployment — always with a focus on reliability, compliance, and measurable business impact.

// pipeline I build every day
Source → Kafka / Auto Loader → Bronze → Spark / DLT → Silver → dbt / Gold → BI · ML
Current Role
Data Engineer
CVS Health · Boston, MA · Jun 2024 – Present
Specializations
Real-Time Streaming · Cloud Migration
Healthcare & Financial · ML Platforms · HIPAA
Education
M.S. Information Technology
UMass Boston · GPA 3.78 · Graduate Assistantship
Domain Expertise
Healthcare · Financial Services
Credit Risk · Insurance Fraud · E-Commerce
02 — Skills

Technical arsenal

Streaming & Processing
Apache Kafka · Apache Spark · Spark Structured Streaming · Delta Live Tables · AWS Kinesis · Schema Registry · Apache Airflow · Avro / Parquet
Cloud Platforms
AWS (Glue, S3, EMR, MSK) · Databricks · GCP (Dataflow, BigQuery) · Azure Data Factory · AWS Lambda · AWS Macie / KMS · Terraform IaC · MWAA
Data Warehousing
Snowflake · dbt Core · Delta Lake · Google BigQuery · Oracle GoldenGate CDC · MetricFlow · Unity Catalog · Amazon Athena
Languages & Tools
Python · SQL · PySpark · FastAPI · Bash / Shell · Control-M · GitHub Actions · SQLFluff
🤖
Machine Learning
TensorFlow Wide & Deep · XGBoost · Scikit-learn · Isolation Forest · SHAP Explainability · GCP Vertex AI · MinHashLSH · SMOTE
🏥
Compliance & Governance
HIPAA · FHIR R4 · IFRS 9 / Basel III · HEDIS Measures · Delphix Masking · AWS CloudTrail · SCD Type 2 · Row-Level Security
03 — Experience

Where I've made an impact

Jun 2024 – Present · Boston, MA
Data Engineer
CVS Health
  • Architected a real-time Kafka streaming platform ingesting 500K events/sec from 9,000+ pharmacy locations, reducing inventory latency from 6 hours to under 60 seconds
  • Led end-to-end migration of 47 SSIS packages to AWS Glue PySpark with automated HIPAA compliance via KMS, AWS Macie, and CloudTrail — zero audit findings
  • Built Databricks + Snowflake Unified Lakehouse (Medallion Architecture) for clinical analytics across 50+ hospital partners using FHIR R4 and Unity Catalog
  • Developed centralized dbt project with 200+ models, 680+ schema tests, and 45 MetricFlow metrics — cutting data errors by 95%
  • Implemented Delphix data masking and automated CI/CD, reducing deployment errors by 25% and improving processing efficiency by 40%
<60s Latency · +30% Data Quality · -35% Infra Cost · 0 HIPAA Findings · 300+ Analysts
Jun 2018 – Jul 2022 · India
Data Engineer
Magna Infotech
  • Built Snowflake financial data warehouse (Kimball model) replacing Oracle + Hadoop silos, cutting query times by 45% and improving report accuracy by 35%
  • Implemented Oracle GoldenGate CDC, 120+ dbt models, IFRS 9 Credit Risk and Basel III Finance data marts
  • Deployed a TF Wide & Deep credit-default model (AUC 0.88) and an XGBoost + Isolation Forest fraud-detection ensemble, preventing $4.2M in fraud annually with 93% recall
  • Orchestrated containerized ML microservices on GKE with SHAP explainability at 140ms p99 inference latency
  • Integrated Hadoop and Spark with Oracle and CockroachDB, reducing OLAP retrieval times by 50%
-45% Query Time · $4.2M Fraud Saved · 93% Recall · AUC 0.88 · 140ms p99
04 — Projects

Enterprise-grade systems shipped

PROJECT / 01
CVS Health
Real-Time Pharmacy Inventory & Order Streaming Platform

Event-driven Kafka streaming pipeline ingesting pharmacy data from 9,000+ CVS locations with sub-60-second latency, Avro schema validation, and real-time QuickSight dashboards for 1,200+ users.

<60s Latency · 500K Events/sec · +30% Quality
Apache Kafka (MSK) · Spark Streaming · Delta Lake · AWS EMR · QuickSight · Terraform
PROJECT / 02
CVS Health
SSIS → AWS Glue Migration + HIPAA Automation

Migration of 47 SSIS packages to AWS Glue PySpark with automated HIPAA compliance — KMS, Macie, CloudTrail — and Blue/Green CI/CD via GitHub Actions.

-35% Cost · 0 HIPAA Findings
AWS Glue · PySpark · KMS + Macie · Terraform
PROJECT / 05
CVS Health
Databricks + Snowflake Unified Clinical Lakehouse

Medallion Architecture on Databricks for 50+ hospital partners, with a MinHashLSH-based master patient index (MPI) for patient identity matching and zero-copy Delta Sharing to Snowflake.

5-min ADE Alerts · 94% MPI Precision
Databricks · DLT · Unity Catalog · FHIR R4 · Snowflake
PROJECT / 06
CVS Health
dbt Healthcare Metrics & Semantic Layer

200+ dbt models, 680+ schema tests, 45 MetricFlow canonical metrics. Slim CI cut build time from 45 min to 7 min with zero dashboard downtime.

-95% Data Errors · 7-min CI
dbt Core · MetricFlow · Snowflake · GitHub Actions
PROJECT / 03
Magna Infotech
Snowflake Financial Data Warehouse

Centralized Snowflake DW replacing Oracle + Hadoop silos. GoldenGate CDC, 120+ dbt models, IFRS 9 and Basel III data marts.

-45% Query Time · +35% Accuracy
Snowflake · GoldenGate · dbt · BigQuery
PROJECT / 04
Magna Infotech
ML-Powered Financial Forecasting & Insurance Fraud Detection

TF Wide & Deep (AUC 0.88) for credit default + XGBoost + Isolation Forest fraud ensemble with SHAP explainability. Deployed on GKE with HPA autoscaling.

$4.2M Fraud Prevented · 93% Recall · 140ms p99 · AUC 0.88
TensorFlow · XGBoost · SHAP · FastAPI · Docker + GKE · Vertex AI
Personal & Learning

Hands-on builds

P-01
AWS Data Pipeline — S3, Lambda, Glue, Athena, Step Functions
End-to-end cloud pipeline
P-02
Apache Airflow Orchestration Pipelines
DAG design & workflow automation
P-03
Airbnb End-to-End — dbt + Snowflake + AWS
Full ELT pipeline
P-04
Databricks Lakehouse — Spark Declarative Pipelines
Medallion architecture
P-05
Stock Market Real-Time Analysis — Apache Kafka
End-to-end streaming
P-06
SQL Data Warehouse from Scratch
Dimensional modeling
P-07
PAN Card Data Cleaning & Validation
Python & SQL data quality
P-08
Crime Rate Analysis — Tableau Dashboard
UMass Boston
P-09
Cassandra NoSQL Database Project
Distributed systems · UMass Boston
P-10
Python Web Scraping Project
Data extraction · UMass Boston
05 — Education

Academic foundation

Master's Degree
UMass Boston
M.S. Information Technology
Aug 2022 – May 2024
3.78
GPA · Half Tuition Waiver + Graduate Assistantship
Data Management Systems · Business Intelligence · Big Data Analytics · Business Programming · Project Management · Predictive Analytics
Bachelor's Degree
Gujarat Technological University
B.E. Information Technology
Jul 2014 – May 2018
8.04
CGPA
Java, Python, C/C++ · DBMS & SQL · Data Structures & Algorithms · Big Data Fundamentals · Software Engineering · Web Technologies
06 — Contact

Let's build something remarkable together

Open to new opportunities in data engineering, platform engineering, and ML infrastructure — especially in financial services, healthcare, and high-throughput data environments.