Install Spark and Hadoop on Windows 10
Goal
Run Apache Spark locally on Windows 10 with Hadoop binaries.
Prerequisites
Verify Java and Scala installations:
java -version
openjdk version "1.8.0_292"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_292-b10)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.292-b10, mixed mode)
sbt console
Welcome to Scala 2.12.13 (OpenJDK 64-Bit Server VM, Java 1.8.0_292).
Install Spark
Download from https://spark.apache.org/downloads.html:
- Spark release: 3.2.0
- Package type: Pre-built for Apache Hadoop 2.7
Extract to C:\Spark\spark-3.2.0-bin-hadoop2.7
Set environment variables (run from an elevated command prompt; the /M flag writes to the machine scope):
setx SPARK_HOME "C:\Spark\spark-3.2.0-bin-hadoop2.7" /M
setx PATH "%PATH%;%SPARK_HOME%\bin" /M
Install Hadoop
Download Windows binaries from https://github.com/cdarlint/winutils
Copy the hadoop-2.7.7 folder to C:\Hadoop\hadoop-2.7.7, so that winutils.exe ends up at C:\Hadoop\hadoop-2.7.7\bin\winutils.exe
Set environment variables:
setx HADOOP_HOME "C:\Hadoop\hadoop-2.7.7" /M
setx PATH "%PATH%;%HADOOP_HOME%\bin" /M
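With both SPARK_HOME and HADOOP_HOME set, open a new command prompt and run spark-shell to verify the setup. A minimal smoke test inside the shell might look like the following sketch (spark-shell provides the `spark` session and `sc` context automatically):

```scala
// Inside spark-shell:

// Sum the integers 1..100 to confirm Spark can run a local job.
val data = sc.parallelize(1 to 100)
println(data.sum())  // 5050.0

// A tiny word count over an in-memory collection.
val counts = sc.parallelize(Seq("spark", "hadoop", "spark"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.collect().foreach(println)
```

If spark-shell starts without winutils-related errors and the job completes, the Spark and Hadoop environment variables are picked up correctly.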
Additional Resources
Spark Structured Streaming with Kafka:
- https://spark.apache.org/docs/3.1.1/structured-streaming-kafka-integration.html
- https://medium.com/expedia-group-tech/apache-spark-structured-streaming-checkpoints-and-triggers-4-of-6-b6f15d5cfd8d
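As a starting point for the Kafka integration linked above, a minimal read-from-Kafka sketch in Scala could look like this. The broker address (localhost:9092), topic name ("events"), and checkpoint path are placeholder assumptions; the Kafka connector package must be supplied at launch:

```scala
// Launch with the Kafka connector matching your Spark/Scala version, e.g.:
//   spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0

// Subscribe to a Kafka topic as a streaming DataFrame.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // assumed broker
  .option("subscribe", "events")                        // assumed topic
  .load()

// Kafka keys/values arrive as binary; cast them to strings,
// then write each micro-batch to the console with checkpointing.
val query = df
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("console")
  .option("checkpointLocation", "C:/tmp/spark-checkpoints")  // assumed path
  .start()

query.awaitTermination()
```

The checkpoint location is what lets the stream resume from its last committed offsets after a restart, which is the subject of the second article linked above.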
Parquet Viewer:
Azure Data Lake Gen 2 Integration: