Unified Analytics Engine
spark.apache.org
About this website
Apache Spark is a unified analytics engine for large-scale data processing. It originated at UC Berkeley in 2009 and is now an Apache Software Foundation top-level project used by thousands of organizations, some of which process petabytes of data.

For certain workloads, such as iterative algorithms that reuse the same data, the engine can run up to 100 times faster than Hadoop MapReduce. The gain comes from in-memory computation: intermediate results are cached in distributed memory across cluster nodes rather than written to disk between processing stages. The core abstraction is the Resilient Distributed Dataset (RDD), an immutable distributed collection of objects that is processed in parallel. If partitions are lost to node failure, they are rebuilt automatically from lineage information, the recorded chain of transformations that produced them.

DataFrames and Datasets provide higher-level APIs backed by the Catalyst query optimizer. Catalyst analyzes the requested operations and generates optimized physical execution plans, applying predicate pushdown, column pruning, and join reordering. Adaptive query execution then adjusts plans at runtime based on statistics gathered from intermediate results.

Structured Streaming processes real-time data streams with the same DataFrame API. It supports event-time windowing and watermarking for handling late data, and it offers end-to-end exactly-once guarantees when used with replayable sources and idempotent sinks.

MLlib provides distributed machine learning algorithms for classification, regression, clustering, collaborative filtering, and dimensionality reduction, along with pipeline APIs for feature extraction, transformation, and model selection. GraphX supports graph computation with operators for subgraph extraction and vertex and edge aggregation, plus graph algorithms such as PageRank and connected components.

The engine integrates with diverse data sources and formats, including HDFS, S3, Azure Data Lake, Cassandra, HBase, Hive, Parquet, ORC, JSON, and JDBC databases. Supported cluster managers are standalone, YARN, Mesos (deprecated since Spark 3.2), and Kubernetes. Dynamic resource allocation scales the number of executors with workload demand.
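The lineage-based recovery described above can be illustrated with a toy model in plain Python. This is a minimal sketch and not Spark's API: the class `ToyDataset` and its methods `from_list`, `partition`, and `lose_partition` are hypothetical names invented for this example. Each dataset keeps a function that knows how to rebuild any of its partitions from its parent, so a dropped partition can be recomputed on demand.

```python
class ToyDataset:
    """Toy stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, num_partitions, compute_partition):
        self.num_partitions = num_partitions
        # Lineage: a recipe for rebuilding partition i from scratch.
        self._compute_partition = compute_partition
        # Materialized partitions; entries may disappear on "node failure".
        self._cache = {}

    @classmethod
    def from_list(cls, data, num_partitions):
        # Split the source data round-robin into immutable chunks.
        chunks = [tuple(data[i::num_partitions]) for i in range(num_partitions)]
        return cls(num_partitions, lambda i: list(chunks[i]))

    def map(self, fn):
        parent = self
        return ToyDataset(
            self.num_partitions,
            lambda i: [fn(x) for x in parent.partition(i)],
        )

    def filter(self, pred):
        parent = self
        return ToyDataset(
            self.num_partitions,
            lambda i: [x for x in parent.partition(i) if pred(x)],
        )

    def partition(self, i):
        # Recompute from lineage only if the partition is not materialized.
        if i not in self._cache:
            self._cache[i] = self._compute_partition(i)
        return list(self._cache[i])

    def lose_partition(self, i):
        # Simulate losing the node that held partition i.
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


base = ToyDataset.from_list(list(range(10)), num_partitions=3)
squares = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

before = squares.collect()
squares.lose_partition(1)   # partition 1 is gone from the cache
after = squares.collect()   # rebuilt from lineage on access
```

Here `before` and `after` hold the same elements, because the lost partition is recomputed by replaying `map` and `filter` over the parent's data. The real engine tracks lineage across a cluster and schedules the recomputation itself; this sketch only mirrors the idea in a single process.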