Apache Spark Review

Unlike Flink, which is short and sharp, Spark is a huge ecosystem that covers everything from RPC and data storage to computation and scheduling…

I am going to talk briefly about Spark Core, Spark SQL, and Spark Streaming rather than covering all of its features.


Note: this article is based on Spark 3.0.0.

Spark in my mind

I first heard about Spark when a colleague told me it was written in Scala, back when I was studying Groovy as a domain-specific language and knew only a little about Scala.

I felt ashamed that I knew so little about my own profession.

After that, I began to learn more and more about Scala and Spark.

In my mind, data scientists, analysts, and researchers are the main consumers of this product. They do jobs like machine learning, NLP, and speech processing, but they don't care how it works underneath.

As an engineer rather than a researcher, I focus more on the components, extensions, optimizations, running modes … of the Spark platform. For example:

Spark Platform

First of all, I have to tell you, I won't make this long:

  • There is too much to write (almost a book's worth)
  • I want to talk about what I am good at
  • I will skip entry-level stuff, like how to run a demo or how to deploy it on Docker or k8s

Maybe I will write some demos and projects later, but not this time.

Spark Core

Yes, of course, I need to talk about Spark Core first, because as you can see, the other five components (libraries) are built on top of Spark Core.

Having worked in a Basic Platform & Framework department for quite a while, I know that almost every framework needs these basic services, components, and sub-systems, and Spark Core is no exception.

To keep it short, I made a mind map, shown below:

Spark Core

Explanations:

  • Spark Conf: application-level configuration (as opposed to the configuration of a Spark node, master, or slave); see the sketch after this list
  • Spark Env: the essential runtime environment of Spark itself
  • Spark Context: contains a lot of information such as the app's network, deployment method, data sources, sinks, computation type, and so on. It is the entry point of an application, and most developer APIs are provided through this context.
  • RPC: how components communicate. It is based on Netty and provides both sync and async calls.
  • Listener Bus: an Observer pattern implementation used by upper-level components to notify or call each other asynchronously
  • MetricsSystem: metrics that describe the running status of the cluster
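
Here is a minimal sketch of how SparkConf and the context fit together, assuming a local master ("core-demo" is just an illustrative app name; JavaSparkContext is the classic RDD entry point for Java users):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Application-level configuration: the app name and where the master runs.
SparkConf conf = new SparkConf()
        .setAppName("core-demo")
        .setMaster("local[*]");   // "local[*]" = run locally using all cores

// The context is the entry point; it owns the env, scheduler, listener bus, etc.
JavaSparkContext sc = new JavaSparkContext(conf);
System.out.println("app id: " + sc.sc().applicationId());
sc.stop();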

The Storage System functions as a manager of data storage methods and locations (memory or file system).

The Computation Engine is composed of the memory manager, the task manager, tasks, and so on.

The Scheduler includes the DAGScheduler (every job is treated as a Directed Acyclic Graph) and the TaskScheduler. The former creates a job, splits the RDDs of each job (DAG) into stages, creates the tasks belonging to each stage, and submits them; the latter is responsible for scheduling tasks with FIFO, FAIR, or other algorithms. The small word-count sketch below shows where a stage boundary falls.
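
A hedged illustration, reusing the JavaSparkContext sc from the Spark Core sketch above (the input path is hypothetical): narrow transformations stay in one stage, and the shuffle introduced by reduceByKey starts a new one, which is exactly the split the DAGScheduler performs.

import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

JavaRDD<String> lines = sc.textFile("input.txt");             // hypothetical input file
JavaPairRDD<String, Integer> counts = lines
        .flatMap(l -> Arrays.asList(l.split(" ")).iterator()) // narrow: same stage
        .mapToPair(w -> new Tuple2<>(w, 1))                   // narrow: same stage
        .reduceByKey(Integer::sum);                           // shuffle: new stage
// The action below makes the DAGScheduler build stages and the TaskScheduler run tasks.
counts.collect().forEach(t -> System.out.println(t._1 + ": " + t._2));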

Spark SQL

There is not much to say about this unless we actually use it.

Spark SQL provides an easy way to work with (CRUD) data in a cluster, just like traditional SQL. Besides that, it supports Hive SQL.

The usage looks something like this:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

String jsonPath = "./person.json";
// 1. Define a SparkSession
SparkSession spark = SparkSession.builder().appName("Spark SQL simple demo").getOrCreate();
// 2. Create a Dataset based on the JSON file
Dataset<Row> ds = spark.read().json(jsonPath);
// 3. Use it like general SQL
// ds.show();
// ds.printSchema();
ds.select("name").show();


/* After you submit the task to Spark, the result may look like:
 *
 * +-------+
 * |   name|
 * +-------+
 * |   Mike|
 * |    Bob|
 * +-------+
 *
 */
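
As a hedged extra step, reusing the spark session and ds Dataset above (and assuming person.json also carries an age column, which is my assumption, not shown in the output): registering a temporary view lets us run plain SQL text against it.

// Assumption: person.json has an "age" column alongside "name".
ds.createOrReplaceTempView("person");
spark.sql("SELECT name FROM person WHERE age > 20").show();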

When using Datasets, we may need to convert among Spark RDDs, Datasets, and DataFrames, as sketched below.
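
A minimal sketch of the common conversions, reusing the spark session and ds Dataset from the demo above; the Person bean is a hypothetical POJO whose fields match person.json:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Encoders;

JavaRDD<Row> rowRdd = ds.toJavaRDD();                          // DataFrame -> RDD
Dataset<Person> people = ds.as(Encoders.bean(Person.class));   // untyped -> typed Dataset
Dataset<Row> back = people.toDF();                             // typed Dataset -> DataFrame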

Spark Streaming

On other platforms, such as Hadoop and Storm, we might send offline data to MapReduce and streaming or real-time data to Storm; in Spark, we can do both, because Spark Streaming makes it possible.

It makes computing on real-time data easy and efficient, and it gives us batch operations over (time) windows.

source and sink of Streaming Engine

Apart from Structured Streaming, an extension of the Spark SQL engine that treats data as newly arrived rows of an endless two-dimensional table, almost all streaming data is handled by the Spark Streaming engine, which treats it as batches of data divided by time windows.

batch data divided by time window
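
Here is a minimal sketch of that classic DStream model, assuming a local master and a socket source on localhost:9999 (both assumptions for illustration): each 5-second interval becomes one micro-batch RDD, and window() regroups those batches.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("dstream-demo");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
lines.count().print();                                      // size of each 5s batch
lines.window(Durations.seconds(30), Durations.seconds(10))  // 30s window, sliding every 10s
     .count().print();

jssc.start();
jssc.awaitTermination();   // throws InterruptedException; wrap or declare it in real code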

However, the Structured Streaming engine does not work this way; it works like this:

the unbounded table model of Structured Streaming

Neither of these models is complicated. Ordinary streaming is all about DStream and RDD objects, while structured streaming is about Dataset and DataFrame objects treated as an endless table.

Example code looks like:

// a series of data loaded as a DataFrame
// public Dataset<Row> load(String path)
// doc: Loads input in as a DataFrame, for data streams that read from some path.
Dataset<Row> lines = spark.readStream()
        .format("socket")
        .option("host", "localhost")   // the socket source needs a host and a port
        .option("port", 9999)
        .load();

// here "lines" represents an unbounded (streaming) DataFrame
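
As a follow-up sketch, assuming the same spark session and the lines DataFrame above: a word count query that prints each micro-batch result to the console, in the style of the Spark quick-start docs (the "value" column is what the socket source emits).

import org.apache.spark.sql.streaming.StreamingQuery;
import static org.apache.spark.sql.functions.explode;
import static org.apache.spark.sql.functions.split;

// Split each line into words, then count over the whole (unbounded) input.
Dataset<Row> words = lines.select(explode(split(lines.col("value"), " ")).alias("word"));
Dataset<Row> counts = words.groupBy("word").count();

// "complete" mode re-emits the whole aggregated table on every trigger.
StreamingQuery query = counts.writeStream()
        .outputMode("complete")
        .format("console")
        .start();
query.awaitTermination();   // throws StreamingQueryException; handle it in real code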

The Other Components

As for the other components, like GraphX, SparkR, and MLlib, sorry, I know of them but have used them very little.

And other stuff, like its cluster architecture (Cluster Manager, Worker, Executor, Driver, Application),

cluster architecture

or how RDDs are split and combined, submitted, and executed,

rdd actions

let us leave those to other articles, like <How We Deal with RDDs in Spark>.

Demos

Yes, I wrote some, but I'd rather you read them yourself, since they are demos and quite simple.

See: https://github.com/yanyueio/code_base, under spark.

References

  1. Spark docs

  2. Cluster mode overview

  3. The road of learning Spark - Chinese edition

  4. Spark: The Definitive Guide