Apache Spark Review
Unlike Flink, which is lean and focused, Spark is a huge ecosystem covering everything from RPC and data storage to computation and scheduling.
I am going to talk briefly about Spark Core, Spark SQL, and Spark Streaming, rather than covering all of its features.
Note: this article is based on Spark 3.0.0.
The first time I encountered Spark was when a colleague told me that
Spark is written in Scala. At the time I was studying Groovy as a domain-specific language and knew only a little about Scala.
I felt ashamed that I knew so little about my own profession,
so I began to learn more and more about Scala and Spark.
In my mind, data scientists, analysts, and researchers are the main consumers of this product. They do jobs like machine learning, NLP, and speech processing; however, they don't care how it works under the hood.
As an engineer rather than a researcher, I focus more on the components, extensions, optimizations, running modes, and so on of the Spark platform.
First of all, I have to tell you, I won't write at length:
- There is too much to write (almost a book's worth)
- I want to talk about what I am good at
- I will skip entry-level material, like how to run a demo or how to deploy it on Docker or Kubernetes
Maybe I will write some demos and projects, but not this time.
Yes, of course, I need to talk about
Spark Core first, because, as you can see, the other five components (libraries) are built on top of it.
Having worked in a basic platform & framework department for a long while, I know that almost every framework needs these basic services, components, and sub-systems, and
Spark Core is no exception.
To keep it short, I made a mind map, summarized below:
- SparkConf: how the application is configured (as opposed to the configuration of the Spark nodes themselves, master or slaves)
- SparkEnv: the essential runtime environment of Spark itself
- SparkContext: holds a lot of information, such as the application's network settings, deployment mode, data sources, sinks, computation type, and so on. It is the entry point of an application, and most developer APIs are exposed through this context.
- RPC: how components communicate. It is based on Netty and provides both synchronous and asynchronous calls.
- ListenerBus: an Observer-pattern implementation used by upper-layer components to notify or call each other asynchronously.
- MetricsSystem: metrics reflecting the running status of the cluster
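These pieces come together at application startup. Here is a minimal Scala sketch of the classic entry point (the app name and the local master URL are placeholder values, not anything the cluster requires):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// SparkConf holds application-level settings; SparkContext, built from it,
// is the entry point that wires up SparkEnv, the listener bus, metrics, etc.
val conf = new SparkConf()
  .setAppName("spark-core-demo")   // placeholder app name
  .setMaster("local[*]")           // run locally on all cores; a cluster URL in production

val sc = new SparkContext(conf)
println(sc.applicationId)          // the context is now live and usable
sc.stop()
```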
The Storage System functions as a manager of data storage methods and locations (memory or file system).
The Computation Engine is composed of a memory manager, a task manager, tasks, and so on.
Scheduling is handled by the
DAGScheduler (every job is treated as a directed acyclic graph) and the
TaskScheduler. The former creates
jobs, splits the RDDs of each job (DAG) into
stages, creates the related
tasks of each stage, and submits those tasks; the latter is responsible for scheduling tasks using FIFO, FAIR, or other algorithms.
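As a sketch of how a job maps onto stages, consider a tiny word count in Scala: the shuffle introduced by reduceByKey splits the DAG into two stages, and the collect() action is what actually submits the job to the DAGScheduler (app name and master are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("stages-demo").setMaster("local[*]"))

val counts = sc.parallelize(Seq("a b", "b c"))
  .flatMap(_.split(" "))          // narrow transformations: stay in stage 0
  .map(word => (word, 1))
  .reduceByKey(_ + _)             // shuffle boundary -> stage 1

// Action: DAGScheduler builds the stages, TaskScheduler runs their tasks.
counts.collect().foreach(println)
sc.stop()
```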
There is not much more to say unless we actually use it.
Spark SQL provides us with an easy way to work with (CRUD) data in a cluster, just like traditional SQL. Besides that, it supports Hive SQL.
The usage looks something like this:
String jsonPath = "./person.json";
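Fleshing that one-liner out, here is a hedged Scala sketch of reading that JSON file into a DataFrame and querying it with plain SQL (the column names `name` and `age` are assumptions about what person.json contains):

```scala
import org.apache.spark.sql.SparkSession

// SparkSession is the Spark SQL entry point (it wraps SparkContext).
val spark = SparkSession.builder()
  .appName("spark-sql-demo")   // placeholder app name
  .master("local[*]")
  .getOrCreate()

// Load the JSON file as a DataFrame; the schema is inferred.
val people = spark.read.json("./person.json")
people.createOrReplaceTempView("person")

// Query it exactly like a traditional SQL table.
spark.sql("SELECT name, age FROM person WHERE age > 20").show()
spark.stop()
```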
When using Datasets, we may need to deal with transformations between
Spark RDDs and Datasets/DataFrames.
On other platforms, like
Storm, we might send offline data to MapReduce and streaming or near-real-time data to Storm; in Spark, we can handle both, because
Spark Streaming enables us to.
It makes computing over near-real-time data easy and efficient, and provides us with batch operations over (time) windows.
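As a sketch of that windowing ability with the classic DStream API, here is a windowed word count in Scala; the socket source, host/port, batch interval, and window sizes are all placeholder choices:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// local[2]: one thread for the receiver, one for processing.
val conf = new SparkConf().setAppName("dstream-window").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))   // 5s micro-batches

val counts = ssc.socketTextStream("localhost", 9999)  // placeholder source
  .flatMap(_.split(" "))
  .map((_, 1))
  // Count words over a 30s window, recomputed every 10s.
  .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

counts.print()
ssc.start()
ssc.awaitTermination()
```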
Structured Streaming is an extension of the Spark SQL engine that treats incoming data as newly arrived rows of an endless two-dimensional table. In Spark Streaming, streaming data is treated as batch data divided by time windows;
the Structured Streaming engine, however, does not work this way. It works like this:
Neither of these APIs is complicated. Classic Spark Streaming is all about DStream and RDD objects, while Structured Streaming is about Dataset and DataFrame objects, where a DataFrame is treated as an endless table.
Example code looks like:
// a stream of data loaded as a DataFrame
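Expanding that comment into a hedged Scala sketch of the Structured Streaming version: the socket source is read as an unbounded DataFrame, and each micro-batch appends new rows to it (host and port are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("structured-streaming-demo")  // placeholder app name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// A stream of data loaded as a DataFrame: new lines become new rows
// of an unbounded table.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")   // placeholder source
  .option("port", 9999)
  .load()

// Running word count, expressed with the same Dataset operations as batch code.
val counts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

counts.writeStream
  .outputMode("complete")   // re-emit the full updated table each trigger
  .format("console")
  .start()
  .awaitTermination()
```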
As for other components like GraphX, SparkR, and MLlib — sorry, I know of them but have used them very little.
And other topics, like the cluster architecture (Cluster Manager, Worker, Executor, Driver, Application),
or RDD partitioning and combination, job submission and execution —
let us leave those to other articles, such as
<How we deal with RDD in Spark>.
Yes, I did write some, but I'd rather you read them yourself, since they are very simple demos.
See: https://github.com/yanyueio/code_base, the spark part.