

11 May 2025

Ramya.M

WHAT IS APACHE SPARK

Apache Spark is an open-source cluster-computing framework that is frequently used alongside the Hadoop ecosystem. For programming an entire cluster, Spark provides an interface with implicit data parallelism and fault tolerance.

THE APACHE SPARK ARCHITECTURE INCLUDES:

SPARK CORE: Spark Core is the foundation of the overall project. It provides distributed task dispatching, scheduling, and basic I/O functionality, exposed through an application programming interface centred on the RDD (Resilient Distributed Dataset) abstraction; the Java API serves the JVM languages, and there are also APIs for Scala and Python. The interface mirrors a functional, higher-order model of programming: a driver program invokes parallel operations such as map, filter, or reduce on an RDD by passing a function to Spark, which then schedules the function's execution in parallel across the cluster. These operations, along with additional ones such as joins, take RDDs as input and produce new RDDs. RDDs are immutable and their operations are lazy; fault tolerance comes from tracking the lineage of each RDD so it can be reconstructed if data is lost.
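A minimal sketch of the driver-program pattern described above, assuming a local Spark installation; the object name and the `local[*]` master URL are illustrative only:

```scala
import org.apache.spark.sql.SparkSession

object RddExample {
  def main(args: Array[String]): Unit = {
    // Build a local SparkSession; point the master at a real cluster in production.
    val spark = SparkSession.builder()
      .appName("RddExample")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Create an RDD from a local collection.
    val numbers = sc.parallelize(1 to 10)

    // Transformations (filter, map) are lazy; nothing executes yet.
    val squaresOfEvens = numbers.filter(_ % 2 == 0).map(n => n * n)

    // reduce is an action: it triggers Spark to schedule the work across the cluster.
    val sum = squaresOfEvens.reduce(_ + _)
    println(s"Sum of squares of even numbers: $sum")

    spark.stop()
  }
}
```

Note how the driver only describes the computation; Spark decides where each partition of the RDD is processed.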

SPARK SQL: Spark SQL is a component on top of Spark Core that introduced a data abstraction called DataFrames, which provides support for structured and semi-structured data. Spark SQL supplies a domain-specific language (DSL) for manipulating DataFrames in Scala, Java, and Python, and it also offers SQL language support, including command-line interfaces and an ODBC/JDBC server. Although DataFrames lack the compile-time type checking afforded by RDDs, as of Spark 2.0 the strongly typed Dataset is fully supported by Spark SQL as well.
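A small sketch showing both the DataFrame DSL and the equivalent SQL query; the sample data and column names are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

object DataFrameExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataFrameExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Build a DataFrame from a small in-memory dataset.
    val people = Seq(("Alice", 34), ("Bob", 28), ("Cathy", 45))
      .toDF("name", "age")

    // DSL-style query on the DataFrame.
    people.filter($"age" > 30).select($"name").show()

    // The same query expressed in SQL against a temporary view.
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    spark.stop()
  }
}
```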

SPARK STREAMING: Spark Streaming uses Spark Core's fast scheduling capability to perform streaming analytics, commonly ingesting data from sources such as Apache Kafka. It takes in data in mini-batches and performs RDD transformations on those mini-batches.
This design lets the same application code written for batch analytics be reused for streaming analytics, which makes it easy to implement a lambda architecture. The convenience, however, comes with a latency penalty equal to the duration of the mini-batch.
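A minimal streaming word-count sketch using the micro-batch model described above; the socket source on port 9999 (fed by something like `nc -lk 9999`) is only an assumption for local testing, and Kafka would be a more typical source in production:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // Micro-batches of 5 seconds; the batch interval sets the minimum latency.
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Read lines from a local socket; swap in a Kafka source for real pipelines.
    val lines = ssc.socketTextStream("localhost", 9999)

    // The same RDD-style transformations used in batch code apply to each micro-batch.
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```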

MLLIB (MACHINE LEARNING LIBRARY): Spark MLlib is a distributed machine-learning framework built on top of Spark Core. Thanks in large part to Spark's distributed, memory-based architecture, it has been measured as up to nine times as fast as the disk-based implementation used by Apache Mahout, and it scales better than Vowpal Wabbit.

GRAPHX: GraphX is a distributed graph-processing framework built on top of Apache Spark. Because it is based on RDDs, which are immutable, graphs are immutable as well, so GraphX is unsuitable for graphs that need to be updated, let alone updated transactionally in the way a graph database allows.
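A brief MLlib sketch, assuming the DataFrame-based spark.ml API and a toy in-memory dataset invented for illustration:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object MLlibExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MLlibExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A tiny in-memory dataset of 2-dimensional points.
    val points = Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.2)
    ).map(Tuple1.apply).toDF("features")

    // Fit a k-means model with two clusters; training is distributed across the cluster.
    val model = new KMeans().setK(2).setSeed(1L).fit(points)
    model.clusterCenters.foreach(println)

    spark.stop()
  }
}
```

And a GraphX sketch showing that a graph is built from vertex and edge RDDs and that algorithms such as PageRank return a new graph rather than mutating the original; the vertex names are illustrative:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

object GraphXExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("GraphXExample")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Vertices and edges are plain RDDs, so the resulting graph is immutable.
    val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Cathy")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
    val graph = Graph(vertices, edges)

    // Run PageRank; the result is a new set of vertices, the input graph is unchanged.
    val ranks = graph.pageRank(0.001).vertices
    ranks.collect().foreach(println)

    spark.stop()
  }
}
```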