Learning PySpark

  • 2018
  • 1 Season

Learning PySpark from Packt Publishing is an instructional course that delves into the basics of PySpark and its practical applications in the big data world. This course aims to provide viewers with an in-depth understanding of PySpark, a tool that facilitates the processing of big data with Python. With the help of this course, viewers can develop their skills in using PySpark to extract insights from vast amounts of data, and consequently make informed decisions based on this data.

The course begins with an introduction to PySpark and its architecture. This includes an explanation of PySpark's underlying components, such as SparkContext, SparkSession, and RDD. Viewers will be taught how to create and manipulate RDDs using PySpark's APIs. There is also a section dedicated to Apache Spark's data processing engine, including a tutorial on how to handle data using DataFrames and SQL.

To use PySpark efficiently, it is essential to understand its computational model. The course covers this in detail, including the transformation and action operations. Viewers will learn how transformations convert a source RDD into a new RDD, while actions trigger the computations in a PySpark program. There is also a section dedicated to Spark's built-in machine learning library, MLlib. This section provides a comprehensive introduction to MLlib and shows how its algorithms can be used to address common machine learning use cases.

The course also covers some of the challenges that can arise while working with big data. This includes an explanation of PySpark's fault-tolerance mechanism and the role of Hadoop Distributed File System (HDFS) in PySpark's distributed computing environment. Viewers will also learn how PySpark's cache mechanism can be used to optimize computation efficiency.

Learning PySpark from Packt Publishing is designed to be interactive and hands-on. The course includes multiple exercises and projects that give viewers a chance to apply their knowledge of PySpark to real-world scenarios. One such exercise focuses on using PySpark to process aviation data records to determine flight delays. There is also a project that teaches viewers how to collect and process data from social media using PySpark. This approach encourages a deeper understanding of PySpark's capabilities and its practical applications.

Another critical area of focus in the course is the PySpark workflow. This includes a comprehensive tutorial on how to create and manage PySpark projects. In addition, viewers will learn how to use PySpark's Python notebooks for data exploration and visualization. There is also a section dedicated to PySpark's deployment process, including instructions around packaging PySpark applications into a distributable format.

Throughout the course, the instructors provide detailed explanations of the concepts and examples, making the material easy to follow. They use real-world scenarios to illustrate how PySpark can be used to solve common big data tasks, such as data cleaning, analysis, and visualization. This approach helps learners understand the practical relevance of the framework and see PySpark beyond its technical details.

In conclusion, Learning PySpark from Packt Publishing is an excellent course designed to provide learners with an overview of PySpark and its practical applications. Whether you are a Python developer looking to expand your skill set or a data scientist looking to work with big data, this course can help you master the fundamentals of PySpark. The course accommodates learners at all levels, starting from those who are new to PySpark right through to those who are already familiar but want to take their skills to the next level.

Learning PySpark is a series that is currently running and has 1 season (32 episodes). The series first aired on February 26, 2018.

Seasons
49. Repartitioning Data
February 26, 2018
In this video, we will learn how to repartition the data.
48. Pitfalls of UDFs
February 26, 2018
In this video, we will discuss the pitfalls of using pure Python user defined functions.
45. Presenting Data
February 26, 2018
In this video, we will learn how to present data.
44. Transforming Data
February 26, 2018
In this video, we will learn how to transform data.
43. Selecting Data
February 26, 2018
In this video, we will learn how to select data from a DataFrame.
42. Aggregating Data
February 26, 2018
In this video, we will learn how to aggregate data.
41. Filtering Data
February 26, 2018
In this video, we will learn how to filter data.
40. Schema Changes
February 26, 2018
In this video, we will learn how to drop, rename, and handle missing observations.
39. The .distinct(...) Transformation
February 26, 2018
In this video, we will learn how to retrieve distinct values from a DataFrame.
38. Performing Statistical Transformations
February 26, 2018
In this video, we will learn how to calculate descriptive statistics in DataFrames.
37. Joining Two DataFrames
February 26, 2018
In this video, we will learn how to join two DataFrames.
36. Creating Temporary Tables
February 26, 2018
In this video, we will learn how to create temporary views over a DataFrame.
33. Interacting with DataFrames
February 26, 2018
In this video, we will discuss different ways of interacting with DataFrames.
32. Specifying Schema of a DataFrame
February 26, 2018
In this video, we will learn how to specify the schema of a DataFrame.
31. Creating DataFrames
February 26, 2018
In this video, we will learn how to create DataFrames.
30. Introduction
February 26, 2018
In this video, we will provide a brief introduction to Spark DataFrames.
29. Introducing Actions - Descriptive Statistics
February 26, 2018
In this video, we will explore some basic descriptive statistics.
28. Introducing Actions - Saving Data
February 26, 2018
In this video, we will explore how to save data from an RDD.
26. Introducing Actions - .histogram(...)
February 26, 2018
In this video, we will learn how to bin data into buckets.
24. Introducing Actions - .coalesce(...)
February 26, 2018
In this video, we will learn when and why to use the .coalesce(...) method instead of the .repartition(...) method.
22. Introducing Actions - .foreach(...)
February 26, 2018
In this video, we will learn how to execute an action on each element of an RDD in each of its partitions.
20. Introducing Actions - .reduce(...) and .reduceByKey(...)
February 26, 2018
In this video, we will learn two more fundamental methods from the MapReduce paradigm: .reduce(...) and .reduceByKey(...).
12. Introducing Transformations - .filter(...)
February 26, 2018
In this video, we will learn how to filter data from RDDs.
10. Understanding Lazy Execution
February 26, 2018
Spark processes data lazily. In this video, we will learn why this is an advantage.
9. Schema of an RDD
February 26, 2018
In this video, we explore the advantages and disadvantages of RDD's lack of schema.
8. Creating RDDs
February 26, 2018
In this video, we will learn how to create RDDs in many different ways.
6. Cloning GitHub Repository
February 26, 2018
The aim of this video is to clone the GitHub repository for the course. Doing this will set up everything we need for the following videos.
5. Newest Capabilities of PySpark 2.0+
February 26, 2018
The aim of this video is to briefly review the newest features of Spark 2.0+.
4. Spark Execution Process
February 26, 2018
The aim of this video is to briefly review the execution process.
3. Apache Spark Stack
February 26, 2018
The aim of this video is to provide a brief overview of Apache Spark stack components.
2. Brief Introduction to Spark
February 26, 2018
The aim of this video is to explain Spark and its Python interface.
1. The Course Overview
February 26, 2018
This video gives an overview of the entire course.
Where to Watch Learning PySpark
Learning PySpark is available for streaming on the Packt Publishing website, both individual episodes and full seasons. You can also watch Learning PySpark on demand at Amazon.
  • Premiere Date
    February 26, 2018