edX
Course
Advanced
Free to Audit
Certificate

Big Data Analytics Using Spark

The University of California, San Diego

Learn how to analyze large datasets using Jupyter notebooks, MapReduce and Spark as a platform.

10 hrs/week | 10 weeks | English | 60,398 enrolled

About this Course

This course will be retired on April 4, 2026. Last day to enroll is February 2, 2026, at 00:00 UTC. This course is intended only for learners enrolled in the former Data Science MicroMasters program.

In data science, data is called "big" if it cannot fit into the memory of a single standard laptop or workstation. Analyzing big datasets requires a cluster of tens, hundreds, or thousands of computers. Using such clusters effectively requires distributed file systems, such as the Hadoop Distributed File System (HDFS), and corresponding computational models, such as Hadoop, MapReduce, and Spark.

In this course, part of the Data Science MicroMasters program, you will learn what the bottlenecks are in massively parallel computation and how to use Spark to minimize them. You will learn how to perform supervised and unsupervised machine learning on massive datasets using the Machine Learning Library (MLlib).

In this course, as in the others in this MicroMasters program, you will gain hands-on experience using PySpark within the Jupyter notebook environment.
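To give a feel for the MapReduce model named above, here is a minimal sketch in plain Python. It is not taken from the course materials and does not use the Spark or Hadoop APIs. The function names (`map_partition`, `merge_counts`, `word_count`) and the sample data are hypothetical, and each inner list stands in for a data partition that would live on a separate machine in a real cluster.

```python
from collections import Counter
from functools import reduce


def map_partition(lines):
    """Map step: count words in one partition, independently of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def word_count(partitions):
    # On a cluster the map step would run in parallel, one task per partition.
    # Here it runs sequentially, purely for illustration.
    partials = [map_partition(p) for p in partitions]
    return reduce(merge_counts, partials, Counter())


# Hypothetical sample data: two partitions of text lines.
partitions = [
    ["big data needs clusters", "spark runs on clusters"],
    ["spark keeps data in memory"],
]
totals = word_count(partitions)
print(totals.most_common(3))
```

Only the small per-partition counts are combined in the reduce step, not the raw lines. Keeping the data that moves between machines small is one way to ease the kind of bottleneck the description refers to.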

What You'll Learn

  • Programming Spark using PySpark
  • Identifying the computational tradeoffs in a Spark application
  • Performing data loading and cleaning using Spark and Parquet
  • Modeling data through statistical and machine learning methods

Prerequisites

  • The previous courses in the MicroMasters program: DSE200x, DSE210x, and DSE220x

Instructors

Yoav Freund

Professor of Computer Science and Engineering

Topics

Data Science
MapReduce
Jupyter
Big Data Analytics
PySpark
Big Data
Machine Learning
Distributed File Systems
Hadoop Distributed File System (HDFS)
Apache Hadoop
Apache Spark

Course Info

Platform: edX
Level: Advanced
Pacing: Unknown
Certificate: Available
Price: Free to Audit

Skills

Data Science
MapReduce
Jupyter
Big Data Analytics
PySpark
Big Data
Machine Learning
Distributed File Systems
Hadoop Distributed File System (HDFS)
Apache Hadoop
