PythonForBeginners.com

Learn By Example


PySpark Select Distinct Rows From DataFrame

Author: Aditya Raj
Last Updated: July 17, 2023

When working with a PySpark dataframe, we sometimes need to select unique rows, or unique values from a particular column. In this article, we will discuss three different ways to select distinct rows or distinct column values from a PySpark dataframe.

Table of Contents
  1. Select Distinct Rows From PySpark DataFrame
    1. PySpark Select Distinct Rows Using The distinct() Method
    2. Select Distinct Rows Using The dropDuplicates() Method
    3. Distinct Rows From PySpark DataFrame Using SQL
  2. Select Distinct Rows Based on Multiple Columns in PySpark DataFrame
  3. PySpark Select Unique Values in A Column
  4. PySpark Select Distinct From Multiple Columns
  5. Conclusion

Select Distinct Rows From PySpark DataFrame

There are three ways to select distinct rows from a PySpark dataframe.

  1. Using the distinct() method
  2. Using the dropDuplicates() method
  3. Using a SQL statement

Let us discuss each method one at a time.

PySpark Select Distinct Rows Using The distinct() Method

The distinct() method, when invoked on a PySpark dataframe, returns a new dataframe containing only the unique rows. Hence, we can invoke distinct() directly on a dataframe to select unique rows.

To observe this, we will first read a csv file (sample_csv_file.csv) into a PySpark dataframe.

After reading the csv file into the pyspark dataframe, you can invoke the distinct() method on the pyspark dataframe to get distinct rows as shown below.

import pyspark.sql as ps

# Create a local SparkSession.
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectdistinct_example") \
      .getOrCreate()

# Read the csv file into a dataframe.
dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()

# distinct() returns a new dataframe with duplicate rows removed.
dfs = dfs.distinct()
print("The dataframe with distinct rows is:")
dfs.show()
spark.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris|   78|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   76|     93|       83|
|    Sam|   99|     98|       95|
| Aditya|   45|     98|       71|
+-------+-----+-------+---------+

The dataframe with distinct rows is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
|   Joel|   45|     75|       87|
| Aditya|   45|     89|       71|
|  Chris|   78|     85|       82|
|Katrina|   49|     47|       83|
|    Sam|   99|     98|       95|
| Agatha|   76|     93|       83|
| Aditya|   45|     98|       71|
+-------+-----+-------+---------+

In the above example, you can observe that the original dataframe contains eight rows. After using the distinct() method, we get seven distinct rows from the dataframe.
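Row-level deduplication treats each row as a tuple of all its column values: two rows are duplicates only when every column matches. The same semantics can be sketched in plain Python without Spark; this is only an illustration of the behavior of distinct(), not of Spark's actual distributed implementation, and the rows below are hand-copied from the sample data.

```python
# Plain-Python sketch of distinct() semantics: a row is a duplicate only
# when every column value matches. Column order: (Name, Maths, Physics, Chemistry).
rows = [
    ("Aditya", 45, 89, 71),
    ("Joel", 45, 75, 87),
    ("Joel", 45, 75, 87),    # exact duplicate of the row above -> removed
    ("Aditya", 45, 98, 71),  # differs in Physics -> kept
]

seen = set()
distinct_rows = []
for row in rows:
    if row not in seen:
        seen.add(row)
        distinct_rows.append(row)

print(distinct_rows)  # 3 rows remain
```

Note that the second "Aditya" row survives: it differs from the first in the Physics column, so it is not a duplicate even though the Name matches.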

Select Distinct Rows Using The dropDuplicates() Method

The dropDuplicates() method works in a similar manner to the distinct() method. When we invoke dropDuplicates() on a dataframe without any arguments, it returns the unique rows in the dataframe, just like distinct(), as shown below.

import pyspark.sql as ps

spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectdistinct_example") \
      .getOrCreate()

dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()

# With no arguments, dropDuplicates() removes fully duplicate rows,
# just like distinct().
dfs = dfs.dropDuplicates()
print("The dataframe with distinct rows is:")
dfs.show()
spark.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris|   78|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   76|     93|       83|
|    Sam|   99|     98|       95|
| Aditya|   45|     98|       71|
+-------+-----+-------+---------+

The dataframe with distinct rows is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
|   Joel|   45|     75|       87|
| Aditya|   45|     89|       71|
|  Chris|   78|     85|       82|
|Katrina|   49|     47|       83|
|    Sam|   99|     98|       95|
| Agatha|   76|     93|       83|
| Aditya|   45|     98|       71|
+-------+-----+-------+---------+

In this example, you can observe that we have used the dropDuplicates() method instead of the distinct() method.

Distinct Rows From PySpark DataFrame Using SQL

We can also select distinct rows in a pyspark dataframe using SQL syntax. For this, we will use the following steps.

  1. First, we create a temporary view of the dataframe using the createOrReplaceTempView() method. This method takes the desired name of the view as its argument and registers the dataframe under that name; it does not return anything.
  2. Once the view exists, we can execute a SELECT DISTINCT SQL statement against it to select unique rows. For this, we build a string containing the SQL statement and pass it to the spark.sql() method.
  3. The spark.sql() method returns the output dataframe with distinct rows.

After executing the above statements, we can get the pyspark dataframe with distinct rows as shown in the following example.

import pyspark.sql as ps

spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectdistinct_example") \
      .getOrCreate()

dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()

# Register the dataframe as a temporary view so we can query it with SQL.
dfs.createOrReplaceTempView("df_sql")
dfs = spark.sql("SELECT DISTINCT * FROM df_sql")
print("The dataframe with distinct rows is:")
dfs.show()
spark.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris|   78|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   76|     93|       83|
|    Sam|   99|     98|       95|
| Aditya|   45|     98|       71|
+-------+-----+-------+---------+

The dataframe with distinct rows is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
|   Joel|   45|     75|       87|
| Aditya|   45|     89|       71|
|  Chris|   78|     85|       82|
|Katrina|   49|     47|       83|
|    Sam|   99|     98|       95|
| Agatha|   76|     93|       83|
| Aditya|   45|     98|       71|
+-------+-----+-------+---------+

Select Distinct Rows Based on Multiple Columns in PySpark DataFrame

In the previous examples, we have selected unique rows based on all the columns. However, we can also use specific columns to decide on unique rows.

To select distinct rows based on specific columns, we can pass a list of the column names that decide uniqueness to the dropDuplicates() method. After execution, dropDuplicates() returns a dataframe that keeps a single row for each unique combination of values in the specified columns, while still retaining all the columns. You can observe this in the following example.

import pyspark.sql as ps

spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectdistinct_example") \
      .getOrCreate()

dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()

# Keep one row per unique (Name, Maths) combination; all columns are retained.
dfs = dfs.dropDuplicates(["Name", "Maths"])
print("The dataframe with distinct rows is:")
dfs.show()
spark.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris|   78|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   76|     93|       83|
|    Sam|   99|     98|       95|
| Aditya|   45|     98|       71|
+-------+-----+-------+---------+

The dataframe with distinct rows is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
| Agatha|   76|     93|       83|
|  Chris|   78|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|    Sam|   99|     98|       95|
+-------+-----+-------+---------+

In this example, we first read a csv file to create a PySpark dataframe. Then, we used the dropDuplicates() method to select rows with unique values in the Name and Maths columns, passing the list ["Name", "Maths"] to it. In the output, you can observe that the resulting dataframe still contains all the columns, but the combination of the Name and Maths values is unique in each row.
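The subset-based deduplication above can be sketched in plain Python: key each row by the chosen columns and keep a single row per key. This is only an illustration of the semantics; in Spark, which row survives within a group of duplicates is not guaranteed, whereas this sketch deterministically keeps the first one seen.

```python
# Plain-Python sketch of dropDuplicates(["Name", "Maths"]): key each row
# by its (Name, Maths) pair and keep one row per key.
# Column order: (Name, Maths, Physics, Chemistry).
rows = [
    ("Aditya", 45, 89, 71),
    ("Joel", 45, 75, 87),
    ("Joel", 45, 75, 87),
    ("Aditya", 45, 98, 71),  # same (Name, Maths) as the first row -> dropped
]

kept = {}
for row in rows:
    key = (row[0], row[1])   # the (Name, Maths) pair
    if key not in kept:
        kept[key] = row      # keep the first row seen for this key

result = list(kept.values())
print(result)  # 2 rows: one per unique (Name, Maths) pair
```

The second "Aditya" row is dropped here even though its Physics mark differs, because only Name and Maths participate in the duplicate check.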

PySpark Select Unique Values in A Column

To select distinct values from one column in a pyspark dataframe, we first need to select the particular column using the select() method. Then, we can get distinct values from the column using the distinct() method as shown below.

import pyspark.sql as ps

spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectdistinct_example") \
      .getOrCreate()

dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()

# Select the Name column, then deduplicate its values.
dfs = dfs.select("Name").distinct()
print("The distinct values in the column are:")
dfs.show()
spark.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris|   78|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   76|     93|       83|
|    Sam|   99|     98|       95|
| Aditya|   45|     98|       71|
+-------+-----+-------+---------+

The distinct values in the column are:
+-------+
|   Name|
+-------+
|Katrina|
|  Chris|
| Agatha|
|    Sam|
|   Joel|
| Aditya|
+-------+

In this example, we first selected the Name column using the select() method. Then, we invoked the distinct() method on the selected column to get all the unique values.

Instead of the distinct() method, you can use the dropDuplicates() method to select unique values from a column in a pyspark dataframe as shown in the following example.

import pyspark.sql as ps

spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectdistinct_example") \
      .getOrCreate()

dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()

# dropDuplicates() on the single-column projection works like distinct().
dfs = dfs.select("Name").dropDuplicates()
print("The distinct values in the column are:")
dfs.show()
spark.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris|   78|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   76|     93|       83|
|    Sam|   99|     98|       95|
| Aditya|   45|     98|       71|
+-------+-----+-------+---------+

The distinct values in the column are:
+-------+
|   Name|
+-------+
|Katrina|
|  Chris|
| Agatha|
|    Sam|
|   Joel|
| Aditya|
+-------+

We can also use the SQL SELECT DISTINCT statement to select distinct values from a column in a pyspark dataframe as shown below.

import pyspark.sql as ps

spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectdistinct_example") \
      .getOrCreate()

dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()

# SELECT DISTINCT on a single column of the temporary view.
dfs.createOrReplaceTempView("df_sql")
dfs = spark.sql("SELECT DISTINCT Name FROM df_sql")
print("The distinct values in the column are:")
dfs.show()
spark.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris|   78|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   76|     93|       83|
|    Sam|   99|     98|       95|
| Aditya|   45|     98|       71|
+-------+-----+-------+---------+

The distinct values in the column are:
+-------+
|   Name|
+-------+
|Katrina|
|  Chris|
| Agatha|
|    Sam|
|   Joel|
| Aditya|
+-------+

PySpark Select Distinct From Multiple Columns

To select distinct values from multiple columns, we will first select the desired columns using the select() method. After this, we will use the distinct() method to get the unique value combinations from the selected columns as shown below.

import pyspark.sql as ps

spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectdistinct_example") \
      .getOrCreate()

dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()

# Project the Name and Maths columns, then deduplicate the combinations.
dfs = dfs.select("Name", "Maths").distinct()
print("The distinct values in the columns are:")
dfs.show()
spark.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris|   78|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   76|     93|       83|
|    Sam|   99|     98|       95|
| Aditya|   45|     98|       71|
+-------+-----+-------+---------+

The distinct values in the columns are:
+-------+-----+
|   Name|Maths|
+-------+-----+
| Aditya|   45|
|Katrina|   49|
| Agatha|   76|
|  Chris|   78|
|    Sam|   99|
|   Joel|   45|
+-------+-----+

In this example, we have selected distinct values from the Name and Maths columns. For this, we first selected both columns from the original dataframe using the select() method. Then, we used the distinct() method to select the distinct value combinations. Unlike dropDuplicates() with a column subset, the result contains only the selected columns.
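In plain Python, this project-then-deduplicate pattern amounts to building a set of the projected pairs. Again, this is only a sketch of the semantics (the sort is added here just to make the output order deterministic; Spark does not guarantee any row order):

```python
# Plain-Python sketch of select("Name", "Maths").distinct(): project each
# row down to the chosen columns first, then deduplicate the projections.
# Column order: (Name, Maths, Physics, Chemistry).
rows = [
    ("Aditya", 45, 89, 71),
    ("Joel", 45, 75, 87),
    ("Joel", 45, 75, 87),
    ("Aditya", 45, 98, 71),
]

# Keep only the (Name, Maths) fields of each row.
projected = [(name, maths) for name, maths, _, _ in rows]

# A set removes duplicate pairs; sorting fixes the output order.
distinct_pairs = sorted(set(projected))
print(distinct_pairs)  # [('Aditya', 45), ('Joel', 45)]
```

Both "Aditya" rows collapse into one pair here because, after projection, they are identical.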

Instead of the distinct() method, you can also use the dropDuplicates() method to select distinct values from multiple columns as shown below.

import pyspark.sql as ps

spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectdistinct_example") \
      .getOrCreate()

dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()

# dropDuplicates() on the projected columns behaves like distinct() here.
dfs = dfs.select("Name", "Maths").dropDuplicates()
print("The distinct values in the columns are:")
dfs.show()
spark.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris|   78|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   76|     93|       83|
|    Sam|   99|     98|       95|
| Aditya|   45|     98|       71|
+-------+-----+-------+---------+

The distinct values in the columns are:
+-------+-----+
|   Name|Maths|
+-------+-----+
| Aditya|   45|
|Katrina|   49|
| Agatha|   76|
|  Chris|   78|
|    Sam|   99|
|   Joel|   45|
+-------+-----+

You can also use the SQL SELECT DISTINCT statement with column names to select unique values from multiple columns as shown in the following example.

import pyspark.sql as ps

spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectdistinct_example") \
      .getOrCreate()

dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()

# SELECT DISTINCT with multiple column names in the temporary view.
dfs.createOrReplaceTempView("df_sql")
dfs = spark.sql("SELECT DISTINCT Name, Maths FROM df_sql")
print("The distinct values in the columns are:")
dfs.show()
spark.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris|   78|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   76|     93|       83|
|    Sam|   99|     98|       95|
| Aditya|   45|     98|       71|
+-------+-----+-------+---------+

The distinct values in the columns are:
+-------+-----+
|   Name|Maths|
+-------+-----+
| Aditya|   45|
|Katrina|   49|
| Agatha|   76|
|  Chris|   78|
|    Sam|   99|
|   Joel|   45|
+-------+-----+

Conclusion

In this article, we discussed different ways to select distinct rows or values from a PySpark dataframe. To learn more about PySpark, you can read this article on how to sort a dataframe in PySpark. You might also like this article on filling NaN values in a pandas dataframe.

I hope you enjoyed reading this article. Stay tuned for more informative articles.

Happy Learning!
