How to Drop All Duplicate Rows in Python Pandas


By squashlabs, Last Updated: October 15, 2023


To drop all duplicate rows in a pandas DataFrame in Python, you can use the drop_duplicates() method. By default, it compares every column, keeps the first occurrence of each set of identical rows, and removes the rest. Here are two possible approaches you can take to drop duplicate rows in Python using pandas:

Approach 1: Using the drop_duplicates() method

The simplest and most straightforward way to drop duplicate rows in a pandas DataFrame is by using the drop_duplicates() method. It compares all columns by default and keeps only the first occurrence of each duplicated row.

Here’s an example of how you can use the drop_duplicates() method to drop all duplicate rows:

import pandas as pd

# Create a sample DataFrame with duplicate rows
data = {'col1': [1, 2, 3, 2, 4, 3],
        'col2': ['A', 'B', 'C', 'B', 'D', 'C']}
df = pd.DataFrame(data)

# Drop all duplicate rows
df.drop_duplicates(inplace=True)

# Print the resulting DataFrame
print(df)

Output:

   col1 col2
0     1    A
1     2    B
2     3    C
4     4    D

In this example, we create a DataFrame with duplicate rows in the ‘col1’ and ‘col2’ columns. We then use the drop_duplicates() method with the inplace=True parameter to modify the original DataFrame and remove all duplicate rows. Finally, we print the resulting DataFrame without the duplicate rows.
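If you prefer to keep the original DataFrame untouched, a minimal variation of the same example (illustrative only) assigns the result of drop_duplicates() to a new variable instead of passing inplace=True:

import pandas as pd

# Same sample data as above
data = {'col1': [1, 2, 3, 2, 4, 3],
        'col2': ['A', 'B', 'C', 'B', 'D', 'C']}
df = pd.DataFrame(data)

# Returns a new DataFrame without duplicates; df itself is left unchanged
df_unique = df.drop_duplicates()

print(df_unique)

The output is the same as above, but df still contains all six rows, which is useful when you want to compare the data before and after deduplication.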

Related Article: How To Create Pandas Dataframe From Variables - Valueerror

Approach 2: Dropping duplicate rows based on specific columns

In some cases, you may want to drop duplicate rows based on specific columns in your DataFrame. To achieve this, you can pass a subset of columns to the drop_duplicates() method.

Here’s an example of how you can drop duplicate rows based on specific columns:

import pandas as pd

# Create a sample DataFrame with duplicate rows
data = {'col1': [1, 2, 3, 2, 4, 3],
        'col2': ['A', 'B', 'C', 'B', 'D', 'C'],
        'col3': ['X', 'Y', 'Z', 'Y', 'W', 'Z']}
df = pd.DataFrame(data)

# Drop duplicate rows based on 'col1' and 'col2'
df.drop_duplicates(subset=['col1', 'col2'], inplace=True)

# Print the resulting DataFrame
print(df)

Output:

   col1 col2 col3
0     1    A    X
1     2    B    Y
2     3    C    Z
4     4    D    W

In this example, we create a DataFrame with duplicate rows in the ‘col1’, ‘col2’, and ‘col3’ columns. We then use the drop_duplicates() method with the subset=['col1', 'col2'] parameter to drop duplicate rows based on the values in the ‘col1’ and ‘col2’ columns. Finally, we print the resulting DataFrame without the duplicate rows.
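Before dropping anything, it can be useful to check which rows pandas would treat as duplicates for a given subset. The sketch below uses the same sample data together with the duplicated() method; the variable name mask is just for illustration:

import pandas as pd

data = {'col1': [1, 2, 3, 2, 4, 3],
        'col2': ['A', 'B', 'C', 'B', 'D', 'C'],
        'col3': ['X', 'Y', 'Z', 'Y', 'W', 'Z']}
df = pd.DataFrame(data)

# Boolean Series: True where a row's 'col1'/'col2' values repeat an earlier row
mask = df.duplicated(subset=['col1', 'col2'])

print(mask)        # which rows would be dropped
print(mask.sum())  # how many rows would be dropped

This gives you a quick sanity check on how many rows a subsequent drop_duplicates(subset=['col1', 'col2']) call would remove.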

Best practices and considerations

When dropping duplicate rows in a pandas DataFrame, keep the following considerations in mind:

1. Be cautious when using the inplace=True parameter with the drop_duplicates() method. This parameter modifies the original DataFrame in place, meaning that it permanently removes the duplicate rows from the DataFrame. If you want to preserve the original DataFrame, consider assigning the result of the drop_duplicates() method to a new DataFrame variable.

2. If you want to drop duplicate rows based on a subset of columns, make sure to pass the column names as a list to the subset parameter of the drop_duplicates() method. You can include multiple columns by providing a list of column names.

3. By default, the drop_duplicates() method keeps the first occurrence of each unique row and removes all subsequent occurrences. If you want to keep the last occurrence of each unique row and remove all previous occurrences, you can use the keep='last' parameter.

4. If you want to drop duplicate rows based on a specific column and keep the first or last occurrence, you can use the drop_duplicates() method with the subset and keep parameters. For example, to drop duplicate rows based on the ‘col1’ column and keep the first occurrence, you can use df.drop_duplicates(subset=['col1'], keep='first').

5. If you want to remove every occurrence of a duplicated value rather than keep one, pass keep=False. For example, df.drop_duplicates(subset=['col1'], keep=False) removes all rows whose ‘col1’ value appears more than once. You can achieve the same result with the duplicated() method and boolean indexing: df[~df.duplicated(subset=['col1'], keep=False)]. A short comparison of the keep options follows this list.
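To make the difference between the keep options concrete, here is a small sketch using the earlier sample data (the variable names are only illustrative):

import pandas as pd

data = {'col1': [1, 2, 3, 2, 4, 3],
        'col2': ['A', 'B', 'C', 'B', 'D', 'C']}
df = pd.DataFrame(data)

# keep='first' (the default): keep the first row of each duplicate group
first_kept = df.drop_duplicates(subset=['col1'], keep='first')

# keep='last': keep the last row of each duplicate group
last_kept = df.drop_duplicates(subset=['col1'], keep='last')

# keep=False: drop every row whose 'col1' value appears more than once
none_kept = df.drop_duplicates(subset=['col1'], keep=False)

# Equivalent boolean-indexing form of keep=False
none_kept_mask = df[~df.duplicated(subset=['col1'], keep=False)]

print(first_kept)
print(last_kept)
print(none_kept)

Running this, first_kept and last_kept each contain four rows (one per unique ‘col1’ value), while none_kept contains only the rows with ‘col1’ values 1 and 4, since 2 and 3 are duplicated.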

Overall, the drop_duplicates() method provides a convenient way to drop all duplicate rows in a pandas DataFrame. By specifying the appropriate parameters, you can customize the behavior of the method to suit your specific requirements.

For more information on the drop_duplicates() method and other data manipulation techniques in pandas, you can refer to the official pandas documentation: pandas.DataFrame.drop_duplicates().

Related Article: How to Sort a Pandas Dataframe by One Column in Python
