To drop duplicate rows from a pandas DataFrame in Python, use the drop_duplicates() method. By default, it treats two rows as duplicates when they hold identical values in every column, keeps the first occurrence, and removes the later ones. This article covers two approaches: dropping rows that are duplicated across all columns, and dropping rows that are duplicated in a chosen subset of columns.
Approach 1: Using the drop_duplicates() method
The simplest way to remove duplicate rows is to call drop_duplicates() with no arguments, which compares rows across all columns.
Here’s an example:
    import pandas as pd

    # Create a sample DataFrame with duplicate rows
    data = {'col1': [1, 2, 3, 2, 4, 3], 'col2': ['A', 'B', 'C', 'B', 'D', 'C']}
    df = pd.DataFrame(data)

    # Drop all duplicate rows
    df.drop_duplicates(inplace=True)

    # Print the resulting DataFrame
    print(df)
Output:
       col1 col2
    0     1    A
    1     2    B
    2     3    C
    4     4    D
In this example, the rows at index 3 and 5 repeat the values of the rows at index 1 and 2 in both ‘col1’ and ‘col2’. Calling drop_duplicates() with inplace=True modifies the original DataFrame and removes those repeats, keeping the first occurrence of each row. The surviving rows keep their original index labels, which is why the index in the output jumps from 2 to 4.
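The retained index labels leave a gap wherever a duplicate was removed. If a contiguous index is preferable, one option is the ignore_index parameter, which is available in pandas 1.0 and later. The sketch below reuses the sample data and assigns the result to a new variable (deduped, a name chosen here purely for illustration) instead of modifying df in place:

```python
import pandas as pd

# Same sample data as in the example above
data = {'col1': [1, 2, 3, 2, 4, 3], 'col2': ['A', 'B', 'C', 'B', 'D', 'C']}
df = pd.DataFrame(data)

# ignore_index=True relabels the remaining rows 0..n-1 (requires pandas >= 1.0)
deduped = df.drop_duplicates(ignore_index=True)

print(deduped.index.tolist())  # [0, 1, 2, 3]
print(len(df))                 # 6, the original DataFrame is unchanged
```

On older pandas versions, calling reset_index(drop=True) on the result should give the same relabelled index.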
Approach 2: Dropping duplicate rows based on specific columns
In some cases, you may want to drop duplicate rows based on specific columns in your DataFrame. To achieve this, pass those columns to the subset parameter of the drop_duplicates() method.
Here’s an example of how you can drop duplicate rows based on specific columns:
    import pandas as pd

    # Create a sample DataFrame with duplicate rows
    data = {'col1': [1, 2, 3, 2, 4, 3], 'col2': ['A', 'B', 'C', 'B', 'D', 'C'], 'col3': ['X', 'Y', 'Z', 'Y', 'W', 'Z']}
    df = pd.DataFrame(data)

    # Drop duplicate rows based on 'col1' and 'col2'
    df.drop_duplicates(subset=['col1', 'col2'], inplace=True)

    # Print the resulting DataFrame
    print(df)
Output:
       col1 col2 col3
    0     1    A    X
    1     2    B    Y
    2     3    C    Z
    4     4    D    W
In this example, the DataFrame has a third column, ‘col3’. Passing subset=['col1', 'col2'] tells drop_duplicates() to compare only ‘col1’ and ‘col2’ when deciding whether a row is a duplicate; ‘col3’ is ignored for the comparison but is kept in the output. In this sample data the repeated rows also match in ‘col3’, so the result is the same as calling drop_duplicates() without subset.
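The subset parameter only changes the result when rows agree on the subset columns but differ somewhere else. The following sketch uses a small made-up DataFrame (not the sample data above, and with illustrative variable names) in which two rows share ‘col1’ and ‘col2’ but differ in ‘col3’:

```python
import pandas as pd

# Hypothetical data: rows 1 and 2 match on col1/col2 but differ in col3
df = pd.DataFrame({
    'col1': [1, 2, 2, 3],
    'col2': ['A', 'B', 'B', 'C'],
    'col3': ['X', 'Y', 'Q', 'Z'],
})

# Comparing all columns: no two rows are fully identical, so nothing is dropped
full = df.drop_duplicates()

# Comparing only col1 and col2: row 2 counts as a duplicate of row 1
by_subset = df.drop_duplicates(subset=['col1', 'col2'])

print(len(full))                  # 4
print(by_subset.index.tolist())   # [0, 1, 3]
print(by_subset['col3'].tolist()) # ['X', 'Y', 'Z']
```

Note that the ‘col3’ value of the dropped row ('Q') is lost, so it is worth checking which occurrence you actually want to keep before deduplicating on a subset.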
Best practices and considerations
When dropping duplicate rows in a pandas DataFrame, keep the following considerations in mind:
1. Be cautious when using the inplace=True parameter with the drop_duplicates() method. It modifies the original DataFrame, so the removed rows are no longer available from that variable, and the call itself returns None. If you want to preserve the original DataFrame, omit inplace and assign the result of drop_duplicates() to a new variable.
2. If you want to drop duplicate rows based on a subset of columns, pass the column names as a list to the subset parameter of the drop_duplicates() method. A single column label also works.
3. By default, the drop_duplicates() method keeps the first occurrence of each duplicated row and removes all subsequent occurrences. To keep the last occurrence instead, pass keep='last'. To remove every row that has a duplicate, keeping none of them, pass keep=False.
4. The subset and keep parameters can be combined. For example, to drop duplicate rows based on the ‘col1’ column and keep the first occurrence, use df.drop_duplicates(subset=['col1'], keep='first').
5. If you want to inspect duplicates before removing them, use the duplicated() method, which returns a boolean Series marking the rows that drop_duplicates() would remove. Filtering with df[~df.duplicated(subset=['col1'])] gives the same result as df.drop_duplicates(subset=['col1']), keeping the first occurrence. To see every occurrence of a duplicated value, including the first, use df[df.duplicated(subset=['col1'], keep=False)].
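The keep options and the duplicated() mask described in the list above can be compared side by side. This is a minimal sketch on made-up data, with variable names chosen only for illustration:

```python
import pandas as pd

# Hypothetical data: the values 2 and 3 each appear twice in col1
df = pd.DataFrame({
    'col1': [1, 2, 3, 2, 4, 3],
    'col2': ['A', 'B', 'C', 'E', 'D', 'F'],
})

# keep='first' (the default), keep='last', and keep=False
first = df.drop_duplicates(subset=['col1'], keep='first')
last = df.drop_duplicates(subset=['col1'], keep='last')
none_kept = df.drop_duplicates(subset=['col1'], keep=False)

print(first.index.tolist())      # [0, 1, 2, 4]
print(last.index.tolist())       # [0, 3, 4, 5]
print(none_kept.index.tolist())  # [0, 4]

# duplicated() marks the rows that drop_duplicates() would remove
mask = df.duplicated(subset=['col1'])
print(mask.tolist())             # [False, False, False, True, False, True]
print(df[~mask].equals(first))   # True

# keep=False marks every occurrence, which is handy for inspection
all_dupes = df[df.duplicated(subset=['col1'], keep=False)]
print(all_dupes.index.tolist())  # [1, 2, 3, 5]
```

Because none of these calls uses inplace=True, df itself is left untouched and each result can be examined separately.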
Overall, the drop_duplicates() method provides a convenient way to drop duplicate rows in a pandas DataFrame. The subset, keep, and inplace parameters let you control which columns are compared, which occurrence survives, and whether the original DataFrame is modified.
For more information on the drop_duplicates() method and other data manipulation techniques in pandas, refer to the official pandas documentation for pandas.DataFrame.drop_duplicates().