Data Cleaning and Preparation in Python
Here’s a practical program that demonstrates data cleaning and preparation with pandas. It builds a small sample DataFrame with deliberate problems, handles missing values, removes duplicates, standardizes text formatting, engineers new columns, and saves the cleaned result to a CSV file.
import pandas as pd
import numpy as np
# Sample Data: Create a DataFrame with some issues (missing values, duplicates, etc.)
data = {
    'Name': ['Alice', 'Bob', 'Charlie', None, 'Eva', 'Bob'],
    'Age': [24, 27, 22, 25, None, 27],
    'Department': ['HR', 'IT', 'Finance', 'HR', 'IT', 'IT'],
    'Salary': [50000, 60000, None, 52000, 65000, 60000],
    'Joining_Date': ['2021-01-15', '2020-03-22', '2022-07-01', None, '2019-11-10', '2020-03-22']
}
# Create the DataFrame
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# --- Step 1: Handle Missing Values ---
# Fill missing 'Name' with "Unknown"
df['Name'] = df['Name'].fillna('Unknown')
# Fill missing 'Age' with the average age
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Fill missing 'Salary' with the median salary
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
# Convert 'Joining_Date' to datetime and fill missing dates with a default value
df['Joining_Date'] = pd.to_datetime(df['Joining_Date'], errors='coerce')
df['Joining_Date'] = df['Joining_Date'].fillna(pd.to_datetime('2000-01-01'))
# --- Step 2: Remove Duplicates ---
df = df.drop_duplicates()
# --- Step 3: Standardize Data ---
# Convert 'Department' to lowercase
df['Department'] = df['Department'].str.lower()
# --- Step 4: Create New Columns ---
# Create a column for years of experience based on 'Joining_Date' (using 2024 as the reference year)
df['Experience (Years)'] = 2024 - df['Joining_Date'].dt.year
# Create a column categorizing employees based on their salary
df['Salary Category'] = pd.cut(
    df['Salary'],
    bins=[0, 50000, 60000, 70000],
    labels=['Low', 'Medium', 'High']
)
# --- Step 5: Validate Data ---
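# A minimal sanity check (an assumed interpretation of "validate", not part of the original program):
# confirm that no missing values or duplicate rows remain after cleaning.
print("\nRemaining missing values per column:")
print(df.isna().sum())
print("Remaining duplicate rows:", df.duplicated().sum())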
print("\nCleaned and Prepared DataFrame:")
print(df)
# --- Step 6: Save Cleaned Data ---
df.to_csv('cleaned_employee_data.csv', index=False)
print("\nCleaned data saved to 'cleaned_employee_data.csv'")
Program Highlights
1. Handling Missing Values:
- Uses methods like `fillna` to replace missing values with mean, median, or default values.
- Converts strings to datetime with error handling (see the sketch after this list).
2. Removing Duplicates:
- Removes duplicate rows using `drop_duplicates`.
3. Standardizing Data:
- Ensures consistent formatting, e.g., lowercase for department names.
4. Feature Engineering:
- Adds new columns for years of experience and salary categorization using custom logic.
5. Data Export:
- Saves the cleaned dataset to a CSV file for further use.
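To make the datetime error handling in point 1 concrete, here is a small standalone sketch (with illustrative values, not taken from the dataset above) showing how errors='coerce' turns unparseable strings into NaT instead of raising an exception:
# Hypothetical example values: one valid date, one malformed string, one missing entry
dates = pd.Series(['2021-01-15', 'not a date', None])
parsed = pd.to_datetime(dates, errors='coerce')  # invalid entries become NaT rather than raising
print(parsed)
Any NaT values produced this way can then be filled with a default date, exactly as the program does in Step 1.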
Sample Output
Original DataFrame:
      Name   Age Department   Salary Joining_Date
0    Alice  24.0         HR  50000.0   2021-01-15
1      Bob  27.0         IT  60000.0   2020-03-22
2  Charlie  22.0    Finance      NaN   2022-07-01
3     None  25.0         HR  52000.0         None
4      Eva   NaN         IT  65000.0   2019-11-10
5      Bob  27.0         IT  60000.0   2020-03-22
Key Features
- Missing Data Handling: Ensures no empty fields in the final dataset.
- Data Standardization: Provides consistency across columns like `Department`.
- Feature Engineering: Adds meaningful insights, e.g., years of experience (a runtime-based variant is sketched after this list).
- Data Validation: Outputs the cleaned data for review and saves it for future analysis.
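The experience calculation above uses 2024 as a fixed reference year. If you would rather have the value stay current, one reasonable variant (not part of the original program) derives the reference year at runtime:
# Compute experience relative to the current year instead of a hardcoded 2024
current_year = pd.Timestamp.now().year
df['Experience (Years)'] = current_year - df['Joining_Date'].dt.year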
This program is practical for real-world scenarios involving incomplete or inconsistent datasets.