Data Cleaning and Preparation in Python

Here’s a practical program that demonstrates data cleaning and preparation with pandas. It builds a small sample DataFrame with common problems (missing values, duplicates, inconsistent formatting), handles the missing values, removes duplicates, standardizes the data, creates new columns, and saves the cleaned result to a CSV file.

import pandas as pd
import numpy as np
# Sample Data: Create a DataFrame with some issues (missing values, duplicates, etc.)
data = {
    'Name': ['Alice', 'Bob', 'Charlie', None, 'Eva', 'Bob'],
    'Age': [24, 27, 22, 25, None, 27],
    'Department': ['HR', 'IT', 'Finance', 'HR', 'IT', 'IT'],
    'Salary': [50000, 60000, None, 52000, 65000, 60000],
    'Joining_Date': ['2021-01-15', '2020-03-22', '2022-07-01', None, '2019-11-10', '2020-03-22']
}
# Create the DataFrame
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# --- Step 1: Handle Missing Values ---
# Fill missing 'Name' with "Unknown"
df['Name'] = df['Name'].fillna('Unknown')
# Fill missing 'Age' with the average age
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Fill missing 'Salary' with the median salary
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
# Convert 'Joining_Date' to datetime and fill missing dates with a default value
df['Joining_Date'] = pd.to_datetime(df['Joining_Date'], errors='coerce')
df['Joining_Date'] = df['Joining_Date'].fillna(pd.to_datetime('2000-01-01'))
# --- Step 2: Remove Duplicates ---
df = df.drop_duplicates()
# --- Step 3: Standardize Data ---
# Convert 'Department' to lowercase
df['Department'] = df['Department'].str.lower()
# --- Step 4: Create New Columns ---
# Create a column for years of experience based on 'Joining_Date'
df['Experience (Years)'] = 2024 - df['Joining_Date'].dt.year
# Create a column categorizing employees based on their salary
df['Salary Category'] = pd.cut(
    df['Salary'],
    bins=[0, 50000, 60000, 70000],
    labels=['Low', 'Medium', 'High']
)
# --- Step 5: Validate Data ---
print("\nCleaned and Prepared DataFrame:")
print(df)
# --- Step 6: Save Cleaned Data ---
df.to_csv('cleaned_employee_data.csv', index=False)
print("\nCleaned data saved to 'cleaned_employee_data.csv'")

Program Highlights
1. Handling Missing Values:
   - Uses `fillna` to replace missing values with the mean, median, or a default value (a quick inspection sketch follows this list).
   - Converts date strings to datetime with `errors='coerce'`, so malformed or missing dates become `NaT` and can be filled afterwards.
2. Removing Duplicates:
   - Removes duplicate rows using `drop_duplicates`.
3. Standardizing Data:
   - Ensures consistent formatting, e.g., lowercase for department names.
4. Feature Engineering:
   - Adds new columns for years of experience and a salary category, using date arithmetic and `pd.cut` (see the bin-edge sketch after this list).
5. Data Export:
   - Saves the cleaned dataset to a CSV file for further use.
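
Before running the cleaning steps, it helps to see what actually needs fixing. A minimal inspection sketch (the tiny DataFrame below is made up purely to illustrate the calls):

import pandas as pd

# Hypothetical toy data, used only to demonstrate the inspection calls
df = pd.DataFrame({
    'Name': ['Alice', None, 'Bob', 'Bob'],
    'Salary': [50000, 60000, None, None]
})
print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # fully duplicated rows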
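
It also helps to know how `pd.cut` treats the bin edges used for the salary categories: by default the right edge of each bin is inclusive, and values falling outside the bins come back as NaN. A small check with arbitrary sample salaries:

import pandas as pd

# Arbitrary salaries chosen to probe the bin edges
s = pd.Series([50000, 50001, 60000, 65000, 75000])
print(pd.cut(s, bins=[0, 50000, 60000, 70000], labels=['Low', 'Medium', 'High']))
# 50000 -> Low, 50001 and 60000 -> Medium, 65000 -> High, 75000 -> NaN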

Sample Output
Original DataFrame:
      Name   Age Department   Salary Joining_Date
0    Alice  24.0         HR  50000.0   2021-01-15
1      Bob  27.0         IT  60000.0   2020-03-22
2  Charlie  22.0    Finance      NaN   2022-07-01
3     None  25.0         HR  52000.0         None
4      Eva   NaN         IT  65000.0   2019-11-10
5      Bob  27.0         IT  60000.0   2020-03-22

Cleaned and Prepared DataFrame:
      Name   Age Department   Salary Joining_Date  Experience (Years) Salary Category
0    Alice  24.0         hr  50000.0   2021-01-15                   3             Low
1      Bob  27.0         it  60000.0   2020-03-22                   4          Medium
2  Charlie  22.0    finance  60000.0   2022-07-01                   2          Medium
3  Unknown  25.0         hr  52000.0   2000-01-01                  24          Medium
4      Eva  25.0         it  65000.0   2019-11-10                   5            High

Key Features
  • Missing Data Handling: Ensures no empty fields in the final dataset.
  • Data Standardization: Provides consistency across columns like `Department`.
  • Feature Engineering: Adds meaningful insights, e.g., years of experience (a variant based on the current date is sketched below).
  • Data Validation: Outputs the cleaned data for review and saves it for future analysis.
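
For the experience column, the program subtracts the joining year from a hard-coded reference year (2024). A possible variant, sketched here with arbitrary dates, computes tenure relative to the current date instead:

import pandas as pd

# Arbitrary joining dates, for illustration only
df = pd.DataFrame({'Joining_Date': pd.to_datetime(['2021-01-15', '2019-11-10'])})
today = pd.Timestamp.today()
df['Experience (Years)'] = (today - df['Joining_Date']).dt.days // 365
print(df)
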
This program is practical for real-world scenarios involving incomplete or inconsistent datasets.
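
The same steps carry over to data loaded from a file. A sketch, assuming a hypothetical employees.csv with the same column names as the sample data:

import pandas as pd

# 'employees.csv' is a hypothetical file with the same columns as the example above
df = pd.read_csv('employees.csv', parse_dates=['Joining_Date'])
df['Name'] = df['Name'].fillna('Unknown')
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
df = df.drop_duplicates()
df['Department'] = df['Department'].str.lower()
df.to_csv('employees_clean.csv', index=False)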
