Data Cleaning and Preparation in Python
Here’s a practical program that demonstrates data cleaning and preparation with pandas. It builds a small sample DataFrame with deliberate problems, handles missing values, removes duplicates, standardizes text formatting, engineers new columns, and saves the cleaned result to a CSV file.
import pandas as pd
import numpy as np
# Sample Data: Create a DataFrame with some issues (missing values, duplicates, etc.)
data = {
    'Name': ['Alice', 'Bob', 'Charlie', None, 'Eva', 'Bob'],
    'Age': [24, 27, 22, 25, None, 27],
    'Department': ['HR', 'IT', 'Finance', 'HR', 'IT', 'IT'],
    'Salary': [50000, 60000, None, 52000, 65000, 60000],
    'Joining_Date': ['2021-01-15', '2020-03-22', '2022-07-01', None, '2019-11-10', '2020-03-22']
}
# Create the DataFrame
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# --- Step 1: Handle Missing Values ---
# Fill missing 'Name' with "Unknown"
df['Name'] = df['Name'].fillna('Unknown')
# Fill missing 'Age' with the average age
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Fill missing 'Salary' with the median salary
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
# Convert 'Joining_Date' to datetime and fill missing dates with a default value
df['Joining_Date'] = pd.to_datetime(df['Joining_Date'], errors='coerce')
df['Joining_Date'] = df['Joining_Date'].fillna(pd.to_datetime('2000-01-01'))
# --- Step 2: Remove Duplicates ---
df = df.drop_duplicates()
# --- Step 3: Standardize Data ---
# Convert 'Department' to lowercase
df['Department'] = df['Department'].str.lower()
# --- Step 4: Create New Columns ---
# Create a column for years of experience based on 'Joining_Date' (using 2024 as the reference year)
df['Experience (Years)'] = 2024 - df['Joining_Date'].dt.year
# Create a column categorizing employees based on their salary
df['Salary Category'] = pd.cut(
    df['Salary'],
    bins=[0, 50000, 60000, 70000],
    labels=['Low', 'Medium', 'High']
)
# --- Step 5: Validate Data ---
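# A minimal sanity check (an assumed interpretation of "validate", not part of the original program):
# confirm that no missing values or duplicate rows remain after cleaning.
print("\nRemaining missing values per column:")
print(df.isna().sum())
print("Remaining duplicate rows:", df.duplicated().sum())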
print("\nCleaned and Prepared DataFrame:")
print(df)
# --- Step 6: Save Cleaned Data ---
df.to_csv('cleaned_employee_data.csv', index=False)
print("\nCleaned data saved to 'cleaned_employee_data.csv'")
Program Highlights
1. Handling Missing Values:
- Uses methods like `fillna` to replace missing values with mean, median, or default values.
- Converts strings to datetime with error handling (see the sketch after this list).
2. Removing Duplicates:
- Removes duplicate rows using `drop_duplicates`.
3. Standardizing Data:
- Ensures consistent formatting, e.g., lowercase for department names.
4. Feature Engineering:
- Adds new columns for years of experience and salary categorization using custom logic.
5. Data Export:
- Saves the cleaned dataset to a CSV file for further use.
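To make the datetime error handling in point 1 concrete, here is a small standalone sketch (with illustrative values, not taken from the dataset above) showing how errors='coerce' turns unparseable strings into NaT instead of raising an exception:
# Hypothetical example values: one valid date, one malformed string, one missing entry
dates = pd.Series(['2021-01-15', 'not a date', None])
parsed = pd.to_datetime(dates, errors='coerce')  # invalid entries become NaT rather than raising
print(parsed)
Any NaT values produced this way can then be filled with a default date, exactly as the program does in Step 1.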
Sample Output
Original DataFrame:
      Name   Age Department   Salary Joining_Date
0    Alice  24.0         HR  50000.0   2021-01-15
1      Bob  27.0         IT  60000.0   2020-03-22
2  Charlie  22.0    Finance      NaN   2022-07-01
3     None  25.0         HR  52000.0         None
4      Eva   NaN         IT  65000.0   2019-11-10
5      Bob  27.0         IT  60000.0   2020-03-22
Key Features
- Missing Data Handling: Ensures no empty fields in the final dataset.
- Data Standardization: Provides consistency across columns like `Department`.
- Feature Engineering: Adds meaningful insights, e.g., years of experience (a runtime-based variant is sketched after this list).
- Data Validation: Outputs the cleaned data for review and saves it for future analysis.
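The experience calculation above uses 2024 as a fixed reference year. If you would rather have the value stay current, one reasonable variant (not part of the original program) derives the reference year at runtime:
# Compute experience relative to the current year instead of a hardcoded 2024
current_year = pd.Timestamp.now().year
df['Experience (Years)'] = current_year - df['Joining_Date'].dt.year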
This program is practical for real-world scenarios involving incomplete or inconsistent datasets.