Practical Data Wrangling Program in Python

Practical Data Wrangling Program in Python

This program demonstrates data wrangling techniques such as reshaping, merging, filtering, and summarizing a dataset using Pandas.

Scenario: Employee Management System  
We have two datasets: employee details and department information. The goal is to merge, reshape, filter, and summarize the data to prepare it for analysis.

Code Implementation
import pandas as pd
# Employee dataset
employee_data = {
    'Emp_ID': [101, 102, 103, 104, 105],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Dept_ID': [1, 2, 1, 3, 2],
    'Salary': [50000, 60000, 55000, 70000, 65000],
    'Join_Date': ['2021-01-15', '2020-03-22', '2022-07-01', '2019-11-10', '2018-05-20']
}
# Department dataset
department_data = {
    'Dept_ID': [1, 2, 3],
    'Department': ['HR', 'IT', 'Finance'],
    'Manager': ['John', 'Susan', 'Chris']
}
# Convert dictionaries to DataFrames
employees = pd.DataFrame(employee_data)
departments = pd.DataFrame(department_data)
# Convert 'Join_Date' to datetime
employees['Join_Date'] = pd.to_datetime(employees['Join_Date'])
# --- Step 1: Merge Datasets ---
merged_data = pd.merge(employees, departments, on='Dept_ID', how='inner')
print("Merged Data:")
print(merged_data)
# --- Step 2: Filter Data ---
# Employees with salary above 60,000
high_salary = merged_data[merged_data['Salary'] > 60000]
print("\nEmployees with Salary > 60,000:")
print(high_salary)
# --- Step 3: Summarize Data ---
# Group by department and calculate average salary
avg_salary_by_dept = merged_data.groupby('Department')['Salary'].mean().reset_index()
print("\nAverage Salary by Department:")
print(avg_salary_by_dept)
# --- Step 4: Reshape Data ---
# Pivot table showing salary by department and employee
pivot_table = merged_data.pivot_table(values='Salary', index='Department', columns='Name', fill_value=0)
print("\nPivot Table (Salary by Department and Employee):")
print(pivot_table)
# --- Step 5: Add New Columns ---
# Add a column for Years of Experience
merged_data['Experience (Years)'] = 2024 - merged_data['Join_Date'].dt.year
print("\nData with Experience Column:")
print(merged_data)
# --- Step 6: Save the Wrangled Data ---
merged_data.to_csv('wrangled_employee_data.csv', index=False)
print("\nWrangled data saved to 'wrangled_employee_data.csv'")

Program Highlights
1. Merging Datasets:
   - Combines employee and department data using `merge` based on a common key (`Dept_ID`).

2. Filtering:
   - Filters rows to find employees with a salary above a threshold.

3. Summarizing:
   - Calculates average salaries grouped by department using `groupby`.

4. Reshaping:
   - Creates a pivot table to analyze salaries by department and employee.

5. Feature Engineering:
   - Adds a new column for years of experience calculated from `Join_Date`.

6. Data Export:
   - Saves the wrangled data to a CSV file for future use.

Sample Output
Merged Data:


Key Techniques in Data Wrangling
1. Combining Data: Merge datasets to unify data sources.
2. Transforming Data: Create new columns or change data structures.
3. Filtering: Extract subsets of data for targeted analysis.
4. Summarizing: Aggregate data for insights.
5. Reshaping: Change the format of the data for better visualization or analysis.

This program provides a solid foundation for data wrangling in Python and can be extended for more complex datasets.

Comments