Practical Data Wrangling Program in Python
This program demonstrates data wrangling techniques such as reshaping, merging, filtering, and summarizing a dataset using Pandas.
Scenario: Employee Management System
We have two datasets: employee details and department information. The goal is to merge, reshape, filter, and summarize the data to prepare it for analysis.
Code Implementation
import pandas as pd
# Employee dataset
employee_data = {
'Emp_ID': [101, 102, 103, 104, 105],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Dept_ID': [1, 2, 1, 3, 2],
'Salary': [50000, 60000, 55000, 70000, 65000],
'Join_Date': ['2021-01-15', '2020-03-22', '2022-07-01', '2019-11-10', '2018-05-20']
}
# Department dataset
department_data = {
'Dept_ID': [1, 2, 3],
'Department': ['HR', 'IT', 'Finance'],
'Manager': ['John', 'Susan', 'Chris']
}
# Convert dictionaries to DataFrames
employees = pd.DataFrame(employee_data)
departments = pd.DataFrame(department_data)
# Convert 'Join_Date' to datetime
employees['Join_Date'] = pd.to_datetime(employees['Join_Date'])
# --- Step 1: Merge Datasets ---
merged_data = pd.merge(employees, departments, on='Dept_ID', how='inner')
print("Merged Data:")
print(merged_data)
# --- Step 2: Filter Data ---
# Employees with salary above 60,000
high_salary = merged_data[merged_data['Salary'] > 60000]
print("\nEmployees with Salary > 60,000:")
print(high_salary)
# --- Step 3: Summarize Data ---
# Group by department and calculate average salary
avg_salary_by_dept = merged_data.groupby('Department')['Salary'].mean().reset_index()
print("\nAverage Salary by Department:")
print(avg_salary_by_dept)
# --- Step 4: Reshape Data ---
# Pivot table showing salary by department and employee
pivot_table = merged_data.pivot_table(values='Salary', index='Department', columns='Name', fill_value=0)
print("\nPivot Table (Salary by Department and Employee):")
print(pivot_table)
# --- Step 5: Add New Columns ---
# Add a column for Years of Experience
merged_data['Experience (Years)'] = 2024 - merged_data['Join_Date'].dt.year
print("\nData with Experience Column:")
print(merged_data)
# --- Step 6: Save the Wrangled Data ---
merged_data.to_csv('wrangled_employee_data.csv', index=False)
print("\nWrangled data saved to 'wrangled_employee_data.csv'")
Program Highlights
1. Merging Datasets:
- Combines employee and department data using `merge` based on a common key (`Dept_ID`).
2. Filtering:
- Filters rows to find employees with a salary above a threshold.
3. Summarizing:
- Calculates average salaries grouped by department using `groupby`.
4. Reshaping:
- Creates a pivot table to analyze salaries by department and employee.
5. Feature Engineering:
- Adds a new column for years of experience calculated from `Join_Date`.
6. Data Export:
- Saves the wrangled data to a CSV file for future use.
Sample Output
Merged Data:
Key Techniques in Data Wrangling
1. Combining Data: Merge datasets to unify data sources.
2. Transforming Data: Create new columns or change data structures.
3. Filtering: Extract subsets of data for targeted analysis.
4. Summarizing: Aggregate data for insights.
5. Reshaping: Change the format of the data for better visualization or analysis.
This program provides a solid foundation for data wrangling in Python and can be extended for more complex datasets.
Comments
Post a Comment