Practical Data Wrangling Program in Python

This program demonstrates data wrangling techniques such as reshaping, merging, filtering, and summarizing a dataset using Pandas.

Scenario: Employee Management System

We have two datasets: employee details and department information. The goal is to merge, reshape, filter, and summarize the data to prepare it for analysis.

Code Implementation

import pandas as pd

# Employee dataset

employee_data = {

'Emp_ID': [101, 102, 103, 104, 105],

'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],

'Dept_ID': [1, 2, 1, 3, 2],

'Salary': [50000, 60000, 55000, 70000, 65000],

'Join_Date': ['2021-01-15', '2020-03-22', '2022-07-01', '2019-11-10', '2018-05-20']

}

# Department dataset

department_data = {

'Dept_ID': [1, 2, 3],

'Department': ['HR', 'IT', 'Finance'],

'Manager': ['John', 'Susan', 'Chris']

}

# Convert dictionaries to DataFrames

employees = pd.DataFrame(employee_data)

departments = pd.DataFrame(department_data)

# Convert 'Join_Date' to datetime

employees['Join_Date'] = pd.to_datetime(employees['Join_Date'])

# --- Step 1: Merge Datasets ---

merged_data = pd.merge(employees, departments, on='Dept_ID', how='inner')

print("Merged Data:")

print(merged_data)

# --- Step 2: Filter Data ---

# Employees with salary above 60,000

high_salary = merged_data[merged_data['Salary'] > 60000]

print("\nEmployees with Salary > 60,000:")

print(high_salary)

# --- Step 3: Summarize Data ---

# Group by department and calculate average salary

avg_salary_by_dept = merged_data.groupby('Department')['Salary'].mean().reset_index()

print("\nAverage Salary by Department:")

print(avg_salary_by_dept)

# --- Step 4: Reshape Data ---

# Pivot table showing salary by department and employee

pivot_table = merged_data.pivot_table(values='Salary', index='Department', columns='Name', fill_value=0)

print("\nPivot Table (Salary by Department and Employee):")

print(pivot_table)

# --- Step 5: Add New Columns ---

# Add a column for Years of Experience

merged_data['Experience (Years)'] = 2024 - merged_data['Join_Date'].dt.year

print("\nData with Experience Column:")

print(merged_data)

# --- Step 6: Save the Wrangled Data ---

merged_data.to_csv('wrangled_employee_data.csv', index=False)

print("\nWrangled data saved to 'wrangled_employee_data.csv'")

Program Highlights

1. Merging Datasets:

- Combines employee and department data using `merge` based on a common key (`Dept_ID`).

2. Filtering:

- Filters rows to find employees with a salary above a threshold.

3. Summarizing:

- Calculates average salaries grouped by department using `groupby`.

4. Reshaping:

- Creates a pivot table to analyze salaries by department and employee.

5. Feature Engineering:

- Adds a new column for years of experience calculated from `Join_Date`.

6. Data Export:

- Saves the wrangled data to a CSV file for future use.

Sample Output

Merged Data:

Key Techniques in Data Wrangling

1. Combining Data: Merge datasets to unify data sources.

2. Transforming Data: Create new columns or change data structures.

3. Filtering: Extract subsets of data for targeted analysis.

4. Summarizing: Aggregate data for insights.

5. Reshaping: Change the format of the data for better visualization or analysis.

This program provides a solid foundation for data wrangling in Python and can be extended for more complex datasets.

Technology with Vivek

Search This Blog

Practical Data Wrangling Program in Python

Practical Data Wrangling Program in Python

Comments

Post a Comment