Practical Program: Data Aggregation in Python
This program demonstrates data aggregation using Pandas to analyze a dataset. Aggregation operations such as `sum`, `mean`, `count`, and `max` is applied to group and summarize the data.
Scenario: Sales Analysis
We have a dataset of sales transactions for multiple regions, products, and salespersons. The goal is to aggregate and analyze the data to extract insights.
Code Implementation
import pandas as pd
# Sample dataset
data = {
'Region': ['North', 'North', 'South', 'South', 'East', 'East', 'West', 'West'],
'Salesperson': ['Alice', 'Bob', 'Alice', 'Charlie', 'Eva', 'David', 'Frank', 'Grace'],
'Product': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
'Sales': [200, 300, 400, 500, 250, 450, 600, 550],
'Units_Sold': [10, 15, 20, 25, 12, 18, 30, 28],
'Month': ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar', 'Apr', 'Apr']
}
# Create a DataFrame
df = pd.DataFrame(data)
print("Original Data:")
print(df)
# --- 1. Group by Region and Aggregate Sales and Units Sold ---
region_agg = df.groupby('Region').agg({
'Sales': 'sum',
'Units_Sold': ['sum', 'mean']
})
print("\nAggregated Sales and Units Sold by Region:")
print(region_agg)
# --- 2. Group by Product and Find Total Sales and Maximum Units Sold ---
product_agg = df.groupby('Product').agg({
'Sales': 'sum',
'Units_Sold': 'max'
})
print("\nTotal Sales and Max Units Sold by Product:")
print(product_agg)
# --- 3. Group by Region and Product for Combined Analysis ---
region_product_agg = df.groupby(['Region', 'Product']).agg({
'Sales': 'sum',
'Units_Sold': 'mean'
})
print("\nSales and Average Units Sold by Region and Product:")
print(region_product_agg)
# --- 4. Aggregate Data by Salesperson ---
salesperson_agg = df.groupby('Salesperson').agg({
'Sales': ['sum', 'max'],
'Units_Sold': ['mean', 'count']
})
print("\nSales and Units Sold Aggregation by Salesperson:")
print(salesperson_agg)
# --- 5. Aggregate Data by Month ---
month_agg = df.groupby('Month').agg({
'Sales': 'sum',
'Units_Sold': 'sum'
})
print("\nMonthly Sales and Total Units Sold:")
print(month_agg)
# --- Optional: Resetting Index for Flattened Output ---
region_agg_reset = region_agg.reset_index()
print("\nFlattened Aggregated Data by Region:")
print(region_agg_reset)
Explanation of Code
1. Group by Single Column:
- Aggregates data by `Region` or `Product` using multiple functions such as `sum` and `mean`.
2. Group by Multiple Columns:
- Groups by both `Region` and `Product` for a detailed breakdown of sales and units sold.
3. Custom Aggregation:
- Uses dictionaries to apply different aggregation functions to specific columns.
4. Advanced Aggregation:
- Aggregates data by `Salesperson` and `Month` to analyze performance and monthly trends.
5. Reset Index:
- Converts the hierarchical index into a flat data frame for easier readability.
Sample Output
Original Data:
Region Salesperson Product Sales Units_Sold Month
0 North Alice A 200 10 Jan
1 North Bob A 300 15 Jan
2 South Alice B 400 20 Feb
3 South Charlie B 500 25 Feb
4 East Eva C 250 12 Mar
5 East David C 450 18 Mar
6 West Frank D 600 30 Apr
7 West Grace D 550 28 Apr
Aggregated Sales and Units Sold by Region:
Sales Units_Sold
sum sum mean
Region
East 700 30 15.0
North 500 25 12.5
South 900 45 22.5
West 1150 58 29.0
Total Sales and Max Units Sold by Product:
Sales Units_Sold
Product
A 500 15
B 900 25
C 700 18
D 1150 30
Sales and Average Units Sold by Region and Product:
Sales Units_Sold
Region Product
East C 700 15.0
North A 500 12.5
South B 900 22.5
West D 1150 29.0
Sales and Units Sold Aggregation by Salesperson:
Sales Units_Sold
sum max mean count
Salesperson
Alice 600 400 15.0 2
Bob 300 300 15.0 1
Charlie 500 500 25.0 1
David 450 450 18.0 1
Eva 250 250 12.0 1
Frank 600 600 30.0 1
Grace 550 550 28.0 1
Monthly Sales and Total Units Sold:
Sales Units_Sold
Month
Apr 1150 58
Feb 900 45
Jan 500 25
Mar 700 30
Flattened Aggregated Data by Region:
Region Sales Units_Sold_sum Units_Sold_mean
0 East 700 30 15.0
1 North 500 25 12.5
2 South 900 45 22.5
3 West 1150 58 29.0
Applications of Data Aggregation
- Sales Analysis: Summarize sales by regions, products, or time periods.
- Performance Tracking: Monitor salesperson performance.
- Trend Analysis: Understand trends over months or quarters.
- Inventory Management: Aggregate units sold for forecasting.
This program demonstrates essential aggregation techniques for effective data summarization and insights extraction in Python.
Comments
Post a Comment