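Market Basket Analysis - Pandas Data Analysis Bootcamp

Chapter 1: Association Rule Basics

1.1 What Are Association Rules?

Association rule mining is a data mining technique for discovering interesting relationships between items in a dataset. The classic application is market basket analysis: for example, finding that customers who buy milk also tend to buy bread.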

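Frequent itemset: a combination of items that appears often in the dataset
Support: P(A ∩ B), the probability that A and B are bought together
Confidence: P(B|A), the probability that a customer who buys A also buys B
Lift: P(B|A) / P(B), how much more likely B becomes once A is in the basket; a value of 1 means A and B are independent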

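💡 Association rules are widely used in retail for product recommendation and shelf layout optimization.

1.2 Computing the Core Metrics

The core metrics of an association rule A → B are computed as follows: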

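# Illustrative counts (assumed values, not from a real dataset)
total_transactions = 5   # all transactions
count_a = 3              # transactions containing A
count_b = 4              # transactions containing B
count_ab = 2             # transactions containing both A and B

# Support: P(A ∩ B), the probability that A and B are bought together
support = count_ab / total_transactions

# Confidence: P(B|A), the share of customers buying A who also buy B
confidence = count_ab / count_a

# Lift: confidence relative to the baseline probability P(B)
lift = confidence / (count_b / total_transactions)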
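
To make the three definitions concrete, here is a minimal sketch that computes them directly from a list of baskets in plain Python. The function name `rule_metrics` and the five sample baskets are made up for illustration, and the sketch assumes the antecedent and consequent each occur in at least one basket.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    a, b = set(antecedent), set(consequent)
    n = len(transactions)
    # Count the baskets that contain A, B, and both A and B
    count_a = sum(1 for t in transactions if a <= set(t))
    count_b = sum(1 for t in transactions if b <= set(t))
    count_ab = sum(1 for t in transactions if (a | b) <= set(t))
    support = count_ab / n
    confidence = count_ab / count_a
    lift = confidence / (count_b / n)
    return support, confidence, lift

# Hypothetical sample baskets, for illustration only
baskets = [
    ["milk", "bread", "butter"],
    ["milk", "egg", "butter"],
    ["bread", "egg"],
    ["milk", "bread"],
    ["bread", "egg"],
]

support, confidence, lift = rule_metrics(baskets, ["milk"], ["bread"])
print(f"support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```

In this toy data the lift of milk → bread comes out below 1, because bread is already in 4 of 5 baskets while only 2 of the 3 milk baskets contain it.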

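Chapter 2: The Apriori Algorithm

2.1 How Apriori Works

Apriori is the classic algorithm for mining association rules. Its core idea is that if an itemset is frequent, all of its subsets are frequent too. Equivalently, if an itemset is infrequent, every superset of it is infrequent as well, so those supersets can be pruned without ever being counted.

1. Scan the database to generate candidate itemsets
2. Compute the support of each candidate
3. Prune the itemsets that fall below the minimum support
4. Repeat until no new frequent itemsets are found

📌 The efficiency of Apriori depends heavily on the minimum support threshold: the lower the threshold, the more candidate itemsets have to be generated and counted.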
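
The loop described above can be sketched in a few lines of plain Python. This is a teaching sketch under simplifying assumptions (it rescans every basket for each candidate and is not optimized), not the implementation used by any library; the name `apriori_sketch` and the sample baskets are hypothetical.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for every frequent itemset."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset
        return sum(1 for b in baskets if itemset <= b) / n

    # Level 1: frequent single items
    items = {item for b in baskets for item in b}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: discard candidates that have an infrequent (k-1)-subset
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count step: keep only candidates that meet the minimum support
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

# Hypothetical sample baskets, for illustration only
baskets = [
    ["milk", "bread", "butter"],
    ["milk", "egg", "butter"],
    ["bread", "egg"],
    ["milk", "bread"],
    ["bread", "egg"],
]

for itemset, supp in sorted(apriori_sketch(baskets, 0.4).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), supp)
```

The prune step is where the downward-closure property pays off: a three-item candidate such as milk, bread, butter is dropped without a database scan because its subset bread, butter is not frequent.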

2.2 Implementing Apriori in Python

Use the mlxtend library to run the Apriori algorithm:

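import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Build the basket data: one row per transaction, one column per item
data = {
    'milk': [1, 1, 0, 1, 0],
    'bread': [1, 0, 1, 1, 1],
    'egg': [0, 1, 1, 0, 1],
    'butter': [1, 1, 0, 0, 0]
}
# Boolean columns avoid the warning mlxtend emits for non-bool input
df = pd.DataFrame(data).astype(bool)

# Mine frequent itemsets (support >= 0.4, i.e. at least 2 of the 5 rows)
frequent_itemsets = apriori(df, min_support=0.4, use_colnames=True)

# Generate association rules with confidence >= 0.7
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.7)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])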
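
To show what the rule-generation step does conceptually, here is a plain-Python sketch that splits each frequent itemset into antecedent and consequent and keeps the splits whose confidence reaches the threshold. It is an illustration only and does not claim to mirror mlxtend's internals; the function name `generate_rules` and the hard-coded supports are assumptions chosen for the example.

```python
from itertools import combinations

def generate_rules(frequent, min_confidence):
    """frequent maps frozenset -> support; returns (antecedent, consequent, confidence) tuples."""
    rules = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue  # a rule needs at least one item on each side
        for r in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), r):
                antecedent = frozenset(combo)
                # confidence(A -> B) = support(A ∪ B) / support(A)
                confidence = supp / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules

# Hypothetical frequent itemsets with their supports, for illustration only
frequent = {
    frozenset(["milk"]): 0.6,
    frozenset(["bread"]): 0.8,
    frozenset(["egg"]): 0.6,
    frozenset(["butter"]): 0.4,
    frozenset(["milk", "bread"]): 0.4,
    frozenset(["milk", "butter"]): 0.4,
    frozenset(["bread", "egg"]): 0.4,
}

for antecedent, consequent, confidence in generate_rules(frequent, 0.7):
    print(sorted(antecedent), "->", sorted(consequent), f"confidence={confidence:.2f}")
```

The lookup `frequent[antecedent]` is safe because every subset of a frequent itemset is itself frequent, so it is always present in the dictionary.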

💻 Online code editor

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Build the basket data: one row per transaction, one column per item
data = {
    'milk': [1, 1, 0, 1, 0],
    'bread': [1, 0, 1, 1, 1],
    'egg': [0, 1, 1, 0, 1],
    'butter': [1, 1, 0, 0, 0]
}
# Boolean columns avoid the warning mlxtend emits for non-bool input
df = pd.DataFrame(data).astype(bool)

print('Basket data:')
print(df)

# Mine frequent itemsets (minimum support 0.4)
frequent_itemsets = apriori(df, min_support=0.4, use_colnames=True)
print('\nFrequent itemsets:')
print(frequent_itemsets)

# Generate association rules (minimum confidence 0.7)
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.7)
print('\nAssociation rules:')
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

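Chapter 3: Evaluating and Filtering Rules

3.1 Rule Evaluation Metrics

Beyond support and confidence, several other metrics are worth knowing: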

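# Illustrative probabilities (assumed values, not from a real dataset)
p_a, p_b, support = 0.6, 0.8, 0.4
confidence = support / p_a

# Lift > 1 indicates a positive association between A and B
lift = confidence / p_b

# Leverage: gap between the observed co-occurrence and what independence predicts
leverage = support - p_a * p_b

# Conviction: measures how reliable the rule is (breaks down when confidence == 1)
conviction = (1 - p_b) / (1 - confidence)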
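
The conviction formula divides by 1 - confidence, so a rule that never fails needs special handling. The sketch below computes both metrics from probabilities and treats that edge case by returning infinity, which is a common convention but an assumption here; the function name and the sample probabilities are hypothetical.

```python
import math

def leverage_and_conviction(p_a, p_b, p_ab):
    """Return (leverage, conviction) for the rule A -> B given P(A), P(B), P(A ∩ B)."""
    confidence = p_ab / p_a
    # Leverage is 0 under independence, positive when A and B co-occur more than expected
    leverage = p_ab - p_a * p_b
    # Conviction is reported as infinite for a rule with confidence 1
    conviction = math.inf if confidence == 1 else (1 - p_b) / (1 - confidence)
    return leverage, conviction

# Hypothetical rule butter -> milk: P(butter)=0.4, P(milk)=0.6, P(both)=0.4
print(leverage_and_conviction(0.4, 0.6, 0.4))

# Hypothetical rule milk -> bread: P(milk)=0.6, P(bread)=0.8, P(both)=0.4
print(leverage_and_conviction(0.6, 0.8, 0.4))
```

The second rule has negative leverage and conviction below 1, which matches a lift below 1: the items co-occur less often than independence would predict.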

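3.2 Tips for Filtering Rules

How to pick out the valuable rules from a large rule set:

Set reasonable minimum support and confidence thresholds
Focus on rules with lift greater than 1
Prefer simple rules with few items
Filter with business knowledge in mind

💡 Not every strong rule has real business value; domain experts need to be involved.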
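
The first three tips can be sketched as a small filter-and-sort step. The example below uses plain dictionaries with made-up rule rows so it runs without any library; the function name `shortlist` and the default thresholds are assumptions for illustration. The fourth tip, business judgment, still has to be applied by a person.

```python
# Made-up rule rows for illustration; in practice they would come from a rules table
rules = [
    {"antecedents": ("butter",), "consequents": ("milk",), "support": 0.4, "lift": 1.67},
    {"antecedents": ("milk",), "consequents": ("bread",), "support": 0.4, "lift": 0.83},
    {"antecedents": ("butter", "bread"), "consequents": ("milk",), "support": 0.2, "lift": 1.67},
    {"antecedents": ("egg",), "consequents": ("bread",), "support": 0.4, "lift": 0.83},
]

def shortlist(rules, min_support=0.2, min_lift=1.0):
    """Keep rules above both thresholds, strongest and simplest first."""
    kept = [r for r in rules if r["support"] >= min_support and r["lift"] > min_lift]
    # Sort by lift (descending), then by total number of items (shorter rules first)
    return sorted(
        kept,
        key=lambda r: (-r["lift"], len(r["antecedents"]) + len(r["consequents"])),
    )

for r in shortlist(rules):
    print(r["antecedents"], "->", r["consequents"], "lift:", r["lift"])
```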

Chapter 4: Case Study

4.1 Market Basket Analysis in Practice

Analyze supermarket basket data to uncover interesting product associations:

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Build simulated basket data
transactions = [
    ['apple', 'banana', 'milk'],
    ['banana', 'bread', 'egg'],
    ['apple', 'banana', 'bread'],
    ['milk', 'bread', 'egg'],
    ['apple', 'milk', 'bread'],
    ['banana', 'milk'],
    ['apple', 'banana', 'milk', 'bread'],
    ['bread', 'egg']
]

# Convert to the one-hot format Apriori needs: one row per transaction, one column per item
unique_items = sorted(set(item for t in transactions for item in t))
basket_data = []
for t in transactions:
    row = [item in t for item in unique_items]
    basket_data.append(row)

df = pd.DataFrame(basket_data, columns=unique_items)

# Mine frequent itemsets, then derive rules with confidence >= 0.6
frequent_itemsets = apriori(df, min_support=0.3, use_colnames=True)
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.6)
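
As a sanity check that does not depend on mlxtend, the frequent pairs can be counted by hand with the standard library. The sketch below works on an English-labelled copy of the eight baskets and applies the same 0.3 support threshold; the variable names are chosen for this example only.

```python
from collections import Counter
from itertools import combinations

# English-labelled copy of the eight simulated baskets
transactions = [
    ["apple", "banana", "milk"],
    ["banana", "bread", "egg"],
    ["apple", "banana", "bread"],
    ["milk", "bread", "egg"],
    ["apple", "milk", "bread"],
    ["banana", "milk"],
    ["apple", "banana", "milk", "bread"],
    ["bread", "egg"],
]

# Count every unordered pair of items that appears together in a basket
pair_counts = Counter(
    pair
    for t in transactions
    for pair in combinations(sorted(set(t)), 2)
)

# Keep the pairs whose support reaches the 0.3 threshold
n = len(transactions)
frequent_pairs = {
    pair: count / n
    for pair, count in pair_counts.items()
    if count / n >= 0.3
}

for pair, supp in sorted(frequent_pairs.items()):
    print(pair, supp)
```

With eight baskets, a support of 0.3 means a pair must occur in at least three of them, so pairs seen only once, such as banana and egg, drop out.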

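Exercises (5 in total)

Exercise 1: Create basket data

Create a basket dataset with 5 transactions, each containing 3 to 5 items.

import pandas as pd

# Starter basket data (some transactions have fewer than 3 items; extend them to meet the task)
data = {
    'Transaction': ['T1', 'T2', 'T3', 'T4', 'T5'],
    'Items': [['milk', 'bread', 'egg'],
              ['bread', 'butter'],
              ['milk', 'bread', 'butter'],
              ['egg', 'butter'],
              ['milk', 'egg']]
}

# Convert to the format Apriori needs (a 0-1 matrix)
# Hint: collect all unique items first, then encode each transaction
# TODO: build basket_df here; the print below fails until it is defined

print("Converted basket data:")
print(basket_df)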

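Exercise 2: Compute support

Compute the support of single items and of an item combination.

import pandas as pd

data = {
    'milk': [1, 1, 0, 1, 0],
    'bread': [1, 0, 1, 1, 1],
    'egg': [0, 1, 1, 0, 1],
    'butter': [1, 1, 0, 0, 0]
}
df = pd.DataFrame(data)

# Support of each single item: the column mean of the 0-1 matrix
item_support = df.mean()
print("Support of single items:")
print(item_support)

# Support of {milk, bread}: share of rows where both columns are 1
milk_bread_support = (df['milk'] & df['bread']).mean()
print("\nSupport of {milk, bread}:", milk_bread_support)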

Exercise 3: Apply the Apriori algorithm

Use the mlxtend library to mine frequent itemsets and association rules.

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Build the basket data
data = {
    'milk': [1, 1, 0, 1, 0, 1],
    'bread': [1, 0, 1, 1, 1, 1],
    'egg': [0, 1, 1, 0, 1, 0],
    'butter': [1, 1, 0, 0, 0, 1]
}
df = pd.DataFrame(data)

# Mine frequent itemsets (minimum support 0.4)
frequent_itemsets = apriori(df, min_support=0.4, use_colnames=True)
print("Frequent itemsets:")
print(frequent_itemsets)

# Generate association rules (minimum confidence 0.7)
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.7)
print("\nAssociation rules:")
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

Exercise 4: Filter rules

Pick out the valuable rules from a set of association rules.

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

data = {
    'A': [1, 1, 0, 1, 0, 1, 1, 0],
    'B': [1, 0, 1, 1, 1, 1, 0, 1],
    'C': [0, 1, 1, 0, 1, 0, 0, 1],
    'D': [1, 1, 0, 0, 0, 1, 1, 0]
}
df = pd.DataFrame(data)

frequent_itemsets = apriori(df, min_support=0.3, use_colnames=True)
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.5)

# Keep only the rules with lift greater than 1
useful_rules = rules[rules['lift'] > 1]
print("Useful association rules (lift > 1):")
print(useful_rules[['antecedents', 'consequents', 'lift']])

Exercise 5: End-to-end market basket analysis

Combine everything covered so far into a complete market basket analysis.

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Simulated real-world basket data
transactions = [
    ['apple', 'banana', 'milk'],
    ['banana', 'bread', 'egg'],
    ['apple', 'banana', 'bread'],
    ['milk', 'bread', 'egg'],
    ['apple', 'milk', 'bread'],
    ['banana', 'milk'],
    ['apple', 'banana', 'milk', 'bread'],
    ['bread', 'egg']
]

# Convert to a DataFrame: one row per transaction, one 0-1 column per item
unique_items = sorted(set(item for transaction in transactions for item in transaction))
basket_data = []
for transaction in transactions:
    row = [1 if item in transaction else 0 for item in unique_items]
    basket_data.append(row)

df = pd.DataFrame(basket_data, columns=unique_items)

# Mine frequent itemsets
frequent_itemsets = apriori(df, min_support=0.3, use_colnames=True)
print("Frequent itemsets:")
print(frequent_itemsets)

# Generate association rules
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.6)
print("\nAssociation rules:")
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

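Quiz (5 questions, 100 points in total)

Question 1 / 5

What is the formula for support in association rules?

Question 2 / 5

What is the core idea of the Apriori algorithm?

Question 3 / 5

What does a lift greater than 1 indicate?

Question 4 / 5

Which of the following is not an evaluation metric for association rules?

Question 5 / 5

What is the most typical application of market basket analysis?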

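Assignment Submission

📝 Market Basket Analysis Assignment

Complete the following market basket analysis tasks:

1. Create a basket dataset containing 10 transactions
2. Mine frequent itemsets with the Apriori algorithm
3. Generate association rules and keep those with lift greater than 1
4. Analyze the results and give business recommendations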
