Data Mining: The Apriori Algorithm Explained, with a Python Implementation

1718 views  |  Published 5 years ago

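Association rule mining is one of the most active research areas in data mining. It is used to discover relationships between things, and it was first applied to finding relationships between different products in supermarket transaction databases (the classic example is beer and diapers).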

Basic concepts

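1. Support: support(X-->Y) = |X ∩ Y| / N = (number of records in which the items of X and the items of Y appear together) / (total number of records). Here |X ∩ Y| denotes the number of records that contain every item of X and every item of Y. For example: support({beer}-->{diapers}) = (number of records containing both beer and diapers) / (number of records) = 3/5 = 60%.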

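2. Confidence: confidence(X-->Y) = |X ∩ Y| / |X| = (number of records in which the items of X and the items of Y appear together) / (number of records in which X appears). For example: confidence({beer}-->{diapers}) = (number of records containing both beer and diapers) / (number of records containing beer) = 3/3 = 100%; confidence({diapers}-->{beer}) = (number of records containing both beer and diapers) / (number of records containing diapers) = 3/4 = 75%.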
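The two definitions above can be checked with a few lines of Python. This is only a minimal sketch: the `transactions` list is a hypothetical five-record dataset chosen so that its counts match the figures quoted in the text (beer in 3 records, diapers in 4, both in 3), and the function names `count`, `support` and `confidence` are illustrative, not part of the article's code.

```python
# Hypothetical toy data: each record is the set of items bought together.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "diapers"},
    {"milk", "bread", "diapers", "beer"},
    {"milk", "bread", "diapers", "beer", "eggs"},
    {"diapers", "beer", "cola"},
]


def count(itemset, transactions):
    """Number of records that contain every item in `itemset`."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= t)


def support(x, y, transactions):
    """support(X-->Y): share of records containing all items of X and Y."""
    return count(set(x) | set(y), transactions) / len(transactions)


def confidence(x, y, transactions):
    """confidence(X-->Y): of the records containing X, the share that also contain Y."""
    base = count(x, transactions)
    return count(set(x) | set(y), transactions) / base if base else 0.0


sup_beer_diapers = support({"beer"}, {"diapers"}, transactions)      # 3/5
conf_beer_diapers = confidence({"beer"}, {"diapers"}, transactions)  # 3/3
conf_diapers_beer = confidence({"diapers"}, {"beer"}, transactions)  # 3/4
print(sup_beer_diapers, conf_beer_diapers, conf_diapers_beer)
```

Support is symmetric in X and Y, while confidence is not, which is why the two directions of the beer and diapers rule give different values.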

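A rule that satisfies both the minimum support threshold (min_sup) and the minimum confidence threshold (min_conf) is called a strong rule. An itemset that satisfies the minimum support is called a frequent itemset.

"How are association rules mined from a large database?" Association rule mining is a two-step process:

1. Find all frequent itemsets: by definition, these itemsets occur at least as often as the predefined minimum support count.
2. Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy both the minimum support and the minimum confidence.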
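The two steps can be sketched as two separate functions. The sketch below makes several assumptions: the `transactions` data is a hypothetical toy dataset, the thresholds (a minimum support count of 3 and a minimum confidence of 0.7) are arbitrary, and step 1 is done by brute-force enumeration of every possible itemset rather than by Apriori, purely to keep the example short.

```python
from itertools import combinations

# Hypothetical toy data, not taken from the article.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "diapers"},
    {"milk", "bread", "diapers", "beer"},
    {"milk", "bread", "diapers", "beer", "eggs"},
    {"diapers", "beer", "cola"},
]


def frequent_itemsets_bruteforce(transactions, min_count):
    """Step 1 (naive): count every possible itemset and keep the frequent ones."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            n = sum(1 for t in transactions if set(combo) <= t)
            if n >= min_count:
                frequent[frozenset(combo)] = n
    return frequent


def strong_rules(frequent, n_records, min_conf):
    """Step 2: split each frequent itemset into left --> right and keep strong rules."""
    rules = []
    for itemset, n in frequent.items():
        for size in range(1, len(itemset)):
            for left in combinations(sorted(itemset), size):
                left = frozenset(left)
                # Every subset of a frequent itemset is itself frequent,
                # so its count is already available in `frequent`.
                conf = n / frequent[left]
                if conf >= min_conf:
                    rules.append((set(left), set(itemset - left), n / n_records, conf))
    return rules


frequent = frequent_itemsets_bruteforce(transactions, min_count=3)
rules = strong_rules(frequent, len(transactions), min_conf=0.7)
for left, right, sup, conf in rules:
    print(sorted(left), "-->", sorted(right), "support=%.2f confidence=%.2f" % (sup, conf))
```

Because every rule is built from an itemset that already passed the support threshold, step 2 only has to check confidence.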

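The Apriori principles

To reduce the time spent generating frequent itemsets, we should eliminate as early as possible any set that cannot possibly be frequent. That is exactly what the two Apriori principles are for.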

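Apriori principle 1: if a set is a frequent itemset, then all of its subsets are frequent itemsets. Example: suppose the set {A,B} is a frequent itemset, that is, the number of records in which A and B appear together is greater than or equal to the minimum support min_support. Every record that contains both A and B also contains A and contains B, so the counts of its subsets {A} and {B} must also be greater than or equal to min_support, which means all of its subsets are frequent itemsets.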
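Principle 1 can be observed directly on a small dataset. The following is a sketch on hypothetical toy data with an assumed minimum support count of 3; the helper names are illustrative only.

```python
from itertools import combinations

# Hypothetical toy data, not taken from the article.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "diapers"},
    {"milk", "bread", "diapers", "beer"},
    {"milk", "bread", "diapers", "beer", "eggs"},
    {"diapers", "beer", "cola"},
]


def count(itemset, transactions):
    """Number of records that contain every item in `itemset`."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= t)


def nonempty_proper_subsets(itemset):
    """Yield every subset of `itemset` except the empty set and the set itself."""
    items = sorted(itemset)
    for size in range(1, len(items)):
        for sub in combinations(items, size):
            yield frozenset(sub)


min_count = 3  # assumed threshold: 60% of 5 records
triple = frozenset({"milk", "bread", "diapers"})
triple_count = count(triple, transactions)

# Each subset occurs in at least as many records as the full itemset.
subset_counts = {sub: count(sub, transactions) for sub in nonempty_proper_subsets(triple)}
all_subsets_frequent = all(n >= triple_count >= min_count for n in subset_counts.values())
print(triple_count, all_subsets_frequent)
```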

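Apriori principle 2: if a set is not a frequent itemset, then none of its supersets is a frequent itemset. Example: suppose the set {A} is not a frequent itemset, that is, A appears fewer than min_support times. Any superset such as {A,B} can appear only in records that contain A, so it must also appear fewer than min_support times, and therefore no superset can be a frequent itemset.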
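Principle 2 is what makes pruning possible: once an itemset is known to be infrequent, any candidate containing it can be discarded without scanning the data. A minimal sketch, again on hypothetical toy data with an assumed minimum support count of 3 (the name `can_prune` is illustrative):

```python
# Hypothetical toy data, not taken from the article.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "diapers"},
    {"milk", "bread", "diapers", "beer"},
    {"milk", "bread", "diapers", "beer", "eggs"},
    {"diapers", "beer", "cola"},
]


def count(itemset, transactions):
    """Number of records that contain every item in `itemset`."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= t)


def can_prune(candidate, known_infrequent):
    """True if the candidate contains an itemset already known to be infrequent."""
    return any(bad <= candidate for bad in known_infrequent)


min_count = 3  # assumed threshold
infrequent = frozenset({"bread", "beer"})  # appears in only 2 records here
infrequent_count = count(infrequent, transactions)

# No superset can occur more often than the infrequent itemset it contains.
supersets = [infrequent | {"milk"}, infrequent | {"diapers"}, infrequent | {"milk", "diapers"}]
superset_counts = [count(s, transactions) for s in supersets]

pruned = can_prune(frozenset({"milk", "bread", "beer"}), [infrequent])
kept = can_prune(frozenset({"milk", "bread", "diapers"}), [infrequent])
print(infrequent_count, superset_counts, pruned, kept)
```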

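The figure above illustrates the Apriori procedure. Note that when the level-3 candidate itemsets are generated from the level-2 frequent itemsets, {milk, bread, beer} is absent, because {bread, beer} is not a level-2 frequent itemset; this is the Apriori principle at work. Once the level-3 frequent itemsets have been generated there are no higher-level candidates, so the algorithm terminates, and {milk, bread, diapers} is the maximal frequent itemset.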
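The level-by-level procedure described here can be sketched compactly with sets. This is not the implementation given later in the article, only an illustrative version under stated assumptions: the `transactions` data is hypothetical (chosen so that {bread, beer} is infrequent and {milk, bread, diapers} is frequent at a minimum support count of 3), and the join step simply unions pairs of frequent itemsets from the previous level.

```python
from itertools import combinations

# Hypothetical toy data, not the data behind the article's figure.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "diapers"},
    {"milk", "bread", "diapers", "beer"},
    {"milk", "bread", "diapers", "beer", "eggs"},
    {"diapers", "beer", "cola"},
]


def apriori(transactions, min_count):
    """Return one dict per level k, mapping each frequent k-itemset to its count."""
    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = set().union(*transactions)
    level = {frozenset([i]): count(frozenset([i])) for i in items}
    level = {s: n for s, n in level.items() if n >= min_count}
    levels = []
    while level:
        levels.append(level)
        k = len(next(iter(level))) + 1
        # Join: union pairs of frequent (k-1)-itemsets that form a k-itemset.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: drop any candidate that has an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        # Count the survivors and keep those that meet the threshold.
        counted = {c: count(c) for c in candidates}
        level = {c: n for c, n in counted.items() if n >= min_count}
    return levels


levels = apriori(transactions, min_count=3)
for k, level in enumerate(levels, start=1):
    print("level", k, sorted(sorted(s) for s in level))
```

On this toy data the loop stops after level 3, since a single frequent 3-itemset cannot be joined into any 4-item candidate.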

Python implementation:

Source: taizilongxu/datamining (branch master), file apriori/apriori.py

# -*- encoding: UTF-8 -*-

#---------------------------------import------------------------------------
import itertools
#---------------------------------------------------------------------------

class Apriori(object):

    def __init__(self, filename, min_support, item_start, item_end):
        self.filename = filename
        self.min_support = min_support  # minimum support, in percent
        self.min_confidence = 50  # minimum confidence, in percent
        self.line_num = 0  # number of data rows (header row excluded)
        self.item_start = item_start  # first item column to read (1-based)
        self.item_end = item_end  # last item column to read (inclusive)

        # Itemsets are lists of column positions; start from every single item.
        # Note: these 1-itemsets are not filtered by min_support.
        self.location = [[i] for i in range(self.item_end - self.item_start + 1)]
        self.support = self.sut(self.location)
        self.num = sorted(set(j for i in self.location for j in i))  # items in use

        self.pre_support = []  # support, location and num of the previous level
        self.pre_location = []
        self.pre_num = []

        self.item_name = []  # item names
        self.find_item_name()
        self.loop()
        self.confidence_sup()

    def deal_line(self, line):
        "Extract the requested item columns from one line."
        return [i.strip() for i in line.split(' ') if i][self.item_start - 1:self.item_end]

    def find_item_name(self):
        "Read item_name from the first line of the file."
        with open(self.filename, 'r') as F:
            self.item_name = self.deal_line(F.readline())

    def sut(self, location):
        """
        Input: itemsets as position lists, e.g. [[1,2,3],[2,3,4],[1,3,5]...]
        Output: the support count of each itemset, e.g. [123,435,234...]
        """
        with open(self.filename, 'r') as F:
            support = [0] * len(location)
            for index, line in enumerate(F):
                if index == 0:
                    continue  # skip the header row
                item_line = self.deal_line(line)
                for index_num, i in enumerate(location):
                    # an itemset is present when every one of its columns is 'T'
                    if all(item_line[j] == 'T' for j in i):
                        support[index_num] += 1
            self.line_num = index  # number of rows, excluding the item_name header
        return support

    def select(self, c):
        "Return the candidate itemsets (position lists) of size c."
        stack = []
        for i in self.location:
            for j in self.num:
                if j in i:
                    if len(i) == c:
                        stack.append(i)
                else:
                    stack.append([j] + i)
        # de-duplicate the nested lists
        s = sorted(sorted(i) for i in stack)
        location = [k for k, _ in itertools.groupby(s)]
        return location

    def del_location(self, support, location):
        "Remove candidates that do not meet the conditions."
        # drop candidates below the minimum support
        # (true division, so the percentage threshold is not truncated)
        for index, i in enumerate(support):
            if i < self.line_num * self.min_support / 100.0:
                support[index] = 0
        # second Apriori principle: drop candidates with an infrequent subset
        for index, j in enumerate(location):
            sub_location = [j[:index_loc] + j[index_loc + 1:] for index_loc in range(len(j))]
            if any(k not in self.location for k in sub_location):
                support[index] = 0
        # discard the positions that were zeroed out
        location = [i for i, j in zip(location, support) if j != 0]
        support = [i for i in support if i != 0]
        return support, location

    def loop(self):
        "Iterate over the level-s frequent itemsets."
        s = 2
        while True:
            print('-' * 80)
            print('The', s - 1, 'loop')
            print('location', self.location)
            print('support', self.support)
            print('num', self.num)
            print('-' * 80)

            # generate the next level of candidates
            location = self.select(s)
            support = self.sut(location)
            support, location = self.del_location(support, location)
            num = sorted(set(j for i in location for j in i))
            s += 1
            if location and support and num:
                self.pre_num = self.num
                self.pre_location = self.location
                self.pre_support = self.support

                self.num = num
                self.location = location
                self.support = support
            else:
                break

    def confidence_sup(self):
        "Compute confidence and print the rules."
        if sum(self.pre_support) == 0:
            print('min_support error')  # failed on the first iteration
        else:
            for index_location, each_location in enumerate(self.location):
                # subsets one level down, obtained by dropping one item at a time
                del_num = [each_location[:index] + each_location[index + 1:] for index in range(len(each_location))]
                # keep only subsets that were frequent at the previous level
                del_num = [i for i in del_num if i in self.pre_location]
                # look up their support counts from the previous level
                del_support = [self.pre_support[self.pre_location.index(i)] for i in del_num]
                for index, i in enumerate(del_num):  # support and confidence of each rule
                    support = float(self.support[index_location]) / self.line_num * 100
                    s = [j for index_item, j in enumerate(self.item_name) if index_item in i]
                    if del_support[index]:
                        confidence = float(self.support[index_location]) / del_support[index] * 100
                        if confidence > self.min_confidence:
                            print(','.join(s), '->>', self.item_name[each_location[index]],
                                  ' support:', str(support) + '%',
                                  ' confidence:', str(confidence) + '%')

def main():
    c = Apriori('basket.txt', 14, 3, 13)
    d = Apriori('simple.txt', 50, 2, 6)

if __name__ == '__main__':
    main()
############################################################################
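The article does not show what basket.txt or simple.txt look like. Judging only from deal_line and sut in the listing above, the class appears to expect a space-separated text file whose first row holds the item names and whose remaining rows mark each item with 'T' when it is present. The sketch below writes and re-reads such a file; the leading id column, the 'F' marker for absent items, the file name and the helper names are all assumptions made for illustration.

```python
import os
import tempfile

# Hypothetical toy data, not taken from the article.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "diapers"},
    {"milk", "bread", "diapers", "beer"},
    {"milk", "bread", "diapers", "beer", "eggs"},
    {"diapers", "beer", "cola"},
]


def write_basket_file(path, item_names, transactions):
    """Write a header row, then one row per record: an id followed by T/F flags."""
    with open(path, "w") as f:
        f.write(" ".join(["id"] + item_names) + "\n")
        for row_id, t in enumerate(transactions, start=1):
            flags = ["T" if name in t else "F" for name in item_names]
            f.write(" ".join([str(row_id)] + flags) + "\n")


def read_columns(path, item_start, item_end):
    """Mirror deal_line: split on spaces, keep columns item_start..item_end (1-based)."""
    with open(path) as f:
        return [[c.strip() for c in line.split(" ") if c][item_start - 1:item_end]
                for line in f]


item_names = ["milk", "bread", "diapers", "beer"]
path = os.path.join(tempfile.mkdtemp(), "toy_basket.txt")  # assumed file name
write_basket_file(path, item_names, transactions)

rows = read_columns(path, 2, 5)  # columns 2..5 hold the four items
header, data = rows[0], rows[1:]
print(header)
print(data)
```

With a file laid out this way, the class would presumably be constructed as Apriori(path, 60, 2, 5), where 60 is the minimum support in percent and 2 and 5 are the first and last item columns.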