Finding duplicate files via hashlib?

Author: 多情的公子
Source: 51数据库
2022-10-27

Problem description

I know that this question has been asked before, and I've seen some of the answers, but this question is more about my code and the best way of accomplishing this task.

I want to scan a directory and see if there are any duplicates (by checking MD5 hashes) in that directory. The following is my code:

import sys
import os
import hashlib

fileSliceLimitation = 5000000 #bytes

# if the file is big, slice trick to avoid to load the whole file into RAM
def getFileHashMD5(filename):
     retval = 0;
     filesize = os.path.getsize(filename)

     if filesize > fileSliceLimitation:
        with open(filename, 'rb') as fh:
          m = hashlib.md5()
          while True:
            data = fh.read(8192)
            if not data:
                break
            m.update(data)
          retval = m.hexdigest()

     else:
        retval = hashlib.md5(open(filename, 'rb').read()).hexdigest()

     return retval

searchdirpath = raw_input("Type directory you wish to search: ")
print ""
print ""    
text_file = open('outPut.txt', 'w')

for dirname, dirnames, filenames in os.walk(searchdirpath):
    # print path to all filenames.
    for filename in filenames:
        fullname = os.path.join(dirname, filename)
        h_md5 = getFileHashMD5 (fullname)
        print h_md5 + " " + fullname
        text_file.write("\n" + h_md5 + " " + fullname)

# close txt file
text_file.close()


print "\n\n\nReading outPut:"
text_file = open('outPut.txt', 'r')

myListOfHashes = text_file.read()

if h_md5 in myListOfHashes:
    print 'Match: ' + " " + fullname

This gives me the following output:

Please type in directory you wish to search using above syntax: /Users/bubble/Desktop/aF

033808bb457f622b05096c2f7699857v /Users/bubble/Desktop/aF/.DS_Store
409d8c1727960fddb7c8b915a76ebd35 /Users/bubble/Desktop/aF/script copy.py
409d8c1727960fddb7c8b915a76ebd25 /Users/bubble/Desktop/aF/script.py
e9289295caefef66eaf3a4dffc4fe11c /Users/bubble/Desktop/aF/simpsons.mov

Reading outPut:
Match:  /Users/bubble/Desktop/aF/simpsons.mov

My idea is:

1) Scan directory
2) Write MD5 hashes + filename to a text file
3) Open the text file as read-only
4) Scan the directory AGAIN and check against the text file...

I see that this isn't a good way of doing it AND it doesn't work. The 'match' just prints out the very last file that was processed.

How can I get this script to actually find duplicates? Can someone tell me a better/easier way of accomplishing this task?

Thank you very much for any help. Sorry this is a long post.

Recommended answer

The obvious tool for identifying duplicates is a hash table. Unless you are working with a very large number of files, you could do something like this:

from collections import defaultdict

file_dict = defaultdict(list)
for filename in files:
    file_dict[get_file_hash(filename)].append(filename)

At the end of this process, file_dict will contain a list for every unique hash; when two files have the same hash, they'll both appear in the list for that hash. Then filter the dict for value lists longer than 1, and compare the files in each list to make sure they really are the same, something like this:

for duplicates in file_dict.values():   # file_dict.itervalues() in Python 2
    if len(duplicates) > 1:
        # double-check reported duplicates and generate output
        print(duplicates)

Or this:

duplicates = [files for files in file_dict.values() if len(files) > 1]
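Putting the pieces above together, a minimal Python 3 sketch might look like the following. The names find_duplicates and chunk_size are illustrative and not from the original answer, and filecmp.cmp with shallow=False is just one way to do the "double-check" step; each candidate is compared only against the first file in its group, which is a simplification.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def get_file_hash(path, chunk_size=8192):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns b"" at EOF
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def find_duplicates(root):
    """Group files under root by hash, then confirm them byte for byte."""
    file_dict = defaultdict(list)
    for dirname, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            fullname = os.path.join(dirname, filename)
            file_dict[get_file_hash(fullname)].append(fullname)

    confirmed = []
    for candidates in file_dict.values():
        if len(candidates) < 2:
            continue
        first = candidates[0]
        # Simplification: compare every candidate against the first one only
        same = [first] + [p for p in candidates[1:]
                          if filecmp.cmp(first, p, shallow=False)]
        if len(same) > 1:
            confirmed.append(sorted(same))
    return confirmed
```

Calling find_duplicates on a directory path returns a list of groups, each group being a sorted list of paths whose contents are identical.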

get_file_hash could use MD5s; or it could simply get the first and last bytes of the file, as Ramchandra Apte suggested in the comments above; or it could simply use file sizes, as tdelaney suggested in the comments above. Each of the latter two strategies is more likely to produce false positives, though. You could combine them to reduce the false-positive rate.
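One way to combine these strategies, sketched here under assumptions, is to apply the cheap keys first and only compute MD5 for files that still collide. The helper names (size_key, ends_key, md5_key, refine, candidate_duplicates) and the choice of 1024 bytes for the head and tail are hypothetical, not something the answer specifies.

```python
import hashlib
import os
from collections import defaultdict


def size_key(path):
    """Cheapest key: the file size in bytes."""
    return os.path.getsize(path)


def ends_key(path, n=1024):
    """First and last n bytes of the file (n is an arbitrary choice)."""
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        head = fh.read(n)
        fh.seek(max(size - n, 0))
        tail = fh.read(n)
    return head, tail


def md5_key(path, chunk_size=8192):
    """Most expensive key: MD5 of the whole file, read in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def refine(groups, key):
    """Split each group of paths by key, keeping only groups of 2 or more."""
    refined = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            buckets[key(path)].append(path)
        refined.extend(b for b in buckets.values() if len(b) > 1)
    return refined


def candidate_duplicates(paths):
    """Narrow the candidates with progressively more expensive keys."""
    groups = [list(paths)]
    for key in (size_key, ends_key, md5_key):
        groups = refine(groups, key)
    return groups
```

Files with a unique size never get opened at all, which is where most of the savings would come from on a typical directory.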

If you're working with a very large number of files, you could use a more sophisticated data structure like a Bloom filter.
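The answer does not say how a Bloom filter would be wired in, so the following is only one possible reading: keep a fixed-size bit array of hashes seen so far and flag any hash the filter reports as possibly seen. The BloomFilter class, its default sizes, and possible_repeats are all illustrative assumptions. A Bloom filter can return false positives but never false negatives, so flagged files would still need a real comparison.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over byte strings (sizes are arbitrary defaults)."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions from one SHA-256 digest
        # using the double-hashing scheme h1 + i * h2.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def possible_repeats(digests):
    """Yield each digest the filter reports as possibly seen before."""
    seen = BloomFilter()
    for digest in digests:
        if digest in seen:
            yield digest
        else:
            seen.add(digest)
```

Here digests would be the per-file hashes as bytes, for example hexdigest strings encoded with .encode(). Memory use stays fixed at num_bits regardless of how many files are scanned, at the cost of occasional false alarms.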