
I am using Anaconda, Spark 1.3, and Hadoop. I have stored a list of XML documents in a particular directory in HDFS.

I have to load those XML documents with a Python script and find the duplicate documents using Spark.

Example:

import os
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("Sample").setMaster("local[*]")
sc = SparkContext(conf=conf)
dir = sc.textFile("hdfs://XXXXXXX")
configfiles = [os.path.join(dirpath, f) for dirpath, dirnames, files in os.walk(dir) for f in files if f.endswith('.xml')]

With this, I get the following error:

TypeError: coercing to Unicode: need string or buffer, RDD found

hdfs://xxxxxx MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:-2

I have used a Bloom filter to find the duplicates by generating hash values. That is not the problem here.
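
For illustration only (this sketch is not from the question): a minimal sketch of the Bloom-filter idea of flagging hash values that have probably been seen before. The class name, bit-array size, and salted-MD5 hashing scheme are made up for the example.

import hashlib

class SimpleBloomFilter(object):
    def __init__(self, size=8192, num_hashes=4):
        self.size = size            # number of bits in the filter
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive num_hashes bit positions from salted MD5 digests of the item.
        for salt in range(self.num_hashes):
            digest = hashlib.md5((str(salt) + item).encode('utf-8')).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        # Return True if the item was probably seen before (false positives are possible).
        positions = list(self._positions(item))
        seen_before = all(self.bits[p] for p in positions)
        for p in positions:
            self.bits[p] = True
        return seen_before

bloom = SimpleBloomFilter()
for doc_hash in ["aaa", "bbb", "aaa"]:
    if bloom.add(doc_hash):
        print(doc_hash + " is probably a duplicate")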

Accessing locally stored documents works, but I am not able to process the documents stored in HDFS.

Could anyone please help me to fix this issue?

Thanks in advance

sara

1 Answer


That error means something is trying to coerce a non-string type (a Spark RDD) to a string. If you read the docs, you'll see that sc.textFile returns an RDD, so you can't just pass it to os.walk, which expects a file path. You can try calling collect on the RDD to get a list you can iterate over and pass to os.walk:

import hashlib
import os
from collections import defaultdict

# from http://stackoverflow.com/a/3431835/301807
def hashfile(afile, hasher, blocksize=65536):
    buf = afile.read(blocksize)
    while len(buf) > 0:
        hasher.update(buf)
        buf = afile.read(blocksize)
    return hasher.digest()

dirs = sc.textFile("hdfs://XXXXXXX").collect()  # returns a list, not an RDD
configfiles = defaultdict(list)
for dir in dirs:  # for each directory in the list from Spark
    for dirpath, dirnames, files in os.walk(dir):  # call os.walk with the directory
        for f in files:  # iterate over the files in the directory
            if f.endswith('.xml'):  # keep only the .xml files
                path = os.path.join(dirpath, f)  # get the file's full path
                digest = hashfile(open(path, 'rb'), hashlib.md5())  # hash the contents
                configfiles[digest].append(path)

And you'll end up with a dictionary mapping MD5 sums to file paths. Any MD5 sum with more than one file path indicates that the files are duplicates. This should print just the duplicates:

for md5, paths in configfiles.items():
    if len(paths) > 1:
        print("The following files are duplicates of each other: '" + "', '".join(paths) + "'")
dtanders
  • Thanks, I have got all the files by using collect() – sara Dec 19 '15 at 06:52
  • Actually I have 10 XML documents in a folder in HDFS. I have to process all 10 files to find the duplicate documents. Using your code, I have got the content of a single file (print dir). Then how am I able to get the file name, to mark it as unique or duplicate? – sara Dec 19 '15 at 07:34
  • No, I have got all the file contents of an HDFS directory, but how do I get a file name for each content? The dir has the content only – sara Dec 19 '15 at 07:39
  • No. I just process one directory: /directory/file1.xml, file2.xml, ..., filen.xml. The directory has n .xml documents. Among these, I have to find which documents are duplicates. But your answer provides the whole XML contents of the directory. – sara Dec 21 '15 at 14:42
  • No, I still get only the file contents by printing the 'dir' variable. My question is about "/folder", which has the list of files (file1, file2, file3, ...) under HDFS. I have to give "hdfs://127.0.0.9000/folder" (the path which has all the files) in the Python script. From this I am generating a hash value for each file to find the duplicate one, but using this code I got all the files' content by printing the dir variable. The dir should have the 'folder' name here. The files variable should have the "list of files", but iterating over this, "FILE" has the file1, file2, ... values. – sara Dec 23 '15 at 05:37
  • Hi dtanders, could you please tell me the way to get the files from the directory to process? – sara Dec 24 '15 at 09:00
  • I just want to get the HDFS directory's files. If the /data directory has 100 documents, I just want to read all the files to find the duplicates among the 100 documents. Using your script, I have got only the contents of all 100 documents – sara Jan 07 '16 at 09:32
  • @sara my last edit should read all the files and hash them, and I've included an example of how to print the duplicates by path. It seems like this is what you're asking for when you say "I just want to read all the files to find the duplicates among the 100 documents". The rest of that comment either doesn't make sense to me or is too vague. – dtanders Jan 08 '16 at 14:53
  • @sara: Are you using a hash function as below? >>> import hashlib >>> hashlib.md5("filename").hexdigest() – Ravindra babu Jan 10 '16 at 07:27
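
Note (this is not part of the original answer): the comments above keep circling the fact that sc.textFile returns file contents rather than file paths. One possible Spark-side sketch for the same goal uses SparkContext.wholeTextFiles, which yields (path, content) pairs for every file under a directory, so each file's content can be hashed alongside its path. The HDFS URL and app name below are placeholders.

import hashlib
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("FindDuplicateXml").setMaster("local[*]")
sc = SparkContext(conf=conf)

# wholeTextFiles returns an RDD of (path, content) pairs for the files in the directory.
files = sc.wholeTextFiles("hdfs://XXXXXXX")

duplicates = (files
    .filter(lambda kv: kv[0].endswith('.xml'))  # keep only the .xml files
    .map(lambda kv: (hashlib.md5(kv[1].encode('utf-8')).hexdigest(), kv[0]))  # (content hash, path)
    .groupByKey()  # group the paths that share a content hash
    .mapValues(list)
    .filter(lambda kv: len(kv[1]) > 1)  # keep hashes seen for more than one path
    .collect())

for digest, paths in duplicates:
    print("The following files are duplicates of each other: '" + "', '".join(paths) + "'")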