
I posted a similar question about an hour ago, but have since deleted it after realising I was asking the wrong question. I have the following pickled defaultdict:

ccollections
defaultdict
p0
(c__builtin__
list
p1
tp2
Rp3
V"I love that"
p4
(lp5
S'05-Aug-13 10:17'
p6
aS'05-Aug-13 10:17'

When using Hadoop, the input is always read in using:

for line in sys.stdin:

I tried reading the pickled defaultdict using this:

myDict = pickle.load(sys.stdin)
for text, date in myDict.iteritems():

But to no avail. The rest of the code works: I tested it locally by loading the pickle from 'filename.txt'. Am I doing this wrong? How can I load the information?

Update:

After following an online tutorial, I amended my code to this:

def read_input(file):
    for line in file:
        print line

def main(separator='\t'):
    myDict = read_input(sys.stdin)

This prints out each line, showing that the file is being read successfully; however, no semblance of the defaultdict structure is kept. The output looks like this:

p769    

aS'05-Aug-13 10:19' 

p770    

aS'05-Aug-13 15:19' 

p771    

as"I love that" 

Obviously this is no good. Does anybody have any suggestions?

Andrew Martin

2 Answers


Why is your input data in the pickle format? Where does your input data come from? One of the goals of Hadoop/MapReduce is to process data that's too large to fit into the memory of a single machine. Thus, reading the whole input data and then trying to deserialize it runs contrary to the MR design paradigm and most likely won't even work with production-scale data sets.

The solution is to format your input data as, for example, a TSV text file with exactly one tuple of your dictionary per row. You can then process each tuple on its own, e.g.:

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    key, value = process(fields)  # process() and emit() stand in for your own logic
    emit(key, value)
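
For completeness, here is a minimal sketch of writing the dictionary out in that shape. It assumes the defaultdict maps tweet text to a list of timestamp strings, as the pickle in the question suggests; the file name and the sample data are purely illustrative:

import csv
from collections import defaultdict

# Illustrative data shaped like the defaultdict in the question:
# tweet text -> list of timestamp strings.
tweets = defaultdict(list)
tweets["I love that"].append("05-Aug-13 10:17")
tweets["I love that"].append("05-Aug-13 10:19")

# Write one (tweet, timestamp) pair per row, so each line on stdin is self-contained.
with open("tweets.tsv", "wb") as f:
    writer = csv.writer(f, delimiter="\t")
    for text, dates in tweets.iteritems():
        for date in dates:
            writer.writerow([text, date])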
jkovacs
  • In another script, the defaultdict was written to file with a series of keys (tweets) mapped to values (times they were posted). This was done so retweets could have multiple times indicating when they were tweeted, as opposed to a regular dictionary which could only have one value per key. Pickle was used just to save this defaultdict to file. – Andrew Martin Sep 02 '13 at 21:38
  • I know how to use writerow to write to a csv file, but what I'm trying to say is I don't know how to write a defaultdict to file that way – Andrew Martin Sep 02 '13 at 21:44
  • @AndrewMartin The answer is still valid then: don't use Pickle or you won't be able to sensibly process the data with Hadoop. See [here](http://stackoverflow.com/questions/8685809/python-writing-a-dictionary-to-a-csv-file-with-one-line-for-every-key-value) for an example of how to write a dict to a csv file. – jkovacs Sep 02 '13 at 21:57
  • Thanks for that link. I think I've nearly got it, but I'm still confused about how to iterate the dictionary. My contents are still in a defaultdict which is good, but when I use reader=csv.reader(sys.stdin) and myDict = dict(x for x in reader), I am able to create a dictionary, but I can't seem to iterate it with iteritems() – Andrew Martin Sep 02 '13 at 22:08
  • @AndrewMartin defaultdict extends dict, thus you're able to use it exactly the same way as a standard dict, as shown in the answer to the linked question. But I feel that's out of the scope for this question. – jkovacs Sep 02 '13 at 22:12
  • Thanks. I think I'm nearly there, just not quite. Whilst the code works perfectly out of Hadoop, in Hadoop when I read it like this it processes it, but gets confused by the formatting. Will have to play around a bit. For example, previously I could say for p, d in myDict.iteritems(), then after that for date in d... but Hadoop complains about that – Andrew Martin Sep 02 '13 at 22:32
  • Therefore, I'm still not sure how to access each value of each key. – Andrew Martin Sep 02 '13 at 22:37

If you read in the data completely, I believe you can use pickle.loads().

myDict = pickle.loads(sys.stdin.read())
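
In a streaming mapper that could look like the sketch below. It assumes the whole pickle actually reaches a single mapper's stdin intact, which Hadoop streaming does not guarantee (it splits the input by lines across mappers, as the other answer points out), so treat it as a quick experiment rather than a production approach:

import sys
import pickle

# Read everything from stdin and unpickle it in one go.
myDict = pickle.loads(sys.stdin.read())
for text, dates in myDict.iteritems():
    for date in dates:
        print "%s\t%s" % (text, date)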
Sajjan Singh
  • Thanks for this, but unfortunately it doesn't seem to be working. I've deleted everything else from the project now apart from the load of the file and a print statement to confirm that's where the error is, so it's definitely a problem with the loading. – Andrew Martin Sep 02 '13 at 20:53