
I have a dataframe with the following data, where the "user" column is a string object with the same value repeated across many rows and refers to cross-section data. In the case below, “,,0004b1b83101a31b2ebd5f8d53d840c06dc63a65,,,Android” occurred 48 times. Ideally I should not be reading "user" as a string but as an integer, so that I can load more data (I checked this with the sys.getsizeof(object) construct). So:

  1. Since my dataframe has these repeating entries in “user”, I get a memory error when I try to load it, unless I sample. I have about 55,000 unique “users”, which is a strong case for optimization.
  2. I think that if I optimize the values of the “user” column, I can fit more data in memory and do the cross-section (i.e. per-user) analysis I want to do for a LARGER population than I currently can.
  3. The data is piped from Hadoop DFS via Hive, so I could create an index before writing it out to a CSV file. (I tried this; it did not work.)

What I want to do: convert the string keys to integers, like a primary key / associative array (either iteratively, using islice, or, if possible, before read_csv builds the dataframe). Is it possible in Hive when creating this data? I tried indexing on the "user" column there, but it did not work.

The data has been MASKED for user-privacy issues.

filehandle2='/home/ekta/Desktop/BACKUP/DynamicBidding/latestData/latestData/ARI/myfile.csv'
df = pd.read_csv(filehandle2,skiprows=0,sep=',',nrows=500)

df[['user','etime']].ix[40:60]
                                                      user       etime
40  THIS_IS_A_VERY_LARGE_STRING_OBJECT_1  1393085293
41  THIS_IS_A_VERY_LARGE_STRING_OBJECT_1  1393725859
42  THIS_IS_A_VERY_LARGE_STRING_OBJECT_1  1393225177
43  THIS_IS_A_VERY_LARGE_STRING_OBJECT_1  1393132470
44  THIS_IS_A_VERY_LARGE_STRING_OBJECT_1  1393085160
45  THIS_IS_A_VERY_LARGE_STRING_OBJECT_1  1392874812
46  THIS_IS_A_VERY_LARGE_STRING_OBJECT_1  1393085358
47  THIS_IS_A_VERY_LARGE_STRING_OBJECT_1  1393085538
48  THIS_IS_A_VERY_LARGE_STRING_OBJECT_1  1393725794
49  THIS_IS_ANOTHER_VERY_LARGE_MASKED_ID_2  1393568680
50  THIS_IS_ANOTHER_VERY_LARGE_MASKED_ID_2  1393568806
51  THIS_IS_ANOTHER_VERY_LARGE_MASKED_ID_2  1393568836
52  THIS_IS_ANOTHER_VERY_LARGE_MASKED_ID_2  1393571938
53  THIS_IS_ANOTHER_VERY_LARGE_MASKED_ID_2  1393571974
54  THIS_IS_ANOTHER_VERY_LARGE_MASKED_ID_2  1393572034
55  THIS_IS_ANOTHER_VERY_LARGE_MASKED_ID_2  1393572088
56  THIS_IS_ANOTHER_VERY_LARGE_MASKED_ID_2  1393572160
57  THIS_IS_ANOTHER_VERY_LARGE_MASKED_ID_2  1393572184
58  THIS_IS_ANOTHER_VERY_LARGE_MASKED_ID_2  1393578604
59  THIS_IS_ANOTHER_VERY_LARGE_MASKED_ID_2  1393578748
60  THIS_IS_ANOTHER_VERY_LARGE_MASKED_ID_2  1393613116

>>> df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 500 entries, 0 to 499
Data columns:
user     500  non-null values
etime    500  non-null values
dtypes: int64(1), object(1)

The file behind filehandle2 has about 29 million rows, and I could read about 3 million without any optimization.

What I tried, but did not work: creating a dict whose keys are the unique “user” values and incrementing an integer counter, i.e. for every unique key I store the corresponding integer (I later convert this list to a pandas Series object, append it to the original dataframe, and drop the original string “user” column, not in place but by re-assigning). This does not work, since I still NEED to read through all the data. I haven't tried islice so far, since for the cross-section analysis, after creating the corresponding integer keys, I will NEED the whole dataframe. I plot all the data in the last step, and I have about 55,000 unique “users”.
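
Roughly, that dict-plus-counter attempt looked like the sketch below (not my exact code; 'user_id' is just an illustrative column name):

import pandas as pd

df = pd.read_csv(filehandle2, sep=',')         # this full read is exactly what runs out of memory
user_to_int = {}
int_keys = []
for u in df['user']:
    if u not in user_to_int:
        user_to_int[u] = len(user_to_int) + 1  # next free integer id for a new user
    int_keys.append(user_to_int[u])

df['user_id'] = pd.Series(int_keys, index=df.index)
df = df.drop('user', axis=1)                   # re-assign rather than drop inplace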

I am on Ubuntu 12.04 with 8 GB RAM and Python 2.7.3. Surprisingly, when I do the same thing (i.e. plain reading of the entire dataframe with 29 million records), it works on my Windows 8 machine (4 GB); I have no clue why. Though after that, the rest of the manipulations on the dataframe bring Windows down and I need to reboot.

I also saw this question, but I am unsure it applies, since I assume the OP did not have READING as the use case for indexing and assumes that the df can already be read into memory: Pandas: Use multiple columns of a dataframe as index of another.

I also tried Hive indexing following this post (it fails on REBUILD of the created index). I am on Hive 0.10: http://yogeshsachwani.blogspot.in/2012/09/people-coming-from-rdbms-background.html

EDIT :

In the meanwhile, a colleague suggested writing a simple map-reduce over the streaming lines, which basically increments a "counter" every time it sees a new "key" (i.e. "user") coming from Hive (i.e. before I read the rows into a tsv/csv file). That meets my use case perfectly, but I would appreciate it if someone has seen/experimented with a more Pythonic (and generic) way of handling the duplicated "user" column above. Note that I want to retain all columns for a "user"; I just do not need to store it as a string object (sys.getsizeof(MY_STRING_OBJECT) gives 79, as opposed to sys.getsizeof(MY_INT_OBJECT) at 12, thereby saving me 29*(10^6)*(79-12) bytes when reading into memory).

Note: I am not being precise here about the exact bytes needed (as the df would most probably be read in chunks?), but you get the point.

ekta
  • So, your machine runs out of memory at 3 million records. Even if you make your data 8 times smaller, it is still a very close call for 29 million records. – grasshopper Apr 15 '14 at 08:55

2 Answers


Why don't you create a hash for each string and save this in the dataframe instead of the actual string? And then have a dictionary that maps from the string to the hash.

def custom_hash(x, d):
    # remember the string -> hash mapping so it can be looked up later
    d[x] = hash(x)
    return d[x]

d = {}
df = pd.read_csv("file.csv", converters={"user": lambda x: custom_hash(x, d)})

Now you will have the dataframe populated with long ids (instead of strings) and a dictionary d which maps from the string id to the long id.
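
If you later need to go from the integer id back to the original string, you can invert that dictionary once the file has been read (a small usage sketch based on the d built above):

# invert the string -> hash mapping into hash -> string for reverse lookups
id_to_user = dict((v, k) for k, v in d.items())
print id_to_user[df["user"][0]]   # recover the original string for the first row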

Another optimization idea is to read the file in chunks:

Loading big CSV file with pandas
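
Combining the two ideas, here is a minimal sketch (chunk size, file name and the on-the-fly integer mapping are my assumptions) of reading in chunks while replacing "user" with a small integer as each chunk is parsed:

import pandas as pd

user_ids = {}

def to_int_id(u):
    # hand out the next integer the first time a user string is seen
    return user_ids.setdefault(u, len(user_ids) + 1)

chunks = []
for chunk in pd.read_csv("file.csv", sep=",", chunksize=500000,
                         converters={"user": to_int_id}):
    chunks.append(chunk)          # each chunk already carries integer user ids

df = pd.concat(chunks, ignore_index=True)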

grasshopper
  • I have tried the dictionary approach, where I load the "keys" from "user" and map them back to an integer counter; the problem though is that I have to be ABLE to read the whole dataframe first. Also, the hash(string) will still be 79 bytes, compared to 12 bytes for an int. On trying islice (recursive read), I needed to create a "set" and a mapping in memory (again a dict) to map back to the integer counters. I was looking for something more Pythonic. – ekta Apr 15 '14 at 10:47
  • The hash is 24 bytes on my machine (it is a long). I am not sure if the converter does the conversion on the fly, or after the whole csv has been read. – grasshopper Apr 15 '14 at 11:06

Here's what worked for me -

I translated this into a simple map-reduce job, using an "int" as the primary key and incrementing the counter whenever a new "user" row is seen in the reduce step, like I had originally wanted. It was way simpler than I had first thought. Note that "\x7C" is the pipe character ("|"), the field delimiter here, and you will need some try/except handling for corrupt rows.

A lot of folks forget try/except when dealing with raw data in an MR job, which can be disastrous if you are unsure whether your data has corrupt records / broken input lines, as here, where the split must yield exactly 3 fields.

#!/usr/bin/env python
# userid_mapper.py
# Usage: translate the string userid into its integer primary key representation

import sys

for line in sys.stdin:
    line = line.strip()
    try:
        # fields are delimited by the pipe character ("\x7C" == "|")
        user_string, dat, appcat = line.split("\x7C")
        print "%s\t%s\t%s" % (user_string, dat, appcat)
    except Exception:
        # broken/corrupt input line: emit a sentinel row instead of silently dropping it
        user_string, dat, appcat = "Unspecified", "Unspecified", "Unspecified"
        print "%s\t%s\t%s" % (user_string, dat, appcat)

I figured out that just using "pass" in the exception case, where the input might fail to split into 3 fields on line.split("\x7C"), does not always work. So emit "Unspecified" in your MR job and simply discard those rows later, in your reduce step / final interpretation step.

#!/usr/bin/env python
# userid_reducer.py

import sys

current_user = None
counter = 0

for line in sys.stdin:
    line = line.strip()
    # only handle rows that split cleanly into 3 fields
    # ("Unspecified" sentinel rows still pass through and are discarded in the final interpretation step)
    if len(line.split('\t')) == 3:
        user_string, dat, appcat = line.split('\t')

        # Hadoop streaming sorts the mapper output by key, so a change in
        # user_string means a new user: hand out the next integer id
        if current_user != user_string:
            counter += 1

        print "%d\t%s\t%s" % (counter, dat, appcat)
        current_user = user_string


## To translate to primary keys ##
/usr/lib/hadoop/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.6.0.jar \
-D mapred.reduce.tasks=10 \
-file /root/ekta/userid_mapper.py -mapper /root/ekta/userid_mapper.py \
-file /root/ekta/userid_reducer.py -reducer /root/ekta/userid_reducer.py \
-input /user/hive/warehouse/app_analysis.db/InputTable \
-output /user/hive/warehouse/app_analysis.db/OutputTable
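
Once the job output is pulled back from HDFS to a local TSV (for example with hadoop fs -getmerge; the file and column names below are assumptions), reading it into pandas with the integer key is straightforward and much lighter on memory:

import pandas as pd

# hypothetical merged copy of OutputTable, e.g. via: hadoop fs -getmerge <output dir> part-merged.tsv
df = pd.read_csv("part-merged.tsv", sep="\t", header=None,
                 names=["user_id", "dat", "appcat"])
print df.groupby("user_id").size().head()   # sanity check: rows per integer user id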
ekta