I have a dataframe in which the "user" column is a string object that repeats the same value across many rows and refers to cross-sectional data. In the sample below, the value “,,0004b1b83101a31b2ebd5f8d53d840c06dc63a65,,,Android” occurred 48 times. Ideally I should not be reading "user" as a string but as an integer, so that I can load more data (I checked this with the getsizeof(object) construct). So:
- Since my dataframe has these repeating entries in “user”, I get a memory error when I try to load the dataframe, unless I sample. I have about 55,000 unique “users”, which makes a strong case for optimization.
- I think that if I optimize these values (of the “user” column), I can read more data into memory and do the cross-sectional (i.e. per-user) analysis I want for a LARGER population than I currently can.
- The data is passed from Hadoop DFS via Hive, so I could create an index before piping it to a csv file. (Tried this, did not work.)
What I want to do: convert the string indexes to integers, like a primary key / associative array, either iteratively (using islice), or, if possible, before read_csv builds the dataframe (is this possible in Hive when I am creating this data? I tried indexing on this "user" column, but it did not work).
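To make that concrete, this is roughly the mapping I have in mind, shown on a toy frame (pd.factorize is just my guess at the cleanest pandas way to express it, and it obviously does not solve the loading problem by itself, since the strings must already be in memory):

import pandas as pd

# toy stand-in for my real data
df = pd.DataFrame({'user': ['AAA', 'AAA', 'BBB', 'AAA', 'BBB'],
                   'etime': [1, 2, 3, 4, 5]})

# factorize returns (integer codes aligned with the rows, array of unique labels)
codes, uniques = pd.factorize(df['user'])
df['user_id'] = codes            # 0, 0, 1, 0, 1
df = df.drop('user', axis=1)     # drop the heavy string column
# uniques[i] recovers the original "user" string for user_id == i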
The data has been MASKED for user-privacy issues.
filehandle2='/home/ekta/Desktop/BACKUP/DynamicBidding/latestData/latestData/ARI/myfile.csv'
df = pd.read_csv(filehandle2,skiprows=0,sep=',',nrows=500)
df[['user','etime']].ix[40:60]
user etime
40 THIS_IS_A_VERY_LARGE_STRING_OBJECT_1 1393085293
41 THIS_IS_A_VERY_LARGE_STRING_OBJECT_1 1393725859
42 THIS_IS_A_VERY_LARGE_STRING_OBJECT_1 1393225177
43 THIS_IS_A_VERY_LARGE_STRING_OBJECT_1 1393132470
44 THIS_IS_A_VERY_LARGE_STRING_OBJECT_1 1393085160
45 THIS_IS_A_VERY_LARGE_STRING_OBJECT_1 1392874812
46 THIS_IS_A_VERY_LARGE_STRING_OBJECT_1 1393085358
47 THIS_IS_A_VERY_LARGE_STRING_OBJECT_1 1393085538
48 THIS_IS_A_VERY_LARGE_STRING_OBJECT_1 1393725794
49 THIS_IS_ANOTHER_VERY_LARGE_MASKED_ID_2 1393568680
50 THIS_IS_ANOTHER_VERY_LARGE_MASKED_ID_2 1393568806
51 THIS_IS_ANOTHER_VERY_LARGE_MASKED_ID_2 1393568836
52 THIS_IS_ANOTHER_VERY_LARGE_MASKED_ID_2 1393571938
53 THIS_IS_ANOTHER_VERY_LARGE_MASKED_ID_2 1393571974
54 THIS_IS_ANOTHER_VERY_LARGE_MASKED_ID_2 1393572034
55 THIS_IS_ANOTHER_VERY_LARGE_MASKED_ID_2 1393572088
56 THIS_IS_ANOTHER_VERY_LARGE_MASKED_ID_2 1393572160
57 THIS_IS_ANOTHER_VERY_LARGE_MASKED_ID_2 1393572184
58 THIS_IS_ANOTHER_VERY_LARGE_MASKED_ID_2 1393578604
59 THIS_IS_ANOTHER_VERY_LARGE_MASKED_ID_2 1393578748
60 THIS_IS_ANOTHER_VERY_LARGE_MASKED_ID_2 1393613116
>>> df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 500 entries, 0 to 499
Data columns:
user 500 non-null values
etime 500 non-null values
dtypes: int64(1), object(1)
I have about 29 million rows in filehandle2, and I could only read about 3 million without any optimization.
What I tried, but did not work: creating a dict keyed on the “unique” values and incrementing an “integer” counter, i.e. for every unique key I store a list of corresponding integers (I later convert this list to a pandas Series object, append it to the original dataframe, and also drop (not in place, but by re-assigning) the original “string” user column). This does not work, since I still NEED to read through all the data first. I haven't tried islice so far, since for the cross-sectional analysis, after creating the corresponding integer keys, I will NEED the whole dataframe. I plot all the data in the last step, and I have about 55,000 unique “users”.
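For reference, this is roughly what that attempt looks like when rewritten to read the csv in chunks, so that the full set of string values never has to sit in memory at once (the chunksize value and the output handling are my assumptions; I have not benchmarked this against the full 29 million rows):

import pandas as pd

user_to_id = {}          # string "user" -> small integer key
pieces = []

# read the big csv in manageable chunks instead of one shot
for chunk in pd.read_csv(filehandle2, sep=',', chunksize=500000):
    # map each "user" string to an integer, assigning a new id on first sight
    chunk['user_id'] = [user_to_id.setdefault(u, len(user_to_id))
                        for u in chunk['user']]
    pieces.append(chunk.drop('user', axis=1))   # keep only the int key

df = pd.concat(pieces, ignore_index=True)       # ~55,000 ids instead of strings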
I am on Ubuntu 12.04, 8 GB RAM, Python 2.7.3. Surprisingly, when I do the same thing (i.e. plain reading of the entire dataframe with 29 million records), it works on my Windows 8 machine (4 GB), and I have no clue why. Though after that, the rest of the manipulations on the dataframe bring down my Windows machine and I need to reboot.
I also saw the question Pandas: Use multiple columns of a dataframe as index of another, but I am unsure it applies, since I assume that OP did not have READING as the use case for indexing, and it assumes that the df can already be read into memory.
I also tried Hive indexing from here: http://yogeshsachwani.blogspot.in/2012/09/people-coming-from-rdbms-background.html (it fails on REBUILD of the created index). I am on Hive 0.10.
EDIT:
In the meanwhile, a colleague suggested writing a simple map-reduce job over the streaming lines coming from Hive (i.e. before I read them into a tsv/csv file), which basically increments a "counter" every time it sees a new "key" (i.e. "user"). That meets my use case perfectly, but I would appreciate it if someone has seen/experimented with a more Pythonic (and generic) way of handling the duplicated "user" column as above. Note that I want to retain all columns for a "user"; I just need not store it as a string object (sys.getsizeof(MY_STRING_OBJECT) gives 79, as opposed to 12 for sys.getsizeof(MY_INT_OBJECT), thereby saving me about 29*(10^6)*(79-12) bytes when reading into memory).
Note: I am not being too precise here about the exact bytes needed (as the df would most probably be read in chunks?), but you get the point.
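For completeness, my colleague's suggestion would look something like the sketch below: a plain streaming-style filter that rewrites the "user" field to an integer before the data ever reaches read_csv. The script name, field separator, column position of "user" and the single-process assumption are all mine; with several parallel mappers the counter would not be globally unique, so I would run it as one stream, e.g. hive -e "SELECT ..." | python remap_user.py > myfile.tsv.

# remap_user.py -- hypothetical streaming filter, not the actual job
import sys

USER_COL = 0            # assumed position of the "user" field
SEP = '\t'              # assuming tab-separated output from the hive CLI

user_to_id = {}         # "user" string -> integer surrogate key

for line in sys.stdin:
    fields = line.rstrip('\n').split(SEP)
    user = fields[USER_COL]
    # assign the next integer the first time a "user" is seen
    fields[USER_COL] = str(user_to_id.setdefault(user, len(user_to_id)))
    sys.stdout.write(SEP.join(fields) + '\n')

If the original strings are ever needed again, the user_to_id dict could be dumped to a small side file at the end as a lookup table.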