
I have a DataFrame built from the CSV content below:

NAME,VENUE_CITY_NAME,EVENT_LANGUAGE,EVENT_GENRE
satya,Pune,Hindi,|COMEDY|DRAMA|
Amit,National Capital Region,English,|ACTION|ADVENTURE|SCI-FI|
satya,Mumbai,Hindi,|COMEDY|DRAMA|
atul,Bangalore,Tamil,|DRAMA|THRILLER|
atul,Pune,Others,|SPORTS|
alex,Hyderabad,Telugu,|ACTION|ROMANCE|THRILLER|
satya,Bangalore,Malayalam,|DRAMA|SUSPENSE|
dave,Hyderabad,Hindi,|COMEDY|
chris,Bangalore,Telugu,|ACTION|ROMANCE|THRILLER|
satya,Pune,Others,|SPORTS|
dave,Kanpur,Hindi,|COMEDY|DRAMA|
alex,Bangalore,Telugu,|COMEDY|ROMANCE|
amit,Bangalore,Telugu,|ACTION|ROMANCE|THRILLER|
atul,Chennai,Tamil,|COMEDY|ROMANCE|
dave,Bangalore,Telugu,|ACTION|ROMANCE|THRILLER|
alex,Pune,Others,|SPORTS|
chris,Hyderabad,Telugu,|DRAMA|ROMANCE|
satya,National Capital Region,Hindi,|ACTION|COMEDY|
dave,Pune,Others,|SPORTS|
amit,National Capital Region,Others,|SPORTS|

I have to filter the DataFrame level by level (with multiple nodes per level), and I also have to use multiprocessing:

  • LEVEL-1: Filter by city (possibly several cities, each in its own root node)

  • LEVEL-2: Then filter that DataFrame by language (multiple child nodes)

  • LEVEL-3: Filter by genre value

I admit this could be done procedurally, filtering step by step.

However, my actual DataFrame is huge, and I was asked to consider memory management (hence multiprocessing/queueing), reduce processing time, and keep the script dynamic and generic (hence classes and objects), among many other challenges.

So I want to filter the main DataFrame at the first level. There can be many cities to filter on, so there are multiple nodes, which should be handled by multiprocessing.

Then, at the second level, two or more sub/child nodes can be found based on the language filter condition. After filtering, I need to drop the level-1 DataFrame.

Level 3 should work the same way as level 2, and the resulting DataFrames should be returned to a base process through a queueing mechanism.
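To make the three levels concrete, here is a minimal sequential sketch of the tree described above. It is only an illustration under assumptions: the function name `split_by_levels`, the sample rows, and the filter values are hypothetical, and it runs in a single process. Each leaf DataFrame is kept separately in a dictionary; in a multiprocessing setup, each leaf could instead be put on a queue by a worker.

```python
import pandas as pd

# A few rows in the same shape as the CSV above (subset chosen for illustration).
df = pd.DataFrame(
    [
        ("satya", "Pune", "Hindi", "|COMEDY|DRAMA|"),
        ("atul", "Pune", "Others", "|SPORTS|"),
        ("atul", "Bangalore", "Tamil", "|DRAMA|THRILLER|"),
        ("satya", "Bangalore", "Malayalam", "|DRAMA|SUSPENSE|"),
        ("alex", "Bangalore", "Telugu", "|COMEDY|ROMANCE|"),
        ("dave", "Hyderabad", "Hindi", "|COMEDY|"),
    ],
    columns=["NAME", "VENUE_CITY_NAME", "EVENT_LANGUAGE", "EVENT_GENRE"],
)


def split_by_levels(frame, cities, languages, genre):
    """Return {(city, language): DataFrame} for every non-empty leaf node."""
    leaves = {}
    # Level 1: one node per requested city.
    level1 = frame[frame["VENUE_CITY_NAME"].isin(cities)]
    for city, city_df in level1.groupby("VENUE_CITY_NAME"):
        # Level 2: one child node per requested language within that city.
        level2 = city_df[city_df["EVENT_LANGUAGE"].isin(languages)]
        for language, lang_df in level2.groupby("EVENT_LANGUAGE"):
            # Level 3: keep rows whose pipe-delimited genre field contains the genre.
            has_genre = lang_df["EVENT_GENRE"].str.contains(f"|{genre}|", regex=False)
            leaf = lang_df[has_genre]
            if not leaf.empty:
                leaves[(city, language)] = leaf
    return leaves


leaves = split_by_levels(df, ["Pune", "Bangalore"], ["Hindi", "Tamil", "Telugu"], "DRAMA")
for key, leaf in leaves.items():
    print(key, leaf["NAME"].tolist())
```

Because each level only holds a filtered slice of its parent, the intermediate frames go out of scope as the loops advance, which is roughly the "drop the parent after filtering" behaviour asked for.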

Stefan
Satya

1 Answer

If the file is very large, you may be best off reading it in chunks (using the chunksize parameter of read_csv()) and processing each chunk in turn, as outlined in the IO docs and mentioned here, and here from a multiprocessing perspective.
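As a rough sketch of the chunked approach, assuming a city filter and a tiny in-memory stand-in for the real file (the helper name `filter_in_chunks`, the sample rows, and the chunk size are illustrative only):

```python
import io

import pandas as pd

# Small stand-in for the large file; in practice pass the file path to read_csv.
csv_text = """NAME,VENUE_CITY_NAME,EVENT_LANGUAGE,EVENT_GENRE
satya,Pune,Hindi,|COMEDY|DRAMA|
Amit,National Capital Region,English,|ACTION|ADVENTURE|SCI-FI|
satya,Mumbai,Hindi,|COMEDY|DRAMA|
atul,Bangalore,Tamil,|DRAMA|THRILLER|
atul,Pune,Others,|SPORTS|
dave,Hyderabad,Hindi,|COMEDY|
"""

cities = ["Pune", "Mumbai"]  # hypothetical filter values


def filter_in_chunks(source, cities, chunksize=2):
    """Read the CSV chunk by chunk, keeping only rows for the wanted cities."""
    kept = []
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows.
    for chunk in pd.read_csv(source, chunksize=chunksize):
        kept.append(chunk[chunk["VENUE_CITY_NAME"].isin(cities)])
    return pd.concat(kept, ignore_index=True)


filtered = filter_in_chunks(io.StringIO(csv_text), cities)
print(filtered)
```

Only one chunk plus the rows already kept are in memory at a time, so peak memory depends on the chunk size and on how selective the filter is, not on the size of the file.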

To combine the various filters, you could then use the following (as described here):

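# Placeholder filter values; replace with the real ones
cities = ['city1', 'city2', ...]
languages = ['language1', 'language2', ...]
genres = ['genre1', 'genre2', ...]
# Combine the three boolean masks with &; each condition sits in its own parentheses
df = df[(df.VENUE_CITY_NAME.isin(cities)) & (df.EVENT_LANGUAGE.isin(languages)) & (df.EVENT_GENRE.isin(genres))]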

Things are of course slightly different if you need to parse the genre column for particular genres, since each row can apparently hold multiple values.
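One possible way to handle the multi-valued genre column, sketched under the assumption that values are always pipe-delimited as in the sample data (the small DataFrame and the wanted genres here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "NAME": ["satya", "atul", "dave", "alex"],
    "EVENT_GENRE": ["|COMEDY|DRAMA|", "|DRAMA|THRILLER|", "|COMEDY|", "|SPORTS|"],
})
genres = ["DRAMA", "SPORTS"]  # hypothetical genres to keep

# Turn "|COMEDY|DRAMA|" into ["COMEDY", "DRAMA"].
genre_lists = df["EVENT_GENRE"].str.strip("|").str.split("|")

# Keep a row if at least one of its genres is in the wanted list.
wanted = set(genres)
mask = genre_lists.apply(lambda values: bool(wanted & set(values)))
matched = df[mask]
print(matched)
```

Here plain isin() on EVENT_GENRE would match nothing, because the column holds whole strings such as "|COMEDY|DRAMA|" rather than single genre names.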

Stefan
  • Thanks Stefan, but I meant to do that in a class object method, and I want all DataFrames filtered at the last level (level 3 in this case) to be sent back to my main process individually (not appended into one). It would be a great help if you could suggest how to filter level by level (there may be multiple nodes at the first level too), then go down one level, filter, and on reaching the last level store the results separately. The approach should be tree-like (not strictly binary, as one parent node may have 3 or 4 child nodes). Thanks for the helpful links. – Satya Dec 04 '15 at 16:49