This might turn out to be a very silly question, but as a newbie in Python I am not able to find a good solution to the following problem.

import numpy as np
import pandas as pd

class Preprocessor:
    mPath = None;
    df = None;

    def __init__(self, path):
        self.mPath = path;


    def read(self):
        self.df = pd.read_csv(self.mPath);
        return self.df;

    def __findUniqueGenres(self):
        setOfGenres = set();
        for index, genre in self.df['genres'].items():
            listOfGenreInMovie = genre.lower().split("|");
            for i, _genre in np.ndenumerate(listOfGenreInMovie):
                setOfGenres.add(_genre)
        return setOfGenres;

    def __prepareDataframe(self, genres):
        all_columns = set(["title", "movieId"]).union(genres)
        _df = pd.DataFrame(columns=all_columns)
        return _df;

    def __getRowTemplate(self, listOfColumns):
        _rowTemplate = {}
        for col in listOfColumns:
            _rowTemplate[col] = 0
        return _rowTemplate;

    def __createRow(self, rowTemplate, row):
        rowTemplate['title'] = row.title;
        rowTemplate['movieId'] = row.movieId;
        movieGenres = row.genres.lower().split("|");
        for movieGenre in movieGenres:
            rowTemplate[movieGenre] = 1;
        return rowTemplate;

    def tranformDataFrame(self):
        genres = self.__findUniqueGenres();
        print('### List of genres...', genres);
        __df = self.__prepareDataframe(genres); # Data frame with all required columns.
        rowTemplate = self.__getRowTemplate(__df.columns)
        print('### Row template looks like -->', rowTemplate)
        collection = []
        for index, row in self.df.iterrows():
            _rowToAdd = self.__createRow(rowTemplate, row);
            print('### Row looks like', _rowToAdd)
            collection.append(_rowToAdd)

        print('### Collection looks like', collection)
        return __df.append(collection)

Here, when I append _rowToAdd to collection, the collection ends up containing only copies of the last item (the last row of self.df).

Below are the logs (self.df has 3 rows here):

### List of genres... {'mystery', 'horror', 'comedy', 'drama', 'thriller', 'children', 'adventure'}
### Row template looks like --> {'title': 0, 'horror': 0, 'comedy': 0, 'drama': 0, 'children': 0, 'mystery': 0, 'movieId': 0, 'thriller': 0, 'adventure': 0}
### Row looks like {'title': 'Big Night (1996)', 'horror': 0, 'comedy': 1, 'drama': 1, 'children': 0, 'mystery': 0, 'movieId': 994, 'thriller': 0, 'adventure': 0}
### Row looks like {'title': 'Grudge, The (2004)', 'horror': 1, 'comedy': 1, 'drama': 1, 'children': 0, 'mystery': 1, 'movieId': 8947, 'thriller': 1, 'adventure': 0}
### Row looks like {'title': 'Cheetah (1989)', 'horror': 1, 'comedy': 1, 'drama': 1, 'children': 1, 'mystery': 1, 'movieId': 2039, 'thriller': 1, 'adventure': 1}
### Collection looks like [{'title': 'Cheetah (1989)', 'horror': 1, 'comedy': 1, 'drama': 1, 'children': 1, 'mystery': 1, 'movieId': 2039, 'thriller': 1, 'adventure': 1}, {'title': 'Cheetah (1989)', 'horror': 1, 'comedy': 1, 'drama': 1, 'children': 1, 'mystery': 1, 'movieId': 2039, 'thriller': 1, 'adventure': 1}, {'title': 'Cheetah (1989)', 'horror': 1, 'comedy': 1, 'drama': 1, 'children': 1, 'mystery': 1, 'movieId': 2039, 'thriller': 1, 'adventure': 1}]

I want my collection to look like this:

### [
{'title': 'Big Night (1996)', 'horror': 0, 'comedy': 1, 'drama': 1, 'children': 0, 'mystery': 0, 'movieId': 994, 'thriller': 0, 'adventure': 0},
{'title': 'Grudge, The (2004)', 'horror': 1, 'comedy': 0, 'drama': 0, 'children': 0, 'mystery': 1, 'movieId': 8947, 'thriller': 1, 'adventure': 0},
{'title': 'Cheetah (1989)', 'horror': 0, 'comedy': 0, 'drama': 0, 'children': 1, 'mystery': 0, 'movieId': 2039, 'thriller': 0, 'adventure': 1}
]

Dataset - https://grouplens.org/datasets/movielens/

Gaurav Gupta
  • Please supply an [MCV](https://stackoverflow.com/help/mcve) – jpp Jan 26 '18 at 19:22
  • What does self.__createRow(rowTemplate, row) do? If it involves copying dictionaries using `dict1 = dict2` syntax, you might end up with the last item on the collection. See this https://stackoverflow.com/questions/2465921/how-to-copy-a-dictionary-and-only-edit-the-copy – Ram Jan 26 '18 at 19:32

1 Answer

I understand the issue now: I was mutating the same dictionary object on every iteration, so every entry in collection referred to that one dictionary.

def tranformDataFrame(self):
    genres = self.__findUniqueGenres();
    print('### List of genres...', genres);
    __df = self.__prepareDataframe(genres); # Data frame with all required columns.
    rowTemplate = self.__getRowTemplate(__df.columns)
    print('### Row template looks like -->', rowTemplate)
    collection = []
    for index, row in self.df.iterrows():
        # Creating a fresh row template on every iteration prevents mutation of a shared dictionary.
        _rowToAdd = self.__createRow(self.__getRowTemplate(__df.columns), row);
        print('### Row looks like', _rowToAdd)
        collection.append(_rowToAdd)

    print('### Collection looks like', collection)
    return __df.append(collection)
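
The behaviour of the original loop can be reproduced without pandas. The following is a minimal sketch with hypothetical names (`template`, `fill`, `rows`, which are not part of the code above); it assumes only that the same dictionary is passed in on every iteration, as rowTemplate was in the question.

```python
# Hypothetical stand-ins for rowTemplate and __createRow from the question.
template = {"title": 0, "comedy": 0, "horror": 0}

def fill(row_template, title, genres):
    # Mutates the dictionary it was given and returns that same object.
    row_template["title"] = title
    for genre in genres:
        row_template[genre] = 1
    return row_template

rows = [("Big Night (1996)", ["comedy"]), ("Grudge, The (2004)", ["horror"])]

collection = []
for title, genres in rows:
    # Every call receives, mutates and returns the one shared dictionary.
    collection.append(fill(template, title, genres))

# Both list entries are references to the same object, so they show
# the last title and the genre flags accumulated over all rows.
print(collection[0] is collection[1])  # True
print(collection[0])  # {'title': 'Grudge, The (2004)', 'comedy': 1, 'horror': 1}
```

Appending a dictionary to a list stores a reference to it, not a copy, which is why the flags from earlier rows were never reset in the logs.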

There must be a way to cache the template and clone it for each row (instead of running the logic and rebuilding the dictionary every time), but this solution at least resolves this particular issue.
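
One way to cache the template and clone it is `dict.copy()`. This is a sketch under assumptions, not a drop-in replacement: `make_row` and the sample data are hypothetical, and a shallow copy is only sufficient here because the template values are plain integers and strings.

```python
def make_row(template, title, movie_id, genres):
    # Clone the cached template so the original is never modified.
    row = template.copy()
    row["title"] = title
    row["movieId"] = movie_id
    for genre in genres:
        row[genre] = 1
    return row

# Build the template once, with every column initialised to 0.
template = dict.fromkeys(["title", "movieId", "comedy", "drama", "horror"], 0)

movies = [
    ("Big Night (1996)", 994, ["comedy", "drama"]),
    ("Grudge, The (2004)", 8947, ["horror"]),
]

collection = [make_row(template, *movie) for movie in movies]

print(collection[0])
print(collection[1])
print(template)  # still all zeros
```

If the template ever held nested mutable values such as lists, `copy.deepcopy` from the standard library would be the safer choice.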

Gaurav Gupta