
System: WIN10

IDE: ANACONDA/Jupyter Lab

Language: Python version 3.7.3

Library: pandas version 1.0.1

Data source: https://grouplens.org/datasets/movielens/

Dataset: movies.csv; ratings.csv (ml-25m.zip)

I am running into a problem when trying to build a pivot table. The merged table has over 25M records, and my code keeps throwing the following error: IndexError: index 993158425 is out of bounds for axis 0 with size 993157686

Steps taken so far:

  1. checked the data frame for NaN values and cleaned those up
  2. searched online for the error message and could not find anything
  3. tried various ways of writing the pivot table: .pivot and .pivot_table
  4. looked at crosstab as a workaround; this does not work either

Code:

import pandas as pd

df1_movies = pd.read_csv('Data/movies.csv')
df1_ratings = pd.read_csv('Data/ratings.csv')

df1_main = pd.merge(df1_movies, df1_ratings, on='movieId')
table = df1_main.pivot_table(index='userId', columns='title', values='rating')

Error:

IndexError: index 993158425 is out of bounds for axis 0 with size 993157686
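
For scale, the pivot needs one cell for every (userId, title) pair, so the number of cells is the product of the two distinct counts. The sketch below runs that check on a tiny made-up frame (`df_demo` and its values are hypothetical stand-ins, not MovieLens data). Comparing against the 32-bit integer limit is only an assumption about what might be going wrong; nothing in this thread confirms it.

```python
import pandas as pd

# Tiny stand-in for the merged frame (hypothetical data, not MovieLens).
df_demo = pd.DataFrame({
    "userId": [1, 1, 2, 3, 3],
    "title": ["A", "B", "A", "C", "B"],
    "rating": [4.0, 3.5, 5.0, 2.0, 4.5],
})

# A pivot with index='userId' and columns='title' has this many cells.
n_users = df_demo["userId"].nunique()
n_titles = df_demo["title"].nunique()
n_cells = n_users * n_titles

# Assumption: a 32-bit index limit may be relevant to the error.
INT32_MAX = 2**31 - 1

print(f"{n_users} users x {n_titles} titles = {n_cells} cells")
print("fits in a 32-bit index:", n_cells <= INT32_MAX)
```

Running the same two `nunique()` calls on `df1_main` would show how large the full pivot would be before attempting it.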
Alfred Hull
  • 1) what do you expect the data to look like? 2) which movielens dataset are you using? I just used `ml-latest-small.zip` and didn't get the error. However there are only 5 rows of the pivot table that aren't 100% null... – Anders Swanson Mar 15 '20 at 00:48
  • This might help. Perhaps the data is too large for pivot tables in pandas at the moment? https://stackoverflow.com/questions/48492451/indexerror-index-1491188345-is-out-of-bounds-for-axis-0-with-size-1491089723 – David Erickson Mar 15 '20 at 00:59
  • @David Erickson, wow! I hope that isn't the case. I just read through the git repo on this, and it seems the conversation went stale a year ago :( – Alfred Hull Mar 15 '20 at 02:31
  • @Anders Swanson, I am going to download that file set now and test it. I was working with the larger file set (ml-25m.zip). As there are no known workarounds for this at the moment, do you know of another platform that handles large matrices? – Alfred Hull Mar 15 '20 at 02:35
  • Reconsider generating such a wide data frame with every distinct movie title in its own column. What analysis do you hope to run with such a setup? – Parfait Mar 15 '20 at 02:50
  • @Parfait I was trying to build a cross-tabulated matrix showing which userId rated which movie. With this matrix, my intent is to build a tool that lets me either draw correlations on ratings or cluster userIds by similar movie taste. – Alfred Hull Mar 16 '20 at 14:52

1 Answer


Thanks to David Erickson for pointing to the open issue on this topic:

There is an open pandas issue describing this error. As of 31 Aug 2020, the only workaround appears to be reducing the size of your dataset.
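
One way to reduce the dataset is to keep only titles and users with a minimum number of ratings before pivoting. The following is a minimal sketch on a small made-up frame. The helper `reduce_ratings`, the thresholds, and `df_demo` are all hypothetical names and values chosen for illustration; suitable thresholds for ml-25m would need to be found by experiment.

```python
import pandas as pd


def reduce_ratings(df, min_user_ratings, min_title_ratings):
    """Keep only titles, then users, with at least the given number of ratings."""
    # Drop rarely rated titles first; this shrinks the column count.
    title_counts = df["title"].value_counts()
    keep_titles = title_counts[title_counts >= min_title_ratings].index
    df = df[df["title"].isin(keep_titles)]

    # Then drop users with few remaining ratings; this shrinks the row count.
    # Note: this single pass can leave some titles below the title threshold.
    user_counts = df["userId"].value_counts()
    keep_users = user_counts[user_counts >= min_user_ratings].index
    return df[df["userId"].isin(keep_users)]


# Tiny stand-in for the merged frame (hypothetical data, not MovieLens).
df_demo = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 3, 4],
    "title": ["A", "B", "C", "A", "B", "A", "D"],
    "rating": [4.0, 3.5, 2.0, 5.0, 4.0, 3.0, 1.0],
})

small = reduce_ratings(df_demo, min_user_ratings=2, min_title_ratings=2)
table = small.pivot_table(index="userId", columns="title", values="rating")
print(table)
```

On the real data, the same call would take `df1_main` in place of `df_demo`, with the thresholds raised until the pivot succeeds.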

Alfred Hull