
I am trying to import a CSV file into Amazon Personalize.

My schema looks like this:

{
  "type": "record",
  "name": "Items",
  "namespace": "com.amazonaws.personalize.schema",
  "fields": [
      {
          "name": "ITEM_ID",
          "type": "string"
      },
      {
          "name": "AUTHOR",
          "type": "string",
          "categorical": true
      },
      {
          "name": "COUNTRY",
          "type": "string",
          "categorical": true
      },
      {
          "name": "CITY",
          "type": "string",
          "categorical": true
      },
      {
          "name": "STYLES",
          "type": "string",
          "categorical": true
      },
      {
          "name": "CATEGORIES",
          "type": "string",
          "categorical": true
      }
  ],
  "version": "1.0"
}

The first few rows of data look like this:

ITEM_ID,AUTHOR,COUNTRY,CITY,STYLES,CATEGORIES
5b4253a7e12434f55875381e,5acd193f48ed4b9b3add5be6,US,city_us_austin,5ad45bc575eb016f3cdb562b|571aa21888a4fd9934f0fd7b|571aa21888a4fd9934f0fd79|5ad45e8c75eb016f3cdb563f|5b4ea35abaa12285687a1f47,593a866a082c26444eab2d3c|5a8e4820fc112d414fbc1be3
5b4253a7e12434f55875381f,5acd193f48ed4b9b3add5be6,US,city_us_jackson,571aa21888a4fd9934f0fd82|57600e419e4959cd069658eb|5ad45c3a75eb016f3cdb5631|571aa21888a4fd9934f0fd7b|57aaa7094a393f531ace43f0|575e6d8e34ca56f742bea1c8|571aa21888a4fd9934f0fd8f,593a866a082c26444eab2d3c|5a8e4820fc112d414fbc1be3

I get this error:

Failed to create a data import job for item dataset.
Input csv has rows that do not conform to the dataset schema. Please ensure all required data fields are present and that they are of the type specified in the schema.

How can I figure out what is wrong with the CSV? It's thousands of lines long, so I have no idea whether it's a general mistake or something wrong on a specific line.

Matt
  • Does it fail for a CSV file with 500 records? If yes, remove columns one by one until only ITEM_ID is left or the import starts working; that tells you whether one of the columns is the problem. If no, something is wrong with one or more rows: cut the file in half and check whether it still fails, then keep halving until the set is small enough to analyze. You could also analyze it with a self-made tool, in Python for example. – PatrykMilewski Oct 17 '19 at 12:20
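The halving idea from the comment can be sketched in Python. This is only an illustration: `find_failing_rows` and `import_ok` are hypothetical names, and `import_ok` stands for whatever check you use to decide that a subset of rows imports cleanly (in practice, a trial import of that subset). The stand-in check below simply treats any row with an empty field as failing.

```python
from typing import Callable, List, Sequence


def find_failing_rows(
    rows: Sequence[str],
    import_ok: Callable[[Sequence[str]], bool],
) -> List[str]:
    """Recursively halve `rows` until the individual failing rows are isolated."""
    if not rows or import_ok(rows):
        return []  # nothing wrong in this slice
    if len(rows) == 1:
        return [rows[0]]  # a single row that still fails is a culprit
    mid = len(rows) // 2
    return find_failing_rows(rows[:mid], import_ok) + find_failing_rows(
        rows[mid:], import_ok
    )


# Stand-in check for demonstration only: a row "fails" if any field is empty.
def fake_import_ok(rows: Sequence[str]) -> bool:
    return all(all(field != "" for field in row.split(",")) for row in rows)


sample = ["a,1", "b,", "c,3", ",4"]
bad = find_failing_rows(sample, fake_import_ok)
print(bad)
```

When each check is a real import job, every call is slow, so it may be worth running a cheap local check first and bisecting only if that finds nothing.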

2 Answers


In my experience, as long as the dataset has no more than about 250 thousand records, you can still check the data in Excel using data filters and the corresponding search functions. For anything larger, look into Notepad++ and regular expressions. Your problem may be one of the following:

(1) There's a missing comma. This would misalign your data and keep it from being processed.
(2) There's a missing ITEM_ID value. For Items, Personalize requires ITEM_ID and at least one metadata field. It might give this error if a row is missing ITEM_ID, or has ITEM_ID but no values in any other metadata field.
(3) STYLES and/or CATEGORIES exceeds 256 characters. There is probably a limit on string length, but I can't get a clear answer on this from the developer guide. I would guess it's 256 characters. If I were betting money, this would be my guess at your problem.
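The three checks above can also be run locally before uploading. The following is a rough sketch using only the standard `csv` module; `check_rows` is a hypothetical helper, not part of any AWS tooling, and `max_len` is a parameter because the 256-character figure is a guess (a comment below reports 1000 as of Jan 2020).

```python
import csv
import io


def check_rows(csv_file, max_len=256):
    """Yield (line_number, problem) for rows that look suspicious."""
    reader = csv.reader(csv_file)
    header = next(reader)
    id_index = header.index("ITEM_ID")
    for row in reader:
        line_no = reader.line_num
        # (1) A missing or extra comma changes the number of fields.
        if len(row) != len(header):
            yield line_no, f"expected {len(header)} fields, found {len(row)}"
            continue
        # (2) ITEM_ID must be present.
        if not row[id_index].strip():
            yield line_no, "missing ITEM_ID"
        for name, value in zip(header, row):
            if not value.strip() and name != "ITEM_ID":
                yield line_no, f"empty value in {name}"
            # (3) Assumed string length limit; adjust max_len as needed.
            elif len(value) > max_len:
                yield line_no, f"{name} is {len(value)} characters (assumed limit {max_len})"


# Small in-memory sample; the column names are a subset of the question's schema.
sample = io.StringIO(
    "ITEM_ID,AUTHOR,STYLES\n"
    "id1,auth1,a|b\n"
    "id2,auth2\n"
    ",auth3,a\n"
    "id4,auth4," + "x" * 300 + "\n"
)
problems = list(check_rows(sample))
for line_no, problem in problems:
    print(line_no, problem)
```

For a real file, something like `with open("items.csv", newline="") as f: ...` would replace the in-memory sample (the filename here is made up).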

starrywrites
  • It did appear to be item 3. Although it's not clearly stated in the docs, there seems to be a 256-character limit on string fields. – Matt Oct 21 '19 at 11:36
  • For future readers: the string limit is 1000 characters as of Jan 2020. – Dev Jan 07 '20 at 15:01
  • Another possible reason for this type of failure: null values. You want all of the features to have a non-empty value; otherwise you might get failures without any apparent reason. – orcaman May 27 '20 at 09:37
  • Another case: if you are using pandas to convert to CSV, note that booleans are false/true in JSON but False/True in Python. As of now Personalize doesn't accept the capitalized form, so I had to convert the column values to lowercase for boolean metadata (CSV columns). I was able to debug in the AWS interface by entering one record into the dataset manually. – Saint Mar 11 '21 at 22:08

Here is a different approach to the problem that may be useful in other cases. I had the same issue, but with int columns containing null values. By default, pandas converts such columns to a float data type, which the AWS Personalize dataset import job will not accept if you have defined these columns as int or long. Long story short, converting these columns to a nullable integer type solves the problem:

df.column_name = df.column_name.astype(pd.Int32Dtype())
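To see what the conversion changes in the exported file, here is a small sketch. The `PRICE` column and its values are made up for illustration; only the `astype(pd.Int32Dtype())` call comes from the line above.

```python
import pandas as pd

# Hypothetical data: an integer column with one missing value.
df = pd.DataFrame({"ITEM_ID": ["a", "b", "c"], "PRICE": [10, None, 30]})

# The missing value forces pandas to store the column as float64,
# so the CSV contains 10.0 and 30.0.
float_csv = df.to_csv(index=False)

# Nullable integer dtype: values are written as 10 and 30,
# and the missing value is written as an empty cell.
df["PRICE"] = df["PRICE"].astype(pd.Int32Dtype())
int_csv = df.to_csv(index=False)

print(float_csv)
print(int_csv)
```

Note that the missing value is still written as an empty cell after the conversion, so if empty values are themselves a problem for your import, they need to be filled or dropped separately.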
Dark Templar