I have a list of tuples, each holding an OrderedDict and a label, that looks like this. The actual list contains ~22,000 elements:

o_dict_list = [(OrderedDict([('StreetNamePreType', 'ROAD'), ('StreetName', 'Coffee')]), 'Ambiguous'),
       (OrderedDict([('StreetNamePreType', 'AVENUE'), ('StreetName', 'Washington')]), 'Ambiguous'),
       (OrderedDict([('StreetNamePreType', 'ROAD'), ('StreetName', 'Quartz')]), 'Ambiguous')]

When I try to convert the entire list to a pandas DataFrame, using the question and solution noted here, I get the following error:

IndexError: string index out of range

For reference, the line of code that is causing the error is here:

pd.DataFrame([o_dict_list[i][0] for i, j in enumerate(o_dict_list)])

When I trim the list down to 1,000 elements, the DataFrame populates with no issue. The error occurs only when I use the entire list of ~22K elements.

I am using:

Python 3.6.5 :: Anaconda, Inc., pandas==0.23.0, and numpy 1.15.2 on a Windows 10 machine.

Does anyone know why I get the IndexError when I use the list of ~22K elements?

Update: As noted below, I was able to resolve this issue by breaking up the list and testing each piece. Doing so let me find the part of the list that was causing the code to fail. Thanks for the help.

grantaguinaldo

1 Answer

Clearly some of your data is corrupt or invalid or not in the expected format. You say the first 1000 elements are OK, so try the next 10000, and keep bisecting the data until you find the subset which causes the problem.

log2(22000) is a little under 15, so you will need at most 15 bisections to narrow down where your problem is.
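
The bisection described above can also be automated. The following is only a minimal sketch, not a definitive implementation: `find_first_bad_index` and `build` are hypothetical names, and it assumes the failure is caused by individual elements, so that any slice containing a bad element fails and any slice without one succeeds. The demo plants an empty string as the malformed element, which is just one hypothetical way to get `IndexError: string index out of range`; the real cause in your data may be different.

```python
from collections import OrderedDict


def find_first_bad_index(items, build):
    """Return the index of the first element that makes build() fail, or None.

    Assumption: failures are element-local, i.e. a slice fails exactly
    when it contains at least one bad element.
    """
    def fails(chunk):
        try:
            build(chunk)
        except Exception:
            return True
        return False

    if not fails(items):
        return None  # the whole list builds fine

    lo, hi = 0, len(items)  # invariant: items[lo:hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(items[lo:mid]):
            hi = mid  # a bad element is in the left half
        else:
            lo = mid  # otherwise it must be in the right half
    return lo


# Demo with synthetic data shaped like the list in the question.
good = (OrderedDict([("StreetNamePreType", "ROAD"), ("StreetName", "Coffee")]),
        "Ambiguous")
data = [good] * 22000
data[13577] = ""  # hypothetical malformed element


def build(chunk):
    # Stand-in for the DataFrame construction; "" [0] raises IndexError.
    return [item[0] for item in chunk]


bad_index = find_first_bad_index(data, build)
print(bad_index, repr(data[bad_index]))
```

With pandas, `build` could instead be something like `lambda chunk: pd.DataFrame([item[0] for item in chunk])`, so the search exercises the same call that fails on the full list. Once the index is known, inspecting `data[bad_index]` directly usually makes the formatting problem obvious.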

John Zwinck