I've read a similar Python question, but it didn't help.

I have a CSV file of about half a million lines that looks like this:

contract-number,amendment-number,award-date,contract-value,supplier-name,contracting-entity
W8486,0,2014-04-14,14326000,"COMPANY A","Office of Llama Supplies"
W8487,0,2014-04-10,150000,"COMPANY B","Foo Bar Dept"
W8488,2,2014-03-24,146000,"COMPANY C","Armed Forces"
W8488,1,2014-03-03,68000,"COMPANY C","Armed Forces"
W8488,0,2014-02-17,27760,"COMPANY C","Armed Forces"
W8489,0,2014-02-14,51000000,"COMPANY B","Dept of Magical Affairs"

Many contract numbers appear more than once, one row per amendment. I'd like to write a Ruby script that transforms this data into a JSON file in which all rows sharing a contract number are nested under the same node, like this:

[{"W8486":
  {0:
    {
      "award-date": 2014-04-14,
      "contract-value": 14326000,
      "supplier-name": "COMPANY A",
      "contracting-entity": "Office of Llama Supplies"
    }
  }
},
{"W8487":
  {0:
    {
      "award-date": 2014-04-10,
      "contract-value": 150000,
      "supplier-name": "COMPANY B",
      "contracting-entity": "Foo Bar Dept"
    }
  }
},
{"W8488":
  {2:
    {
      "award-date": 2014-03-24,
      "contract-value": 146000,
      "supplier-name": "COMPANY C",
      "contracting-entity": "Armed Forces"
    }
  },
  {1:
    {
      "award-date": 2014-03-03,
      "contract-value": 68000,
      "supplier-name": "COMPANY C",
      "contracting-entity": "Armed Forces"
    }
  },
  {0:
    {
      "award-date": 2014-02-17,
      "contract-value": 27760,
      "supplier-name": "COMPANY C",
      "contracting-entity": "Armed Forces"
    }
  },
},
{"W8489":
  {0:
    {
      "award-date": 2014-02-14,
      "contract-value": 51000000,
      "supplier-name": "COMPANY B",
      "contracting-entity": "Dept of Magical Affairs"
    }
  }
}]

So far, I've managed to iterate through the CSV using CSV.foreach do |line|, putting every item in a hash, and to check whether line[0] == previousContractNumber.

But every time I write my JSON file, I get this error:

nesting of 100 is too deep (JSON::NestingError)

How could I get around this?

Many thanks!

jeanhuguesroy
  • There's probably an error in the way you're constructing your output. It'd help if you'd show your code. – Todd Agulnick Nov 29 '14 at 04:34
  • There could be major scalability problems waiting for you. CSV is read line-by-line, which is very scalable. Typically JSON is read as a single string, which is then parsed into separate objects and those are processed. Converting 500K lines into 500K objects is likely going to consume a lot of memory and slow your script down as memory is allocated, shifted around, and possibly paged. There are SAX-like JSON parsers out there so hopefully you're using one to process the incoming JSON stream. – the Tin Man Nov 29 '14 at 23:33
  • Thank you all so much! @Uri Agassi's solution worked. Your help will greatly help me clean up this data! In the end, I had 262K records, 128K of which shared a contract number with another. Running the script and writing the JSON file took 73 seconds. – jeanhuguesroy Nov 30 '14 at 21:45

1 Answer

Here is some code which should work:

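require 'csv'
require 'json'

# Group rows by contract number, then by amendment number.
result = Hash.new { |h, k| h[k] = {} }
# 'contracts.csv' is a placeholder path; headers: true skips the header row.
CSV.foreach('contracts.csv', headers: true) do |line|
  result[line[0]][line[1]] = {
    "award-date" => line[2],
    "contract-value" => line[3],
    "supplier-name" => line[4],
    "contracting-entity" => line[5]
  }
end
# 'contracts.json' is likewise a placeholder output path.
File.write('contracts.json', JSON.pretty_generate(result))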
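
For trying the grouping idea without the real file, here is a minimal self-contained sketch. It parses a few rows copied from the question out of a string instead of reading from disk. The helper name group_contracts and the ROWS sample are illustrative assumptions, not part of the CSV library. Converting contract-value to an integer is also an assumption about the desired output. The resulting structure is only three levels deep (contract, amendment, fields), well within the default nesting limit of 100 that JSON.generate enforces.

```ruby
require 'csv'
require 'json'

# A few sample rows copied from the question; a real script would read the file.
ROWS = <<~ROWS
  contract-number,amendment-number,award-date,contract-value,supplier-name,contracting-entity
  W8487,0,2014-04-10,150000,"COMPANY B","Foo Bar Dept"
  W8488,2,2014-03-24,146000,"COMPANY C","Armed Forces"
  W8488,1,2014-03-03,68000,"COMPANY C","Armed Forces"
ROWS

# Hypothetical helper: nests each row under its contract number,
# then under its amendment number.
def group_contracts(csv_text)
  # Unknown contract numbers start out as an empty hash of amendments.
  grouped = Hash.new { |hash, key| hash[key] = {} }
  CSV.parse(csv_text, headers: true) do |row|
    fields = row.to_h
    number = fields.delete('contract-number')
    amendment = fields.delete('amendment-number')
    # Assumption: contract values are whole numbers and should not stay strings.
    fields['contract-value'] = Integer(fields['contract-value'])
    grouped[number][amendment] = fields
  end
  grouped
end

grouped = group_contracts(ROWS)
json = JSON.pretty_generate(grouped)
puts json
```

Note that the amendment numbers end up as string keys ("0", "1", "2"), since JSON object keys are always strings.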
Uri Agassi