0

I have a product data table given to me in text file and the format of the data is like below:

**START

IB 1107518415

BI Paperback

BC JHBC

BC JFDD

CO United Kingdom

ED Alvarez, R. Michael

IU 46 b/w illus. 2 maps 18 tables

MP No

PD 20160307

I3 9781107518414

**

Basically, a TAG system is followed where each TAG has a unique 2-digit identifier followed by the relevant data itself. A CR/LF will be at the end of each field.

** Markers is used to indicate the beginning of a new record.

The file is 3 GB and it has 6 million such record, each record having upto 43 fields and some fields having multiple values like the "BC" field in the above sample, I'd need both the values.

What would be the best way to get this file into a collection?

Thanks Mohamed

Mohamed
  • 1
  • 1

1 Answers1

0

This could be done two steps:

Step 1. Convert the file to a usable format (CSV or JSON)

You could use a regular expression to get all the fields such as

/^\*\* START IB (.+) BI (.+) BC (.+) BC (.+) CO (.+) ED (.+) MP (.+) PD (.+) PA (.+) NP (.+) RP (.+) RI (.+) RE (.+) DI (.+) PU (.+) YP (.+) DE (.+) EA (.+) RF (.+) WE (.+) SG (.+) GC (.+) I3 (.+)/

You can see it working here

Step 2. Import the file using mongoimport which is answered here

iiome
  • 1
  • 1
  • Hi, Thanks for the reply. Am i correct in saying that the data across all records should be in the same order as you have written your expression? This may not be the case since there are 43 fields, some records will have some fields, some will not, some records might have multiple values for a field, some will have none etc etc. – Mohamed May 29 '20 at 16:31
  • Correct. If you have multiple values for a field I would recommend using JSON and use an array for them. – iiome May 29 '20 at 21:27
  • I would also use multiple simpler regex to build the JSON part. – iiome May 29 '20 at 21:35