
I have been given a 15 GB .txt file that is formatted like this:

{
  "_score": 1.0,
  "_index": "newsvit",
  "_source": {
    "content": " \u0641\u0647\u06cc\u0645\u0647 \u062d\u0633\u0646\u200c\u0645\u06cc\u0631\u06cc:  ",
    "title": "\u06a9\u0627\u0631\u0647\u0627\u06cc \u0642\u0627\u0644\u06cc\u0628\u0627\u0641 ",
    "lead": "\u062c\u0627\u0645\u0639\u0647 > \u0634\u0647\u0631\u06cc - 
    \u0645\u06cc\u0632\u06af\u0631\u062f\u06cc \u062f\u0631\u0628\u0627\u0631\u0647 .",
    "agency": "13",
    "date_created": 1494518193,
    "url": "http://www.khabaronline.ir/(X(1)S(bud4wg3ebzbxv51mj45iwjtp))/detail/663749/society/urban",
    "image": "uploads/2017/05/11/1589793661.jpg",
    "category": "15"
  },
  "_type": "news",
  "_id": "2981643"
}
{
  "_score": 1.0,
  "_index": "newsvit",
  "_source": {
    "content": "\u0645/\u0630",
    "title": "\u0645\u0639\u0646\u0648\u06cc\u062a \u062f\u0631 \u0639\u0635\u0631 ",
    "lead": "\u0645\u062f\u06cc\u0631 \u0645\u0624\u0633\u0633\u0647 \u0639\u0644\u0645\u06cc \u0648 \u067e\u0698\u0648\u0647\u0634\u06cc \u0627\u0628\u0646\u200c\u0633\u06cc\u0646\u0627 \u062f\u0631 .",
    "agency": "1",
    "date_created": 1494521817,
    "url": "http://www.farsnews.com/13960221001386",
    "image": "uploads/2017/05/11/1713799235.jpg",
    "category": "20"
  },
  "_type": "news",
  "_id": "2981951"
}
....

and I want to import it into Elasticsearch. I have tried the Bulk API, but since it only accepts a specific style of JSON, I can't convert the whole 15 GB file into the bulk format. I also tried Logstash, but then fields like content wouldn't be searchable and queryable.

What's the most efficient way of importing this file into Elasticsearch?


1 Answer


First off, this appears to be a JSON-like export from an Elasticsearch index named newsvit. So if you still have access, or can get access, to that index, it's possible to reindex from the remote cluster into your own.
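If that's possible, the Reindex API can pull the documents straight from the old cluster. A minimal sketch, assuming the remote host is http://old-cluster:9200 (illustrative) and has been whitelisted via reindex.remote.whitelist on your own cluster:

import requests

# hypothetical hosts; adjust both to your clusters
reindex_body = {
    "source": {
        "remote": {"host": "http://old-cluster:9200"},
        "index": "newsvit",
    },
    "dest": {"index": "newsvit"},
}

resp = requests.post("http://localhost:9200/_reindex", json=reindex_body)
print(resp.json())

For a job this size you'd probably also want to append ?wait_for_completion=false and follow the returned task instead of blocking on the response.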


With that being said, I'd recommend writing a script (in Python for instance; it could be bash too) to convert this into valid JSON. Your file is already almost JSON -- it's just missing the wrapping brackets and the commas separating the individual objects, i.e. [{}, {}, {}].

You can prepend [ and append ] to any text file quite easily. Alternatively, there are also text editors that can open such large files -- on macOS that would for example be Hex Fiend, and there's probably a Windows alternative too.
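If an editor can't handle the file, a small streaming copy does the same wrapping without holding all 15 GB in memory (file names here are illustrative):

# wrap the export in [ ... ] by streaming it into a new file
with open('news_export.txt', 'r', encoding='utf-8') as src, \
     open('news_export_wrapped.json', 'w', encoding='utf-8') as dst:
    dst.write('[')
    for chunk in iter(lambda: src.read(1024 * 1024), ''):  # 1 MB at a time
        dst.write(chunk)
    dst.write(']')

Note this only adds the brackets; the commas between the objects still have to be introduced, which is what the regex step below handles.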

Once you've done that, you can write a script to split the text file by a regex, say /^\}$/gm. Once you split it, you can join the pieces back with commas and then either

  1. save the documents in the bulk (newline-delimited) format and send the file with curl's --data-binary option (see the sketch after this list)
  2. or use the Python client's bulk helpers -- example again in Python, below the loader script
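For the first option, the _bulk endpoint expects newline-delimited JSON: an action line followed by a source line for every document. A minimal sketch of writing such a file, with a stand-in `docs` list (the loader script below builds the real one as `my_json_list`) and the index name and id taken from the sample:

import json

# stand-in for the parsed objects; the loader script below
# produces the real list as `my_json_list`
docs = [
    {"_id": "2981643", "_source": {"title": "...", "category": "15"}},
]

with open('bulk_payload.ndjson', 'w', encoding='utf-8') as out:
    for doc in docs:
        # action line: target index and document id
        out.write(json.dumps({"index": {"_index": "newsvit", "_id": doc["_id"]}}) + '\n')
        # source line: the document body itself
        out.write(json.dumps(doc["_source"], ensure_ascii=False) + '\n')

The resulting file can then be sent with curl -H 'Content-Type: application/x-ndjson' --data-binary @bulk_payload.ndjson http://localhost:9200/_bulk.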

Getting started with the Python loader script:

import json
import re

# load the large text file
text_str = '...'

# drop the original line breaks, then re-introduce a single break
# between consecutive objects, i.e. replace '}{' with '}\n{'
normalized = re.sub(r'}{', '}\n{', re.sub(r'\n', '', text_str))

# split on the introduced breaks and parse every object into a dict
my_json_list = [json.loads(chunk) for chunk in re.split('\n', normalized)]
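Once `my_json_list` is built, the Python client's bulk helpers can take it directly. A minimal indexing sketch, assuming a cluster at localhost:9200 and a target index called newsvit (both illustrative):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

def to_actions(docs):
    # turn each parsed object into a bulk index action, reusing the
    # original id and keeping only the _source payload
    for doc in docs:
        yield {
            "_index": "newsvit",
            "_id": doc.get("_id"),
            "_source": doc["_source"],
        }

helpers.bulk(es, to_actions(my_json_list))

helpers.bulk chunks the actions into batches (500 per request by default), so the individual requests stay a manageable size even for a large document set.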
  • How should I split the file by regex? I tried turning the file into a string and then using re.split, but the result doesn't come out complete. Could you please give a code example? – lydal Sep 25 '20 at 17:52
  • Turns out it's not as easy as splitting by just `}\n{`, because your text file does contain some other line breaks that are actually not allowed in JSON. Above is the `re.split` example I was talking about. Hope it helps. – Joe - GMapsBook.com Sep 25 '20 at 18:53