
I have been given a 15 GB .txt file that is formatted like this:

{
  "_score": 1.0,
  "_index": "newsvit",
  "_source": {
    "content": " \u0641\u0647\u06cc\u0645\u0647 \u062d\u0633\u0646\u200c\u0645\u06cc\u0631\u06cc:  ",
    "title": "\u06a9\u0627\u0631\u0647\u0627\u06cc \u0642\u0627\u0644\u06cc\u0628\u0627\u0641 ",
    "lead": "\u062c\u0627\u0645\u0639\u0647 > \u0634\u0647\u0631\u06cc - 
    \u0645\u06cc\u0632\u06af\u0631\u062f\u06cc \u062f\u0631\u0628\u0627\u0631\u0647 .",
    "agency": "13",
    "date_created": 1494518193,
    "url": "http://www.khabaronline.ir/(X(1)S(bud4wg3ebzbxv51mj45iwjtp))/detail/663749/society/urban",
    "image": "uploads/2017/05/11/1589793661.jpg",
    "category": "15"
  },
  "_type": "news",
  "_id": "2981643"
}
{
  "_score": 1.0,
  "_index": "newsvit",
  "_source": {
    "content": "\u0645/\u0630",
    "title": "\u0645\u0639\u0646\u0648\u06cc\u062a \u062f\u0631 \u0639\u0635\u0631 ",
    "lead": "\u0645\u062f\u06cc\u0631 \u0645\u0624\u0633\u0633\u0647 \u0639\u0644\u0645\u06cc \u0648 \u067e\u0698\u0648\u0647\u0634\u06cc \u0627\u0628\u0646\u200c\u0633\u06cc\u0646\u0627 \u062f\u0631 .",
    "agency": "1",
    "date_created": 1494521817,
    "url": "http://www.farsnews.com/13960221001386",
    "image": "uploads/2017/05/11/1713799235.jpg",
    "category": "20"
  },
  "_type": "news",
  "_id": "2981951"
}
....

and I want to import it into Elasticsearch. I have tried the Bulk API, but since it only accepts a specific style of JSON, I can't convert the whole 15 GB file into the bulk format. I also tried Logstash, but then fields like content wouldn't be searchable and queryable.

What's the most efficient way of importing this file into Elasticsearch?


1 Answer


First off, this appears to be a JSON-like export from an Elasticsearch index named newsvit. So if you still have access, or can get access, to that index, it's possible to reindex from the remote cluster into your own.
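If that's possible, the Reindex API can pull the documents straight from the old cluster. A minimal sketch, assuming the remote host is http://old-cluster:9200 (illustrative) and has been whitelisted via reindex.remote.whitelist on your own cluster:

import requests

# hypothetical hosts; adjust both to your clusters
reindex_body = {
    "source": {
        "remote": {"host": "http://old-cluster:9200"},
        "index": "newsvit",
    },
    "dest": {"index": "newsvit"},
}

resp = requests.post("http://localhost:9200/_reindex", json=reindex_body)
print(resp.json())

For a job this size you'd probably also want to append ?wait_for_completion=false and follow the returned task instead of blocking on the response.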


With that being said, I'd recommend writing a script (in Python for instance; it could be bash too) to convert this into valid JSON. Your file is already almost JSON -- it's just missing the wrapping brackets and the commas separating the individual objects, i.e. [{}, {}, {}].

You can prepend [ and append ] to any text file quite easily. Alternatively, there are also text editors that can open such large files -- on macOS that would for example be Hex Fiend, and there's probably a Windows alternative too.
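If an editor can't handle the file, a small streaming copy does the same wrapping without holding all 15 GB in memory (file names here are illustrative):

# wrap the export in [ ... ] by streaming it into a new file
with open('news_export.txt', 'r', encoding='utf-8') as src, \
     open('news_export_wrapped.json', 'w', encoding='utf-8') as dst:
    dst.write('[')
    for chunk in iter(lambda: src.read(1024 * 1024), ''):  # 1 MB at a time
        dst.write(chunk)
    dst.write(']')

Note this only adds the brackets; the commas between the objects still have to be introduced, which is what the regex step below handles.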

Once you've done that, you can write a script to split the text file by a regex, say /^\}$/gm. Once you split it, you can join the pieces back with commas and then either

  1. save the documents in the bulk (newline-delimited) format and send the file with curl's --data-binary option (see the sketch after this list)
  2. or use the Python client's bulk helpers -- example again in Python, below the loader script
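For the first option, the _bulk endpoint expects newline-delimited JSON: an action line followed by a source line for every document. A minimal sketch of writing such a file, with a stand-in `docs` list (the loader script below builds the real one as `my_json_list`) and the index name and id taken from the sample:

import json

# stand-in for the parsed objects; the loader script below
# produces the real list as `my_json_list`
docs = [
    {"_id": "2981643", "_source": {"title": "...", "category": "15"}},
]

with open('bulk_payload.ndjson', 'w', encoding='utf-8') as out:
    for doc in docs:
        # action line: target index and document id
        out.write(json.dumps({"index": {"_index": "newsvit", "_id": doc["_id"]}}) + '\n')
        # source line: the document body itself
        out.write(json.dumps(doc["_source"], ensure_ascii=False) + '\n')

The resulting file can then be sent with curl -H 'Content-Type: application/x-ndjson' --data-binary @bulk_payload.ndjson http://localhost:9200/_bulk.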

Getting started with the Python loader script:

import json
import re

# load the large text file
text_str = '...'

# drop the original line breaks, then re-introduce a single break
# between consecutive objects, i.e. replace '}{' with '}\n{'
normalized = re.sub(r'}{', '}\n{', re.sub(r'\n', '', text_str))

# split on the introduced breaks and parse every object into a dict
my_json_list = [json.loads(chunk) for chunk in re.split('\n', normalized)]
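Once `my_json_list` is built, the Python client's bulk helpers can take it directly. A minimal indexing sketch, assuming a cluster at localhost:9200 and a target index called newsvit (both illustrative):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

def to_actions(docs):
    # turn each parsed object into a bulk index action, reusing the
    # original id and keeping only the _source payload
    for doc in docs:
        yield {
            "_index": "newsvit",
            "_id": doc.get("_id"),
            "_source": doc["_source"],
        }

helpers.bulk(es, to_actions(my_json_list))

helpers.bulk chunks the actions into batches (500 per request by default), so the individual requests stay a manageable size even for a large document set.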
  • How should I split the file by regex? I tried turning the file into a string and then using re.split, but the result doesn't come out complete. Could you please give a code example? – lydal Sep 25 '20 at 17:52
  • Turns out it's not as easy as splitting by just `}\n{`, because your text file does contain some other line breaks that are actually not allowed in JSON. Above is the `re.split` example I was talking about. Hope it helps. – Joe - GMapsBook.com Sep 25 '20 at 18:53