
I have collected a large dataset from Twitter. The file (twitter.json) contains lines like this:

    [ {"created_at":"Sun Apr 16 00:00:10 +0000 2017","id":853397569807958016,"id_str":"853397569807958016","text":"\u3042\u3048\u3066\u8a00\u3046\u3051\u3069\u3001\u6642\u9593\u3060\u3088\uff01\u4f55\u304b\u3059\u308b\u3053\u3068\u3001\u3042\u3063\u305f\u3093\u3058\u3083\u306a\u3044\uff1f(\u30a8\u30b3\u30ed)","source":"\u003ca href=\"http:\/\/makebot.sh\" rel=\"nofollow\"\u003e\u3077\u3088\u3077\u3088\u30c9\u30e9\u30deCDbot\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":2230278991,"id_str":"2230278991","name":"\u3077\u3088\u3077\u3088\u304a\u307e\u3051\u30dc\u30a4\u30b9bot","screen_name":"puyo_cd_bot","location":null,"url":"http:\/\/twpf.jp\/puyo_cd_bot","description":"\u53ea\u4eca\u591a\u5fd9\u306e\u305f\u3081\u66f4\u65b0\u304c\u505c\u6ede\u3057\u3066\u3044\u307e\u3059\u3001\u3054\u4e86\u627f\u304f\u3060\u3055\u3044\u3002\u3077\u3088\u3077\u3088\u30c9\u30e9\u30decd\u306e\u304a\u307e\u3051\u30dc\u30a4\u30b9\u306e\u5b9a\u671f\u3064\u3076\u3084\u304d\u3001\u4e00\u90e8\u30ea\u30d7\u30e9\u30a4\u3067\u306e\u53cd\u5fdc\u3092\u8003\u3048\u3066\u3044\u307e\u3059\u3002\u975e\u516c\u5f0f\u3002\u767b\u9332\u6e08\u307f\u30ad\u30e3\u30e9\u306a\u3069\u8a73\u3057\u304f\u306f\u3064\u3044\u3077\u308d\u306b\u3066","protected":false,"verified":false,"followers_count":181,"friends_count":115,"listed_count":3,"favourites_count":0,"statuses_count":44139,"created_at":"Wed Dec 04 17:43:08 +0000 2013","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":"ja","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"1DA1F2","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/378800000837231449\/beca2ed4c8ce917b37dcbe188d0f9e31_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/378800000837231449\/beca2ed4c8ce917b37dcbe188d0f9e31_normal.jpeg","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"ja","timestamp_ms":"1492300810659"}
    , { ... }
    ...  
    , { ... } 
    ]

To give you a better view, here is the first tweet pretty-printed after validation:

   {"created_at": "Sun Apr 16 00:00:10 +0000 2017",
    "id": 853397569807958016,
    "id_str": "853397569807958016",
    "text": "\u3042\u3048\u3066\u8a00\u3046\u3051\u3069\u3001\u6642\u9593\u3060\u3088\uff01\u4f55\u304b\u3059\u308b\u3053\u3068\u3001\u3042\u3063\u305f\u3093\u3058\u3083\u306a\u3044\uff1f(\u30a8\u30b3\u30ed)",
    "source": "\u003ca href=\"http:\/\/makebot.sh\" rel=\"nofollow\"\u003e\u3077\u3088\u3077\u3088\u30c9\u30e9\u30deCDbot\u003c\/a\u003e",
    "truncated": false,
    "in_reply_to_status_id": null,
    "in_reply_to_status_id_str": null,
    "in_reply_to_user_id": null,
    "in_reply_to_user_id_str": null,
    "in_reply_to_screen_name": null,
    "user": {
        "id": 2230278991,
        "id_str": "2230278991",
        "name": "\u3077\u3088\u3077\u3088\u304a\u307e\u3051\u30dc\u30a4\u30b9bot",
        "screen_name": "puyo_cd_bot",
        "location": null,
        "url": "http:\/\/twpf.jp\/puyo_cd_bot",
        "description": "\u53ea\u4eca\u591a\u5fd9\u306e\u305f\u3081\u66f4\u65b0\u304c\u505c\u6ede\u3057\u3066\u3044\u307e\u3059\u3001\u3054\u4e86\u627f\u304f\u3060\u3055\u3044\u3002\u3077\u3088\u3077\u3088\u30c9\u30e9\u30decd\u306e\u304a\u307e\u3051\u30dc\u30a4\u30b9\u306e\u5b9a\u671f\u3064\u3076\u3084\u304d\u3001\u4e00\u90e8\u30ea\u30d7\u30e9\u30a4\u3067\u306e\u53cd\u5fdc\u3092\u8003\u3048\u3066\u3044\u307e\u3059\u3002\u975e\u516c\u5f0f\u3002\u767b\u9332\u6e08\u307f\u30ad\u30e3\u30e9\u306a\u3069\u8a73\u3057\u304f\u306f\u3064\u3044\u3077\u308d\u306b\u3066",
        "protected": false,
        "verified": false,
        "followers_count": 181,
        "friends_count": 115,
        "listed_count": 3,
        "favourites_count": 0,
        "statuses_count": 44139,
        "created_at": "Wed Dec 04 17:43:08 +0000 2013",
        "utc_offset": null,
        "time_zone": null,
        "geo_enabled": false,
        "lang": "ja",
        "contributors_enabled": false,
        "is_translator": false,
        "profile_background_color": "C0DEED",
        "profile_background_image_url": "http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png",
        "profile_background_image_url_https": "https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png",
        "profile_background_tile": false,
        "profile_link_color": "1DA1F2",
        "profile_sidebar_border_color": "C0DEED",
        "profile_sidebar_fill_color": "DDEEF6",
        "profile_text_color": "333333",
        "profile_use_background_image": true,
        "profile_image_url": "http:\/\/pbs.twimg.com\/profile_images\/378800000837231449\/beca2ed4c8ce917b37dcbe188d0f9e31_normal.jpeg",
        "profile_image_url_https": "https:\/\/pbs.twimg.com\/profile_images\/378800000837231449\/beca2ed4c8ce917b37dcbe188d0f9e31_normal.jpeg",
        "default_profile": true,
        "default_profile_image": false,
        "following": null,
        "follow_request_sent": null,
        "notifications": null
    },
    "geo": null,
    "coordinates": null,
    "place": null,
    "contributors": null,
    "is_quote_status": false,
    "retweet_count": 0,
    "favorite_count": 0,
    "entities": {
        "hashtags": [],
        "urls": [],
        "user_mentions": [],
        "symbols": []
    },
    "favorited": false,
    "retweeted": false,
    "filter_level": "low",
    "lang": "ja",
    "timestamp_ms": "1492300810659"
   }

Problem:

I tried to import this .json file into Elasticsearch using the following command:

curl -XPOST 'http://localhost:9200/twitter/tweet/1' --data-binary "@/Users/jz/Documents/elasticsearch-5.3.0/twitter.json" 

but it gives me this error:

**{"error":"Content-Type header [application/x-www-form-urlencoded] is not supported","status":406}**

I also tried the following command with a Content-Type header, but it still failed:

curl --header "Content-Type:application/json"  -XPOST 'http://localhost:9200/twitter/tweet/1' --data-binary "@/Users/jz/Documents/elasticsearch-5.3.0/twitter.json" 

The error message is:

{"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"failed to parse"}],"type":"mapper_parsing_exception","reason":"failed to parse","caused_by":{"type":"not_x_content_exception","reason":"Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes"}},"status":400}: {"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"failed to parse"}],"type":"mapper_parsing_exception","reason":"failed to parse","caused_by":{"type":"not_x_content_exception","reason":"Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes"}},"status":400}

I tried changing --data-binary to -d too, but that failed again:

curl -XPOST 'http://localhost:9200/twitter/tweet/1' -d "@/Users/jz/Documents/elasticsearch-5.3.0/twitter.json" 

The error message is the same as the one from --data-binary.

Updated:

Since curl was giving me so much trouble, I decided to use Python's elasticsearch library instead. After successfully connecting to the local host, I used something like this to index the sample data:

import json
from elasticsearch import Elasticsearch

es = Elasticsearch([{"host": "localhost", "port": 9200}])
with open('sample0.json') as json_data:
    json_docs = json.load(json_data)  # expects the whole file to be one JSON array
    for json_doc in json_docs:
        my_id = json_doc.pop('_id', None)  # use the tweet's own _id if present
        es.index(index='testdata', doc_type='generated', id=my_id,
                 body=json.dumps(json_doc))

Error:

    C:\Anaconda2\lib\json\decoder.pyc in raw_decode(self, s, idx)
        378         """
        379         try:
    --> 380             obj, end = self.scan_once(s, idx)
        381         except StopIteration:
        382             raise ValueError("No JSON object could be decoded")

    ValueError: Expecting , delimiter: line 1 column 2241 (char 2240)
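
For reference, a quick way to inspect the text around the reported offset (char 2240) is something like this; json and re are from the standard library, and sample0.json is the same file as above:

    import json
    import re

    with open('sample0.json') as f:
        raw = f.read()

    try:
        json.loads(raw)
        print('file parses cleanly')
    except ValueError as e:
        # The json module reports the failing offset in its message
        # ("char 2240"); pull it out and print the surrounding text.
        m = re.search(r'char (\d+)', str(e))
        if m:
            pos = int(m.group(1))
            print(raw[max(0, pos - 40):pos + 40])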

Can someone please give me some guidance? Thanks!

Sunshine

2 Answers


You can convert it to an array by joining the lines with a comma and surrounding the result with square brackets ([]), like this:

'[' + s.join(',') + ']'

If you need to validate them separately, pass s[i] to the JSON.parse function rather than sTemp.
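
In Python terms (since the update to the question uses Python), the same idea looks roughly like this; a minimal sketch assuming one JSON object per line, with the file name illustrative:

    import json

    # Read one raw JSON object per line, as in the original twitter.json.
    with open('twitter.json') as f:
        lines = [line.strip() for line in f if line.strip()]

    # Join the lines with commas and wrap in brackets to form one array.
    tweets = json.loads('[' + ','.join(lines) + ']')

    # Or validate each line separately to locate the bad ones:
    for i, line in enumerate(lines):
        try:
            json.loads(line)
        except ValueError as e:
            print('line %d is invalid: %s' % (i, e))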

Update

If you need to create input to pass to Elasticsearch, you should convert your list of JSON objects to a file in the following format:

{"index":{"_index":"my_index","_type":"tweet","_id":null}}
{"created_at":"Sun Apr 16 00:00:10 +0000 2017","id":1, ... }
{"index":{"_index":"my_index","_type":"tweet","_id":null}}
{"created_at":"Sun Apr 16 00:00:10 +0000 2017","id":2, ... }

and pass its content to Elasticsearch, so that each document line is preceded by the action line that tells Elasticsearch to index it:

{"index":{"_index":"my_index","_type":"tweet","_id":null}}

Take a look at the Bulk API documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
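
As a sketch, that preprocessing step could be done in Python like this (the file, index, and type names are placeholders; json is the standard library):

    import json

    # Convert the [ {...}, {...} ] array file into the bulk format:
    # an action line, then the document itself, one JSON object per line.
    with open('twitter.json') as src, open('bulk.json', 'w') as dst:
        for tweet in json.load(src):
            dst.write(json.dumps({'index': {'_index': 'my_index',
                                            '_type': 'tweet'}}) + '\n')
            dst.write(json.dumps(tweet) + '\n')

The resulting bulk.json can then be POSTed to the _bulk endpoint with curl --data-binary, as shown in the Bulk API documentation above.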

Random
  • Hi @Random, thanks for your prompt answer! Very helpful. I now changed my text to be something like [ {...}, {...}], and it worked. However, another problem came up. Please see my updated question! – Sunshine Apr 23 '17 at 09:09
  • It seems that you have a lot of separate issues here. If you need to put your mapping, take a look here first https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-mapping.html, if you need to insert the data, take a look at the Bulk API. Or specify in more detail what the current problem is – Random Apr 24 '17 at 08:18
  • thanks for your comment. Yes, the problem is with the tweets. Each line is a nested JSON object. Since the curl command line is giving so many problems, I wonder if it is possible to use the elasticsearch library in Python instead? Please see my updated post above. Thanks again! – Sunshine Apr 25 '17 at 07:42
  • It's ok to use curl. But you need to preprocess your file to get the correct input for ElasticSearch. See updated answer – Random Apr 25 '17 at 10:35

You could provide a header to indicate the request is in JSON format:

curl --header "Content-Type:application/json" ...

Alternatively, you can use "-d" instead of "--data-binary".

And as explained here: https://stackoverflow.com/a/35213617/5520709, embed your array in {"root":[...]} to get a valid JSON object.

Please note this will index your whole JSON as a single document, which is perhaps not what you want. If you want to index a document per tweet, you may want to use the Bulk API: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
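
A quick sketch of that wrapping step in Python (file names assumed):

    import json

    # Wrap the tweet array in an object so the top level is a single
    # JSON object, which is what the index API expects for one document.
    with open('twitter.json') as src:
        tweets = json.load(src)

    with open('wrapped.json', 'w') as dst:
        json.dump({'root': tweets}, dst)

wrapped.json can then be sent with the --data-binary curl command from the question.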

Community
  • Hi Damien, thanks for your answers but neither worked in my terminal. I posted the error message and updated my question above. Please have a look :) – Sunshine Apr 24 '17 at 00:01
  • Ok, for this new error (not_x_content_exception), you'll find a solution here: http://stackoverflow.com/a/35213617/5520709 (embed your array with {root:[...]}) – Damien Ferey Apr 24 '17 at 03:15
  • thanks Damien. Just tried but still didn't work: {"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"failed to parse"}],"type":"mapper_parsing_exception","reason":"failed to parse","caused_by":{"type":"json_parse_exception","reason":"Unexpected character ('r' (code 114)): was expecting double-quote to start field name\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@72bbeb6b; line: 1, column: 3]"}},"status":400} – Sunshine Apr 24 '17 at 04:10
  • Ok, try with {"root":[...]} – Damien Ferey Apr 24 '17 at 05:25
  • (please note this will index your whole json as à single document, which us perhaps not what you want. If you want to index a document per tweet, you may want to use bulk API : https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html) – Damien Ferey Apr 24 '17 at 05:32
  • thanks for your advice! I tried the root approach too but it is still not working. I just updated my post using Python. Please have a look. Cheers – Sunshine Apr 25 '17 at 08:00