
I'm looking to index a CSV file into Elasticsearch without using Logstash. I am using the elasticsearch-dsl high-level library.

Given a CSV with header for example:

name,address,url
adam,hills 32,http://rockit.com
jane,valleys 23,http://popit.com

What would be the best way to index all the data by its fields? Eventually I'd like each row to end up as a document like this:

{
    "name": "adam",
    "address": "hills 32",
    "url": "http://rockit.com"
}
bluesummers
  • It looks like `elasticsearch-dsl` depends on the `elasticsearch-py` library. Check out [elasticsearch-py's docs](https://elasticsearch-py.readthedocs.io/en/master/#example-usage) for an example of how to insert a document. –  Jan 10 '17 at 17:14
  • The question is not about indexing individual documents, but about a technique for indexing entire .csv files into elasticsearch – bluesummers Jan 10 '17 at 19:06

2 Answers


This kind of task is easier with the lower-level elasticsearch-py library:

from elasticsearch import helpers, Elasticsearch
import csv

es = Elasticsearch()

with open('/tmp/x.csv') as f:
    reader = csv.DictReader(f)  # each row becomes a dict keyed by the CSV header
    helpers.bulk(es, reader, index='my-index', doc_type='my-type')
Ashish Gupta
Honza Král
  • This is the kind of answer I was looking for, I will try it in a few hours and respond accordingly, thanks! – bluesummers Jan 12 '17 at 06:03
  • Exactly the Pythonic and elegant solution I was looking for - Thanks! – bluesummers Jan 12 '17 at 10:17
  • What about the mapping? How do you make it so that the type of each field is known? – Souad May 09 '17 at 10:23
  • minor detail in your snippet: typo in `Elasicsearch` (should be `Elasticsearch`) – Montenegrodr May 09 '17 at 13:12
  • This solution is pretty! Is there a similar way to directly write a csv to elasticsearch using Scala, without Logstash? – Srinathji Kyadari Mar 28 '18 at 12:11
  • How would you add an integer index to the rows of the .csv file that we are uploading in bulk? As of now the id is a string, e.g. "yxO4XmQBpSmujbJTJn9n", so in my case the document lives at localhost:9200/csvfile/csv/yxO4XmQBpSmujbJTJn9n. How would you get /csvfile/csv/1, where 1 corresponds to a row in the csv file? – shinz4u Jul 03 '18 at 08:42
  • @shinz4u just wrap the reader in something that adds the desired id as an `_id` key in each dictionary; it will then be picked up by elasticsearch (see the sketch after these comments) – Honza Král Jul 04 '18 at 13:48
  • When loading a large csv file it always times out: HTTPConnectionPool(host='localhost', port=9200): Read timed out – seamaner May 29 '19 at 04:21
  • @seamaner that just means that elasticsearch cannot process the data you are sending fast enough. You can increase the timeout (10s by default) by passing `timeout=N` to `Elasticsearch` when instantiating it (where N > 10) – Honza Král May 30 '19 at 10:32
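A minimal sketch of what the last few comments suggest, reusing the `/tmp/x.csv`, `my-index` and `my-type` names from the answer above; using the row number as `_id` and passing `timeout=30` are illustrative choices, not part of the original answer:

from elasticsearch import helpers, Elasticsearch
import csv

# raise the read timeout from the 10 s default, as suggested for large files
es = Elasticsearch(timeout=30)

def with_ids(rows):
    # wrap the DictReader so each row carries an explicit _id (its row number)
    for i, row in enumerate(rows, start=1):
        row['_id'] = i
        yield row

with open('/tmp/x.csv') as f:
    reader = csv.DictReader(f)
    helpers.bulk(es, with_ids(reader), index='my-index', doc_type='my-type')

The bulk helper treats the `_id` key as document metadata and indexes the remaining keys as the document body, so the first row ends up at /my-index/my-type/1.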

If you want to build an Elasticsearch index from a .tsv/.csv file with strict types and a model, for better filtering, you can do something like this:

from elasticsearch.helpers import bulk
from elasticsearch_dsl import DocType, Text
from elasticsearch_dsl.connections import connections

class ElementIndex(DocType):
    # one Text field per CSV column (column names taken from the question's CSV)
    name = Text()
    address = Text()
    url = Text()

    class Meta:
        index = 'index_name'

def indexing(row):
    # turn one CSV row (a dict) into a bulk action
    obj = ElementIndex(
        name=str(row['name']),
        address=str(row['address']),
        url=str(row['url'])
    )
    return obj.to_dict(include_meta=True)

def bulk_indexing(result):
    # register a default connection and create the index with the mapping above
    es = connections.create_connection(hosts=['localhost'])
    ElementIndex.init()

    # here `result` is your iterable of dicts with the data from the source file
    bulk(client=es, actions=(indexing(row) for row in result))
    es.indices.refresh(index='index_name')
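
For completeness, a usage sketch tying this back to the question's CSV; the `/tmp/x.csv` path is borrowed from the accepted answer and is only an assumption:

import csv

# read the CSV into a list of dicts, then index it with the helpers above
with open('/tmp/x.csv') as f:
    result = list(csv.DictReader(f))

bulk_indexing(result)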
Alex