Import unstructured Excel file to database

Question

I have to import data from Excel files into the database. The structure of these files does not match with the structure of the model, so I guess I need to do some kind of data-manipulation to arrange the tuples accordingly.

The files I have to import look like this: excel file

The django-model has the following attributes: Country, Commodity, Year, Value

So what would be the best way to read the data, arrange it in the correct structure and import it into the database (preferably with automatic updates of existing tuples).

I spent a lot of time researching the existing python- and django-libraries for this requirements (like PyExcel, Pandas, Django-Excel, Django-Import-Export), but I couldn't really find out which is the best and if it supports the rearrangement of the data before importing.

I hope you can give me some recommendations and solutions for this task.

What is "the database"? Are you just uploading files to a database, or do you have to do some kind of analysis? What have you tried so far? — Evan, Dec 15 '17 at 18:53
@Evan I just want to get the data from the files into the database, no analysis needed. But as mentioned above, I would need to adapt the structure manually, because the datasets are not separeted for each year. Basically I think that some library wich allows me to read the excel data into a datastructure (array or dict), then edit/adapt this array and then input it into the db would be best. Unfortunately I didn't find information about this from any of the mentioned libraries. — thimma, Dec 16 '17 at 09:52
From what you've said, I'd recommend using the pandas library, and its read_excel() function. You can create a data frame, manipulate it, and output it to CSV or a database (e.g., SQL), although we can't help you if you won't tell us what your database is. — Evan, Dec 16 '17 at 15:16
@Evan Alright I'll look that up. The database is PostgreSQL with django models. Does the db have am impact on how I would implement the functionality? — thimma, Dec 16 '17 at 15:52

score 1 · Answer 1 · answered Dec 16 '17 at 17:13

Okay, a couple of things. Note that I am not an expert in any of them.

First, if your data is an Excel files, it's structured. The nature of a table is to structure data. Unstructured data would be something like a text file or a directory of text files.

Second, please read this and follow it when formulating questions: https://stackoverflow.com/help/mcve

Third, SO is not here to write all of your code for you. Per your question, you've yet to try anything, which is why your question was downvoted.

That preamble aside, you can use Python's pandas library to import Excel files, manipulate them, and upload them to SQL databases.

import pandas as pd

df = pd.read_excel('tmp.xlsx')
df2 = pd.melt(df, id_vars = ['Country', 'Commodity'], var_name = 'Year', value_name = 'Value')
df2

input:

    Country Commodity  2009  2010  2011  2012  2013
0  Austria    Com. 1     1     1     1     1     1
1  Austria    Com. 2     2     2     2     2     2
2  Belgium    Com. 1     3     3     3     3     3
3   France    Com. 1     4     4     4     4     4

output:

   Country Commodity  Year  Value
0  Austria    Com. 1  2009      1
1  Austria    Com. 2  2009      2
2  Belgium    Com. 1  2009      3
3   France    Com. 1  2009      4
4  Austria    Com. 1  2010      1

If you have many Excel files, you can use glob or os.walk to iterate through/over a directory, and import some or all of the Excel files.

You can use pandas DataFrame.to_sql to upload your data to an SQL database, which requires defining the database connection (server address, login/pw, etc.).

Thanks, working fine so far. Unfortunately I've stumbled on another problem: The excel-files have some information at the end of them wich I want to skip. The number of these lines differs from file to file though. The beginning of this information is identified by a row with a specific keyword. I researched a lot about pandas's indexing and how to skip/delete certain rows, but I couldn't find information on the following: How to skip/delete all rows starting with a specific one? — thimma, Dec 17 '17 at 16:00
That is the kind of specific question which is well-suited for its own thread Stack Overflow, especially if you can't find an answer amongst existing questions. That said, here are a couple which I found with a few seconds of Googling: https://chrisalbon.com/python/pandas_selecting_rows_on_conditions.html https://stackoverflow.com/questions/27275236/pandas-best-way-to-select-all-columns-starting-with-x https://www.reddit.com/r/learnpython/comments/5zn29l/pandas_how_to_best_search_for_a_keyword_within/ — Evan, Dec 17 '17 at 19:21

score 0 · Answer 2 · answered Mar 28 '18 at 07:26

This is a example for insert with one foreign key.

    models.py

        from django.db import models

        class Table1(models.Model):
            system = models.CharField(max_length=383)

        class Table2(models.Model):
            name = models.CharField(max_length=383)
            system = models.ForeignKey(Table1, blank=True, null=True, on_delete=models.CASCADE)

You need create a modelResource and override before_import_row for insert data in table1. And you can change name for fields and create tables and customize. 'column_name' is the name of field in excel data, and attribute is name of field in database. And in method nefore_import you can insert data in tables and asign foreignkeys.

And set ModelResource to admin class.

    admin.py
        from .models import Table1,Table2
        from import_export import resources,widgets
        from import_export.fields import Field
        from django.contrib import admin


        class Table2Resource(resources.ModelResource):
            name = Field(column_name='name',attribute='name')
            system_id = Field(column_name='system', attribute='system_id', widget=widgets.ForeignKeyWidget(Table1))

            class Meta:
                model = Table2
                skip_unchanged = True
                fields = ('name','system_id')

            def before_import_row(self,row, **kwargs):
                value = row['system']
                obj = Table1.objects.create(system = value) #create object place
                row['system'] = obj.id # update value to id ob new object


        @admin.register(Table2)
        class Table2Admin(ImportExportModelAdmin,admin.ModelAdmin):
            resource_class = Table2Resource

This is all.

Import unstructured Excel file to database

2 Answers2