
I have to read an online CSV file into a Postgres database, and in that context I am having some problems reading the online CSV file properly.

If I just import the file, it is read as bytes, so I have to decode it. During the decoding, however, it seems that the entire file is turned into one long string.

# Libraries
import csv
import urllib.request

# Function for importing csv from url
def csv_import(url):
    url_open = urllib.request.urlopen(url)
    csvfile = csv.reader(url_open.read().decode('utf-8'), delimiter=',')
    return csvfile

# Reading file
p_pladser = csv_import("http://wfs-kbhkort.kk.dk/k101/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=k101:p_pladser&outputFormat=csv&SRSNAME=EPSG:4326")

When I try to read the imported file line by line, it only reads one character at a time.

for row in p_pladser:
    print(row)
    break

['F']

Can you help me identify where it goes wrong? I am using Python 3.6.

EDIT: Per request, my solution in R:

# Loading library
library(RPostgreSQL)

# Reading dataframe
p_pladser = read.csv("http://wfs-kbhkort.kk.dk/k101/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=k101:p_pladser&outputFormat=csv&SRSNAME=EPSG:4326", encoding = "UTF-8", stringsAsFactors = FALSE)

# Creating database connection
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "secretdatabase", host = "secrethost", user = "secretuser", password = "secretpassword")

# Uploading dataframe to postgres database
dbWriteTable(con, "p_pladser", p_pladser, append = TRUE, row.names = FALSE, encoding = "UTF-8")

I have to upload several tables of 10,000 to 100,000 rows each, and in total it takes 1-2 seconds in R to upload them all.
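For reference, the closest Python equivalent I have tried so far is pandas combined with SQLAlchemy's to_sql (sketched below; the connection string is a placeholder), but the upload step is extremely slow or times out:

import pandas as pd
from sqlalchemy import create_engine

# Reading the online CSV into a dataframe works fine
p_pladser = pd.read_csv("http://wfs-kbhkort.kk.dk/k101/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=k101:p_pladser&outputFormat=csv&SRSNAME=EPSG:4326")

# Uploading the dataframe to postgres (placeholder credentials) is where it times out
engine = create_engine("postgresql://secretuser:secretpassword@secrethost/secretdatabase")
p_pladser.to_sql("p_pladser", engine, if_exists="append", index=False)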

Esben Eickhardt

2 Answers


csv.reader expects a file-like object as its argument, not a string. You have two options here:

  • either you read the data into a string (as you currently do) and then use an io.StringIO to build a file-like object around that string:

    import io

    def csv_import(url):
        url_open = urllib.request.urlopen(url)
        csvfile = csv.reader(io.StringIO(url_open.read().decode('utf-8')), delimiter=',')
        return csvfile
    
  • or you use an io.TextIOWrapper around the binary stream provided by urllib.request:

    import io

    def csv_import(url):
        url_open = urllib.request.urlopen(url)
        csvfile = csv.reader(io.TextIOWrapper(url_open, encoding='utf-8'), delimiter=',')
        return csvfile
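
Either way, the returned reader can then be iterated row by row; a short usage sketch with the URL from the question:

    rows = csv_import("http://wfs-kbhkort.kk.dk/k101/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=k101:p_pladser&outputFormat=csv&SRSNAME=EPSG:4326")
    for row in rows:
        print(row)  # now a list of column values, not a single character
        break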
    
Serge Ballesta

How about loading the CSV with pandas!

import pandas as pd
df = pd.read_csv("http://wfs-kbhkort.kk.dk/k101/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=k101:p_pladser&outputFormat=csv&SRSNAME=EPSG:4326")
print(df.columns)

OR if you have the CSV downloaded on your machine, then directly:

df = pd.read_csv("<path_to_csv>")

Ok! You may consider passing delimiter and quotechar arguments to csv.reader, because the CSV contains quotes as well! Something like this,

import csv

with open('p_pladser.csv') as f:
    rows = csv.reader(f, delimiter=',', quotechar='"')
    for row in rows:
        print(row)
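
If you stay with pandas, the same hints can be passed to read_csv; a small sketch for illustration (sep and quotechar are standard read_csv parameters, and pandas handles quoted fields by default):

df = pd.read_csv("p_pladser.csv", sep=',', quotechar='"')
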
  • That is the first thing I did, and it works like a charm. However, when I have to upload the data.frame to a postgresql database it is extremely slow unless I convert the data.frame back into csv format. I am used to working in R, where this would be a two-liner problem: 1) read the data into a data.frame, 2) export the data.frame to postgresql. There must be some elegant way in Python too, but I have not been able to find it yet. – Esben Eickhardt Nov 27 '17 at 14:16
  • @EsbenEickhardt pandas has an option to convert a dataframe back to csv. [Here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html#pandas.DataFrame.to_csv) – optimalic Nov 27 '17 at 14:24
  • @EsbenEickhardt what do you mean when you say you have to upload vs export the df to postgresql? Does R upload that for you inside your `export` function? If you could provide a bit more detail, we might be able to get further with this; the thing is, what caused your slowness in uploading/exporting? To me, uploading (remote) is quite different from exporting (local). – stucash Nov 27 '17 at 14:25
  • @MushroomMauLa I cannot make the conversion to csv work. dataframe.to_csv just results in another pandas dataframe it seems. – Esben Eickhardt Nov 27 '17 at 14:32
  • @stucash It is to a remote database, for which I have been using sqlalchemy and the function to_sql(). It times out, while the dbWriteTable() function in R both creates the schemas and uploads a dataframe in milliseconds. – Esben Eickhardt Nov 27 '17 at 14:36
  • I guess this link has a similar solution: https://stackoverflow.com/questions/24189150/pandas-writing-dataframe-to-other-postgresql-schema – Vivek Harikrishnan Nov 27 '17 at 14:37
  • @EsbenEickhardt you have to provide a file path/name dataframe.to_csv(path_or_buf=file_name, sep=',') – optimalic Nov 27 '17 at 14:39
  • @MushroomMauLa I am not sure what you mean by that. I have imported the dataframe into pandas with the name "df", and then tried to convert the data.frame to csv by writing "df.to_csv". Is this wrong? – Esben Eickhardt Nov 27 '17 at 14:44
  • @VivekHarikrishnan I used that exact solution, and it took hours to export my dataframes to the remote database. The same in R took between one and two seconds. Unfortunately our external partners who have to maintain the code insist on it being written in Python. – Esben Eickhardt Nov 27 '17 at 14:46
  • @EsbenEickhardt you are right in doing that, but with brackets: `df.to_csv(csv_file_path_to_export_to, [optional_args])`. You will then see that, at the location you specified, a csv file has popped up. – stucash Nov 27 '17 at 14:47
  • No, but you have to save the output of df.to_csv to a file. Maybe there is a better option; I'm not really familiar with pandas. – optimalic Nov 27 '17 at 14:47
  • So I have to import the file from the url into a pandas dataframe, then export the dataframe to a local csv file, then read that file into Python again, and then export that to the database? It seems inefficient. Can one not save the csv file into an object rather than having to write it to a file and then import the file again? – Esben Eickhardt Nov 27 '17 at 14:49
  • @EsbenEickhardt you can. I can't write it here, so I'll give you a link: https://stackoverflow.com/questions/23231605/convert-pandas-dataframe-to-csv-string; this will retain the csv as a string object while you do `to_csv()`. You can then manipulate it at a later stage. – stucash Nov 27 '17 at 14:53
  • @EsbenEickhardt if you haven't tried already, you could also chunkify your sql to prevent the timeout: `df.to_sql('table', engine, chunksize=300000)`. If this works reasonably fast, you could go with this solution. Otherwise, I think your best bet (without spending too much on this) is to convert it back to csv and manipulate it; pandas can do this conversion in milliseconds. – stucash Nov 27 '17 at 14:55
  • @stucash My csv files are only between 10,000 and 100,000 rows, so it shouldn't be necessary to chunk them. In other threads I have read that people have similar problems with the speed of to_sql. That is why I want to just read one row at a time and then use an SQL query to export the data. I will, however, look into your other suggestion with the IO library. Thanks. – Esben Eickhardt Nov 27 '17 at 14:59
  • @EsbenEickhardt no problem. I am just thinking you could use the chunksize to prevent the timeout and to speed things up (given your table size, it's likely). It is there as a potential solution (especially to prevent the timeout), so I didn't come at it from the perspective of it being necessary or not. As long as it quacks, it is a duck :) If it is too slow, you might give it a go :) but it is also true that to_sql is not optimal. Good luck. – stucash Nov 27 '17 at 15:05
  • @EsbenEickhardt in case you are still at this, here's a potential solution to the slowness: https://stackoverflow.com/questions/31997859/bulk-insert-a-pandas-dataframe-using-sqlalchemy From what I read, the slowness was caused by one-by-one insertion; bulk insertion will in most cases speed things up very noticeably. – stucash Nov 27 '17 at 15:13
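
For reference, a minimal sketch of the approach the comments converge on: dump the dataframe to an in-memory CSV buffer and bulk-load it with Postgres COPY via psycopg2 (connection details are placeholders, and the target table is assumed to already exist with matching columns):

import io
import pandas as pd
import psycopg2

# Read the online CSV into a dataframe
df = pd.read_csv("http://wfs-kbhkort.kk.dk/k101/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=k101:p_pladser&outputFormat=csv&SRSNAME=EPSG:4326")

# Write the dataframe to an in-memory CSV buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False, header=False)
buffer.seek(0)

# Bulk-load the buffer with COPY (placeholder credentials; table must already exist)
conn = psycopg2.connect(dbname="secretdatabase", host="secrethost",
                        user="secretuser", password="secretpassword")
with conn, conn.cursor() as cur:
    cur.copy_expert("COPY p_pladser FROM STDIN WITH (FORMAT csv)", buffer)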