
I am a relatively new user of Python. What is the best way of parsing and processing a CSV and loading it into a local Postgres Database (in Python)?

It was recommended to me to use the CSV library to parse and process the CSV. In particular, the task at hand says:

The data might have errors (some rows may not be parseable), the data might be duplicated, the data might be really large.

Is there a reason why I wouldn't be able to just use pandas.read_csv here? Does using the CSV library make parsing and loading it into a local Postgres database easier? In particular, if I just use pandas, will I run into problems if rows are unparseable, if the data is really big, or if the data is duplicated? (For the last bit, I know that pandas offers some relatively clean solutions for de-duping.)

I feel like pandas.read_csv and pandas.to_sql can do a lot of work for me here, but I'm not sure if using the CSV library offers other advantages.

Just in terms of speed, this post: https://softwarerecs.stackexchange.com/questions/7463/fastest-python-library-to-read-a-csv-file seems to suggest that pandas.read_csv performs the best?
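To make it concrete, this is roughly the pipeline I have in mind (the connection string, file name, and table name are placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; swap in your own credentials and database
engine = create_engine("postgresql://user:password@localhost:5432/mydb")

df = pd.read_csv("data.csv")   # parse the whole file in one go
df = df.drop_duplicates()      # pandas' built-in de-duplication
df.to_sql("my_table", engine, if_exists="append", index=False)
```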

Vincent
  • "Best" is undefined so this is gonna get closed as opinion-based as it's now. – ivan_pozdeev Mar 14 '16 at 02:20
  • Sorry, I guess it's not the best title, but the real question lies more in the comments. For example, does using the `csv` library give me more ability to handle very large datasets than pandas? Or does using the `csv` library allow for easier handling of duplicates? – Vincent Mar 14 '16 at 02:23
  • A cornerstone issue to your dilemma may be finding out what your advisor meant when they recommended to "use a CSV library". They may very well have just urged you not to try parsing it by hand. – ivan_pozdeev Mar 14 '16 at 02:44

2 Answers


A quick googling didn't reveal any serious drawbacks in pandas.read_csv regarding its functionality (parsing correctness, supported types, etc.). Moreover, since you appear to be using pandas to load the data into the DB as well, reading directly into a DataFrame is a huge boost in both performance and memory use (no redundant copies).

There are memory issues only for very large datasets, but these are not the library's fault. How to read a 6 GB csv file with pandas has instructions on how to process a large .csv in chunks.
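A minimal sketch of the chunked approach (the file name, chunk size, and process() are just stand-ins for your own code):

```python
import pandas as pd

# chunksize makes read_csv return an iterator of DataFrames,
# so only ~100000 rows are held in memory at any time
for chunk in pd.read_csv("big_file.csv", chunksize=100000):
    process(chunk)  # stand-in for your per-chunk cleanup/insert logic
```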

Regarding "The data might have errors", read_csv has a few facilities like converters, error_bad_lines and skip_blank_lines (specific course of action depends on if and how much corruption you're supposed to be able to recover).

ivan_pozdeev

I had a school project just last week that required me to load data from a CSV and insert it into a Postgres database. So believe me when I tell you this: it's way harder than it has to be unless you use pandas. The issue is sniffing out the data types. Okay, if your database columns are all a string datatype, forget what I said, you're golden. But if you have a CSV with an assortment of datatypes, either you get to sniff them yourself or you can use pandas, which does it efficiently and automatically. Plus pandas has a nifty to_sql method which can easily be pointed at Postgres via a SQLAlchemy connection, too.
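Something like this minimal sketch is all it takes (the file, table, and connection details are placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine

df = pd.read_csv("grades.csv")   # pandas sniffs the column dtypes for you
print(df.dtypes)                 # inspect what it inferred

# Placeholder connection string; substitute your own user/password/database
engine = create_engine("postgresql://user:password@localhost:5432/school_db")
df.to_sql("grades", engine, if_exists="replace", index=False)
```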

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html

Mike Lane