How to parse a WordPress CSV export using Python

Question

I need to import content from WordPress into Plone, a Python-based CMS, and I have a dump of the posts table as a huge plain CSV file using ";" as the delimiter.

The problem is the standard CSV reader from the csv module is not smart enough to parse the HTML content inside a row (the post_content field).

For instance, when the parser encounters something like <p>&nbsp;</p>, it interprets the semicolon as a field delimiter, and I end up with more items than fields and with fields holding the wrong content.

Is there any other option to solve this kind of issue? Processing the row with a regex seems pretty scary to me.
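The failure described above can be reproduced with a minimal sketch. The column names and the sample row are made up for illustration; the real export will have more columns.

```python
import csv
import io

# Hypothetical two-line export: a header and one post whose content
# contains an HTML entity ending in a semicolon.
sample = "ID;post_title;post_content\n1;Hello;<p>&nbsp;</p>\n"

rows = list(csv.reader(io.StringIO(sample), delimiter=";"))
header, row = rows

# The header has 3 fields, but the entity's semicolon splits the
# content in two, so the data row comes back with 4 items.
print(len(header), len(row))
print(row)
```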

Hum. Would it be okay if you first converted all the HTML into spaces and then tried csv.reader? — NightShadeQueen, Jul 15 '15 at 23:04

hvelarde · Accepted Answer · 2015-07-17T13:07:23.697

2

After some additional research, I discovered the excel-tab dialect by reading the text of PEP 305 (which proposed the addition of the csv module to Python); it is mentioned in the module documentation, but I hadn't noticed it at first.

I then re-exported the posts using a tab as a delimiter (\t).


I made a test reading a batch of 1,000 rows and found no errors at all.
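A check along these lines could look like the sketch below. It assumes Python 3 and a tab-delimited export with a header row; the function name check_batch and the sample data are hypothetical, not part of the original test.

```python
import csv
import io
from itertools import islice


def check_batch(stream, batch_size=1000):
    """Return the header and the line numbers of rows whose field
    count differs from the header, looking at up to batch_size rows."""
    reader = csv.reader(stream, dialect="excel-tab")
    header = next(reader)
    bad_lines = []
    for row in islice(reader, batch_size):
        if len(row) != len(header):
            bad_lines.append(reader.line_num)
    return header, bad_lines


# Made-up sample: semicolons inside the HTML no longer matter,
# because the delimiter is now a tab.
sample = (
    "ID\tpost_title\tpost_content\n"
    "1\tHello\t<p>&nbsp;</p><p>a; b</p>\n"
    "2\tBroken row\n"
)
header, bad_lines = check_batch(io.StringIO(sample))
print(header, bad_lines)
```

For a real file, the stream would come from something like open(path, newline="", encoding="utf-8"); the encoding is an assumption and depends on how the export was produced.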

edited Jul 17 '15 at 13:07

answered Jul 16 '15 at 17:11



I would have thought that exporting with: `fields enclosed by "` would have solved the issue you mentioned, whatever delimiter you were using – Danimal Jul 17 '15 at 09:43
In my last test, we should not mark "Remove CRLF characters within fields" to make transmogrify.wordpress detect paragraphs – rodfersou Aug 14 '17 at 16:14
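The suggestion in the comments about enclosing fields in double quotes can be sketched as follows, assuming the export wraps the text fields in " and leaves line breaks inside them. The sample row is made up, and any double quotes inside the HTML itself would need to be doubled or escaped by the export tool for this to work.

```python
import csv
import io

# Hypothetical quoted export: the content field holds both a
# semicolon and a line break, which is fine inside quotes.
sample = '1;"Hello";"<p>a; b</p>\n<p>c</p>"\n'

reader = csv.reader(io.StringIO(sample), delimiter=";", quotechar='"')
row = next(reader)

# Three fields come back, with the line break preserved in the content.
print(len(row))
print(repr(row[2]))
```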

score 1 · Answer 2 · answered Jul 16 '15 at 01:02

The csv module provides the escapechar format parameter, which allows you to escape the delimiter (which you have set to a semicolon). If you pass escapechar='\\' in the call to csv.reader(), you can then replace every \ in your CSV file with \\, and replace &nbsp; with &nbsp\; (using a text editor's find/replace option).
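As a rough sketch of that approach, the same replacements can be done in Python instead of a text editor. The sample line is made up, and only &nbsp; is handled here; any other entity or stray semicolon in the content would need the same treatment.

```python
import csv
import io

# Hypothetical raw line from the semicolon-delimited export.
raw = "1;Hello;<p>&nbsp;</p>\n"

# Double existing backslashes first, then escape the entity's semicolon.
escaped = raw.replace("\\", "\\\\").replace("&nbsp;", "&nbsp\\;")

reader = csv.reader(io.StringIO(escaped), delimiter=";", escapechar="\\")
row = next(reader)

# The escaped semicolon is kept as literal text, so the row has 3 fields.
print(row)
```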

score 1 · Answer 3 · answered Jul 17 '15 at 13:09


Another option, for smaller sites, could be using pywordpress, a pythonic interface to WordPress XML-RPC API.

hvelarde
