
TL;DR: How can I json.loads a string that was dumped with a custom separator, without first replacing the separator with a comma?

I have a Spark DataFrame that I want to write to CSV, and for that I need to JSON-encode every row in it.

So I have the following PySpark Row:

Row(type='le', v=Row(occ=False, oov=False, v=True), x=966, y=340)

I want to make the row ready for CSV. If I write it out with a plain json.dumps, the line contains many commas, and the CSV reader then fails to parse the file (it sees far more columns than expected).
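For context, the comma clash only bites when the JSON string lands in the file unquoted; a standards-compliant CSV layer quotes such fields. Here is a minimal stdlib sketch of a round trip that survives the embedded commas (the file name out.csv is made up):

import csv
import json

row = ['le', [False, False, True], 966, 340]

# csv.writer quotes the field (and doubles its embedded quotes) because it
# contains commas, so the column structure survives the embedded JSON.
with open('out.csv', 'w', newline='') as f:
    csv.writer(f).writerow([json.dumps(row)])

with open('out.csv', newline='') as f:
    cell = next(csv.reader(f))[0]

print(json.loads(cell))  # ['le', [False, False, True], 966, 340]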

To avoid the commas, I instead call json.dumps with separators=("| ", ": "), and I get the string s:

'["le"| [false| false| true]| 966| 340]'

Now I'm able to do:

json.loads(s.replace('|',','))

And I receive the desired output:

['le', [False, False, True], 966, 340]

Now comes the problematic part:

I write it to CSV. When I read it back, before trying json.loads, I receive:

'[\\le\\"| [false| false| true]| 966| 340]"'

The desired output is, as before:

['le', [False, False, True], 966, 340]

But I can't reach it.

When I run json.loads on it, I get:

json.decoder.JSONDecodeError: Expecting value: line 1 column 2 (char 1)
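For what it's worth, those backslashes look like the Spark CSV writer's quote escaping (its default escape character is \) colliding with a reader that expects RFC-4180 doubled quotes. If the reading side is, say, pandas, pointing it at that escape character should hand back the original string instead of the mangled one; a hedged sketch, with the file name assumed:

import pandas as pd

# Assumes the file was produced by Spark's CSV writer with its defaults
# (quote='"', escape='\\'); doublequote=False makes pandas honor escapechar.
df = pd.read_csv('rows.csv', escapechar='\\', doublequote=False)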

When I try replacing the '|' with ',':

s = s.replace('|', ',')
s
Out: '[\\le\\", [false, false, true], 966, 340]"'
json.loads(s)
json.decoder.JSONDecodeError: Expecting value: line 1 column 2 (char 1)

(Column 2 is the leftover backslash: after the replace the string still starts with [\le\", and a bare backslash is not a valid JSON value.)

This post is an attempt to get around a previous problem to which I didn't find an answer: Convert multiple array of structs columns in pyspark sql. A solution to this problem would help me there as well.

Bottom line: this is the string I need to parse:

'[\\le\\"| [false| false| true]| 966| 340]"'

How can I do it?
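One hedged suggestion, rather than un-mangling the string after the fact: have the Spark writer escape quotes RFC-4180 style (doubled rather than backslashed), so standard CSV readers round-trip the field unchanged. df_out and the output path below are assumptions:

# Spark's CSV writer accepts quote/escape options; setting escape to '"'
# writes embedded quotes as "" instead of \", which csv.reader, pandas,
# and friends understand out of the box.
df_out.write.option('quote', '"').option('escape', '"').csv('/tmp/rows_csv')

With that in place, the pipe-separator workaround should not be needed at all: plain json.dumps output should survive the round trip.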

jonb
  • Why are you using `separators` to generate invalid JSON in the first place? – chepner Oct 31 '19 at 17:30
  • If I write to CSV with a plain json.dumps, the line contains many commas, and then the CSV reader can't parse the file (it sees far more columns than expected). – jonb Oct 31 '19 at 17:40
  • Does this answer your question? [How to save a spark DataFrame as csv on disk?](https://stackoverflow.com/questions/33174443/how-to-save-a-spark-dataframe-as-csv-on-disk) – GSazheniuk Oct 31 '19 at 17:43
  • `json.dumps` isn't writing CSV *at all*; at best, your quasi-JSON will pass as valid CSV input, but I wouldn't count on it. I would look into avoiding JSON altogether, and getting a `list` (or something that the `csv` module can handle) directly from your data frame. – chepner Oct 31 '19 at 17:53
  • @GSazheniuk sadly no, I don't have write permissions on that server; I can only create a CSV and download it. – jonb Oct 31 '19 at 20:01
  • @chepner I'd prefer to use something more interpretable, like a dictionary... – jonb Oct 31 '19 at 20:02
  • Use whatever *data structure* you want; just don't use JSON as the middleman for converting that to CSV. – chepner Oct 31 '19 at 20:04
  • @chepner it has to go through JSON: even if I use a list, it is stored as a Spark array and therefore can't be exported to CSV. It can't remain a complex datatype, so it must be converted to a string... – jonb Oct 31 '19 at 21:13

0 Answers