
I have an input string like the one shown below, and I would like to split it on commas into a dict like the output shown below. The problem is that some commas sit inside parentheses, as in the example, and there are also quotes nested inside quotes. I'm not very handy with regular expressions, so any tips are greatly appreciated.

input:

"ty_event_name, from_unixtime(unix_timestamp(regexp_replace(ty_date,'/','-'),'MM-dd-yyyy'),'yyyy-MM-dd') as ty_date,'${hiveconf:run_dt}' as sessions_fy,orders_xy"

output:

{1: 'ty_event_name',
 2: "from_unixtime(unix_timestamp(regexp_replace(ty_date,'/','-'),'MM-dd-yyyy'),'yyyy-MM-dd') as ty_date",
 3: "'${hiveconf:run_dt}' as sessions_fy",
 4: 'orders_xy'}

Tried:

import re

teststr="ty_event_name, from_unixtime(unix_timestamp(regexp_replace(ty_date,'/','-'),'MM-dd-yyyy'),'yyyy-MM-dd') as ty_date,'${hiveconf:run_dt}' as sessions_fy,orders_xy"

tstr=re.sub('(?!\B"[^"]*),(?![^"]*"\B)',',',teststr).split()

tstr

Output:

['ty_event_name,',
 "from_unixtime(unix_timestamp(regexp_replace(ty_date,'/','-'),'MM-dd-yyyy'),'yyyy-MM-dd')",
 'as',
 "ty_date,'${hiveconf:run_dt}'",
 'as',
 'sessions_fy,orders_xy']
user3476463
  • Please show us what you've tried so that we can tailor our answers to where you are already at in the process. Requests for code don't tend to attract a lot of quality answers. You'll usually get what you're looking for faster if you include more context, especially the code you've already tried with an explanation of why it doesn't suit your needs. – JDB Dec 21 '18 at 20:17
  • Possible duplicate of [Regex find comma not inside quotes](https://stackoverflow.com/questions/21105360/regex-find-comma-not-inside-quotes) – Nick Dec 21 '18 at 20:25
  • @Nick I've updated the question to make the formatting of the input clearer, and added an attempt based on the question you linked. I don't think this is an exact duplicate, because the issue isn't just the quotes. The main issue is splitting the string on commas that are outside of parentheses. – user3476463 Dec 21 '18 at 21:17

1 Answer


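This did the trick for my input. The lookahead treats a comma as a split point only if the next opening parenthesis, or the end of the string, is reached before any closing parenthesis: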

code:

re.split(r',\s*(?=[^)]*(?:\(|$))', teststr) 

output:

['ty_event_name',
 "from_unixtime(unix_timestamp(regexp_replace(ty_date,'/','-'),'MM-dd-yyyy'),'yyyy-MM-dd') as ty_date",
 "'${hiveconf:run_dt}' as sessions_fy",
 'orders_xy']
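
The lookahead pattern works for this input, but as far as I can tell it would also split inside something like `f(a, b(c))`, because an opening parenthesis follows the inner comma before any closing one. As a hedged alternative, here is a minimal sketch that walks the string once and tracks parenthesis depth and open quotes instead of using a regex. The helper name `split_top_level` is made up for illustration, and the sketch assumes quotes are not escaped inside quoted values. It also builds the 1-based dict asked for in the question.

```python
def split_top_level(text, sep=","):
    """Split text on sep, skipping separators inside parentheses or quotes.

    Hypothetical helper; assumes balanced quotes with no escaped quote characters.
    """
    parts = []
    current = []
    depth = 0      # current parenthesis nesting level
    quote = None   # the quote character we are inside, if any
    for ch in text:
        if quote:
            # Inside a quoted value: only the matching quote ends it.
            if ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)
        elif ch == sep and depth == 0:
            # Top-level separator: close the current field.
            parts.append("".join(current).strip())
            current = []
            continue
        current.append(ch)
    parts.append("".join(current).strip())
    return parts


teststr = "ty_event_name, from_unixtime(unix_timestamp(regexp_replace(ty_date,'/','-'),'MM-dd-yyyy'),'yyyy-MM-dd') as ty_date,'${hiveconf:run_dt}' as sessions_fy,orders_xy"

fields = split_top_level(teststr)
# Number the fields from 1, as in the desired output.
result = dict(enumerate(fields, start=1))
print(result)
```

For the test string this yields the same four fields as the `re.split` call above. The difference only shows up on inputs with nested calls after a comma inside parentheses.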
user3476463