Python regex: create pandas Dataframe with JavaScript strings of new Array()

Question

I am tryng to iterate a Javascript response in order to build a pandas Dataframe. I am starting in regex and JS, so there might be obviousl improvement to make code more reliable. Javascript response, return the following string:


    mmSuggestDeliver(0, new Array("Name", "Category", "Keywords", "Bias", "Extension", "IDs"), new Array(new Array("HSBC Holdings plc (Spons. ADRs)", "Stocks", "HSBC|US4042804066|HSBC||", "75", "", "hsbc|HSBC|1|4917"),new Array("HSBC Holdings plc", "Stocks", "|GB0005405286|||HSBA", "75", "", "hsbc-gb0005405286||1|1046"),new Array("HSBC Trinkaus & Burkhardt AG", "Stocks", "|DE0008115106|||TUBG", "75", "", "hsbc_trinkausburkhardt||1|3774"),new Array("HSBC Bank Malta Plc Registered Shs", "Stocks", "|MT0000030107|||", "75", "", "hsbc_bank_malta||1|16831644"),new Array("HSBC-D7 SA de CV SIID (A)", "Stocks", "|MX51HS0Q00E8|||", "75", "", "hsbc-d7||1|5125971"),new Array("HSBC US Buy-Out GmbH & Co. KGaA", "Stocks", "|DE000A0MM6H7|||", "75", "", "hsbc_us_buy-out_gmbhco||1|23145"),new Array("HSBC Holdings PLC ADR Cert Deposito Arg Repr 0.5 ADRs", "Stocks", "|ARDEUT112257|||", "75", "", "hsbc_2||1|1399269"),new Array("HSBC Holdings PLC6.2 % Pfd Shs Sponsored American Deposit Repr 1/40th 6.2 % PfdShs Ser -A-", "Stocks", "HSBC.PA|US4042806046|HSBC.PA||", "75", "", "hsbc-pa|HSBC.PA|1|19327"),new Array("HSBC Holdings PLC 8 % Perp Sub Cap Secs 2010-Without Fixed Maturity Series -2-", "Stocks", "HSEB|US4042808026|HSEB||", "75", "", "hseb|HSEB|1|5083319"),new Array("HSBC Holdings PLC 8 1-8 % Perpetual Sub Cap Secs 2008-Exch into Non-Cum Dollar Pref Shs", "Stocks", "HSEA|US4042807036|HSEA||", "75", "", "hsea|HSEA|1|3782270")), 10, 0);

I want data organised as the table bellow BUT without the double quote:


    "Name", "Category", "Keywords", "Bias", "Extension", "IDs"
    "HSBC Holdings plc (Spons. ADRs)", "Stocks", "HSBC|US4042804066|HSBC||", "75", "", "hsbc|HSBC|1|4917"
    "HSBC Holdings plc", "Stocks", "|GB0005405286|||HSBA", "75", "", "hsbc-gb0005405286||1|1046")
    ......
    "HSBC Holdings PLC 8 % Perp Sub Cap Secs 2010-Without Fixed Maturity Series -2-", "Stocks", "HSEB|US4042808026|HSEB||", "75", "", "hseb|HSEB|1|5083319"
    "HSBC Holdings PLC 8 1-8 % Perpetual Sub Cap Secs 2008-Exch into Non-Cum Dollar Pref Shs", "Stocks", "HSEA|US4042807036|HSEA||", "75", "", "hsea|HSEA|1|3782270"

Ideally i would like to have final result into a pandas Dataframe.

the bellow code would kind of work but has many caveats. Any optimisation/improvement and correction would be greatly apreciated:

code to get column names:

    js_text = """
mmSuggestDeliver(0, new Array("Name", "Category", "Keywords", "Bias", "Extension", "IDs"), 
new Array(new Array("HSBC Holdings plc (Spons. ADRs)", "Stocks", "HSBC|US4042804066|HSBC||", "75", "", "hsbc|HSBC|1|4917"),
new Array("HSBC Holdings plc", "Stocks", "|GB0005405286|||HSBA", "75", "", "hsbc-gb0005405286||1|1046"),
new Array("HSBC Trinkaus & Burkhardt AG", "Stocks", "|DE0008115106|||TUBG", "75", "", "hsbc_trinkausburkhardt||1|3774"),
new Array("HSBC Bank Malta Plc Registered Shs", "Stocks", "|MT0000030107|||", "75", "", "hsbc_bank_malta||1|16831644"),
new Array("HSBC-D7 SA de CV SIID (A)", "Stocks", "|MX51HS0Q00E8|||", "75", "", "hsbc-d7||1|5125971"),
new Array("HSBC US Buy-Out GmbH & Co. KGaA", "Stocks", "|DE000A0MM6H7|||", "75", "", "hsbc_us_buy-out_gmbhco||1|23145"),
new Array("HSBC Holdings PLC ADR Cert Deposito Arg Repr 0.5 ADRs", "Stocks", "|ARDEUT112257|||", "75", "", "hsbc_2||1|1399269"),
new Array("HSBC Holdings PLC6.2 % Pfd Shs Sponsored American Deposit Repr 1/40th 6.2 % PfdShs Ser -A-", "Stocks", "HSBC.PA|US4042806046|HSBC.PA||", "75", "", "hsbc-pa|HSBC.PA|1|19327"),new Array("HSBC Holdings PLC 8 % Perp Sub Cap Secs 2010-Without Fixed Maturity Series -2-", "Stocks", "HSEB|US4042808026|HSEB||", "75", "", "hseb|HSEB|1|5083319"),new Array("HSBC Holdings PLC 8 1-8 % Perpetual Sub Cap Secs 2008-Exch into Non-Cum Dollar Pref Shs", "Stocks", "HSEA|US4042807036|HSEA||", "75", "", "hsea|HSEA|1|3782270")), 10, 0);
    """

    regex_text = r"new Array\((.*)\)"
    column_header = _re.search(regex_text, js_text, flags=_re.MULTILINE).group(1)
    regex_text = ', (?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)'
    column_header = _re.split(regex_text, column_header, flags=_re.MULTILINE)
    print('regex1:', column_header)

which prints: regex1: ['"Name"', '"Category"', '"Keywords"', '"Bias"', '"Extension"', '"IDs"'] hugly mix of single quotes and double quotes...

code to get datas:

    regex_text = r"new Array\(([\s\S]*?)\),"
    table_rows = _re.findall(regex_text, js_text, flags=_re.MULTILINE)
    table_rows.pop(0)
    table_rows[0] = str(table_rows[0]).replace('new Array(', '')

    my_data = []
    regex_text = ', (?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)'

    for my_row in table_rows:
        my_row = _re.split(regex_text, my_row, flags=_re.MULTILINE)
        print('row is:', my_row)
        my_data.append(my_row)
    result_df = _pd.DataFrame(data=my_data, columns=column_header)
    print(result_df)
    print(result_df.dtypes)

which print an hugly DF as:

                                                  "Name"  ...                               "IDs"
    0                  "HSBC Holdings plc (Spons. ADRs)"  ...                  "hsbc|HSBC|1|4917"
    1                                "HSBC Holdings plc"  ...         "hsbc-gb0005405286||1|1046"
    2                     "HSBC Trinkaus & Burkhardt AG"  ...    "hsbc_trinkausburkhardt||1|3774"
    3               "HSBC Bank Malta Plc Registered Shs"  ...       "hsbc_bank_malta||1|16831644"
    4                        "HSBC-D7 SA de CV SIID (A)"  ...                "hsbc-d7||1|5125971"
    5                  "HSBC US Buy-Out GmbH & Co. KGaA"  ...   "hsbc_us_buy-out_gmbhco||1|23145"
    6  "HSBC Holdings PLC ADR Cert Deposito Arg Repr ...  ...                 "hsbc_2||1|1399269"
    7  "HSBC Holdings PLC6.2 % Pfd Shs Sponsored Amer...  ...           "hsbc-pa|HSBC.PA|1|19327"
    8  "HSBC Holdings PLC 8 % Perp Sub Cap Secs 2010-...  ...               "hseb|HSEB|1|5083319"
    9  "HSBC Holdings PLC 8 1-8 % Perpetual Sub Cap S...  ...              "hsea|HSEA|1|3782270")


    [10 rows x 6 columns]
    "Name"          object
     "Category"     object
     "Keywords"     object
     "Bias"         object
     "Extension"    object
     "IDs"          object
    dtype: object

The data code has numerous caveats, including:

if the "new Array" part of the JS has more/less spaces, the code will not find the row. I had initially used start end fo regex but then had another issue when there was no space separation between "," and "new Array" :regex:"new Array(\"(.*))"
all values are double quoted
it doesn't look very pythonesque...

Thanks for the help!

So far i looked at the following links amongst others:

enter link description here

Julio · Answer 1 · 2020-02-10T16:41:00.863

Try to find within a loop (until no match) the following regex on your input string:

\bnew\s+Array\s*[(]\s*(".*?)(?=[)]\s*,\s*new\s+Array|\s*[)]\s*[)]\s*[^(]+$)

Then, for every iteration, get the first capturing group.

The first iteration should get the header, while the following iterations should get the data.

You have a demo here (Green highlighting are the data you would keep with first capturing group)

If R supports \K, this is another alternative that may work for you:

\bnew\s+Array\s*[(]\s*\K"(?:[^"\\]+|\\.)*"\s*(?:,\s*"(?:[^"\\]++|\\.)*")+

In that case, you don't need to use capturing groups. All the data will be on the same match.

You have a demo here for that second approach.

Python regex: create pandas Dataframe with JavaScript strings of new Array()

1 Answers1