0

I am trying to separate text using a semi-colon delimiter. I have a source text file that contains a lot of SQL queries. For example:

SELECT 
    employee.id,
    employee.first_name,
    employee.last_name,
    SUM(DATEDIFF("SECOND", call.start_time, call.end_time)) AS call_duration_sum
FROM call
INNER JOIN employee ON call.employee_id = employee.id
GROUP BY
    employee.id,
    employee.first_name,
    employee.last_name
ORDER BY
    employee.id ASC;
    
    
------------------------
SELECT 
    single_employee.id,
    single_employee.first_name,
    single_employee.last_name,
    single_employee.call_duration_avg,
    single_employee.call_duration_avg - avg_all.call_duration_avg AS avg_difference
FROM
(
    SELECT 
        1 AS join_id,
        employee.id,
        employee.first_name,
        employee.last_name,
        AVG(DATEDIFF("SECOND", call.start_time, call.end_time)) AS call_duration_avg
    FROM call
    INNER JOIN employee ON call.employee_id = employee.id
    GROUP BY
        employee.id,
        employee.first_name,
        employee.last_name
) single_employee
    
INNER JOIN
    
(
    SELECT
        1 AS join_id,
        AVG(DATEDIFF("SECOND", call.start_time, call.end_time)) AS call_duration_avg
    FROM call
) avg_all ON avg_all.join_id = single_employee.join_id;

-----------------------------
SELECT *
FROM call
ORDER BY
    call.employee_id ASC,
    call.start_time ASC;

These 3 sample queries must be split into 3 rows of a dataframe - each containing a query.

As you can see there are complex queries in this text and I see that the problem with regex is the \n character.

I see that there is no direct way to do this and the original file contains around 2000 SQL queries.

I would like to extract each query as a single entity and place it in a separate row.

Vyas
  • 57
  • 11
  • Does this answer your question? [Regular expression matching a multiline block of text](https://stackoverflow.com/questions/587345/regular-expression-matching-a-multiline-block-of-text) – alexanderbird Jul 07 '21 at 13:49
  • @Vyas - let's divide this into two parts. One splitting these queries, second adding it into a DF. Once you split it will result into list and then all the element of list can be moved to DF as per your requirement. Split if for : and strip every element for '/n'. That will work. – Manpreet Jul 07 '21 at 14:08
  • The split is the complicated part and I am finding this difficult. The split criteria is the semi-colon but the \n is coming in between. I understand the process though. – Vyas Jul 07 '21 at 14:17

1 Answers1

0

I was able to implement a decent solution using pandas.

store the data in the same file in """data1"""

    data1 = data.replace('\n', '').split(';')
    df = pd.DataFrame.from_dict({'column': data1})
    for i in range(len(df)):
        df['column'][i] = df['column'][i].strip('-')
    # convert the entire dataframe  to a pandas DF and store in CSV
    df.to_csv('file_name.csv', index=False)

Vyas
  • 57
  • 11