
I've got multiple .txt files (c. 2.5 GB each) that I want to load into a SQL database. I'm trying to use pandas for this: first reading each file into a dataframe (possibly in chunks to avoid memory issues), then loading the dataframe into the SQL database.

Each file has only 2 columns (the column delimiter is ¬), but the values in the second column can be split across multiple lines. The column values are qualified/wrapped in double quotes. When I read the file, pandas breaks these values into multiple rows whenever it encounters a newline character, and the data then goes into the wrong columns. I want to read everything enclosed in double quotes as a single text stream/value. How can I correct this?

PS - sometimes the column delimiter also appears within the data, but inside double quotes.

Sample Data

"LINE_ID"¬"LINE_TEXT"
"C1111-G00-BC222"¬"this line is
split into
multiple lines
% All needs to be read into 1 line
% Currently that's not happening
"
"C22-f0-333"¬"2nd row. This line is
split into
multiple lines
% All needs to be read into 1 line
% Currently that's not happening
  *******************************************************************
  This line also includes the column delimiter within text qualifier
  *******************************************************************
  # !¬!¬!¬|
"

My Code

import pandas as pd
import os
from dbconnection import DBConnection

path = r'C:\Sample_load_file.txt'
df = pd.read_csv(path, quotechar='"', engine='python', sep = "\¬")


#Check
df

#Add meta data
df['loaded_by'] = 'xxxx'
df['file_line_number'] = range(2,len(df)+2)
df['load_date'] = pd.Timestamp.now()
df['source_file'] = path
df['loading_script'] = r'xxxx.ipynb'

db = DBConnection(server='XXXX', database='XXXX')
df.to_sql('table_name', db.engine, schema='src', index=False, if_exists='append')

df_from_sql = pd.read_sql('select count(*) from src.table_name', db.engine)
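
For reference, here is a minimal sketch of the chunked read-and-load flow I'm aiming for. It assumes (untested) that a single-character sep='¬' together with quotechar='"' would keep the quoted newlines together; the chunksize value is illustrative:

import pandas as pd
from dbconnection import DBConnection

path = r'C:\Sample_load_file.txt'
db = DBConnection(server='XXXX', database='XXXX')

# Assumption: a plain single-character separator (no backslash) lets the
# parser honour quotechar; chunksize bounds memory per iteration.
for chunk in pd.read_csv(path, sep='¬', quotechar='"', engine='python',
                         chunksize=100_000):
    chunk['loaded_by'] = 'xxxx'
    chunk.to_sql('table_name', db.engine, schema='src',
                 index=False, if_exists='append')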
Chipmunk_da

1 Answer

Here is a script you can use as a starting point to process your data:

Load the text file:

import pandas as pd

with open('data.txt') as file:
    data = file.read()

Build a list of records using '"\n"' (closing quote, newline, opening quote) as a custom separator:

data = data.split('"\n"')

Split each record into columns using '"¬"' as a custom separator:

data = [line.split('"¬"') for line in data]

Clean up the remaining double quotes:

data = [[e.replace('"', '') for e in line] for line in data]

Load the data into a dataframe:

df = pd.DataFrame(data)

Reformat the dataframe:

df.columns = df.iloc[0]  # Set first row as column header
df = df.iloc[1:].reset_index(drop=True)  # Drop the header row and reset the index
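
Note that splitting on the three-character sequence '"\n"' rather than on bare newlines is what keeps the multi-line values intact: a newline only terminates a record when it sits between a closing and an opening double quote.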

UPDATE: the same steps in one script:

import pandas as pd

data = """"LINE_ID"¬"LINE_TEXT"
"C1111-G00-BC222"¬"this line is
split into
multiple lines
% All needs to be read into 1 line
% Currently that's not happening
"
"C22-f0-333"¬"2nd row. This line is
split into
multiple lines
% All needs to be read into 1 line
% Currently that's not happening
  *******************************************************************
  This line also includes the column delimiter within text qualifier
  *******************************************************************
  # !¬!¬!¬|
"
"""

data = data.split('"\n"')
data = [line.split('"¬"') for line in data]
data = [[e.replace('"', '') for e in line] for line in data]
df = pd.DataFrame(data)
df.columns = df.iloc[0] # Set first row as column index
df = df.iloc[1:].reset_index(drop=True) # Drop first line and reset index
df
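
If memory is a concern with the full 2.5 GB files, the same idea can be applied incrementally instead of reading the whole file at once. Below is a minimal sketch, not part of the original approach: it assumes (as in the sample data) that every record ends with a line containing only the closing double quote, and iter_records plus the batch size of 10,000 are illustrative names/values:

import pandas as pd

def iter_records(file):
    # Accumulate physical lines until the record terminator:
    # a line containing only the closing double quote.
    buffer = []
    for raw in file:
        buffer.append(raw)
        if raw.rstrip('\n') == '"':
            record = ''.join(buffer).rstrip('\n')
            buffer = []
            yield [e.replace('"', '') for e in record.split('"¬"')]

with open('data.txt') as file:
    # The header sits on a single line with no lone-quote terminator,
    # so read it separately.
    header = [h.strip('"') for h in file.readline().rstrip('\n').split('"¬"')]
    batch = []
    for rec in iter_records(file):
        batch.append(rec)
        if len(batch) == 10_000:
            df = pd.DataFrame(batch, columns=header)
            # ... df.to_sql(..., if_exists='append') here ...
            batch = []
    if batch:
        df = pd.DataFrame(batch, columns=header)
        # ... final df.to_sql(...) here ...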
blondelg
  • Thanks! I tried `data = [[line[0].replace('"', '') , line[1].replace('"', '')] for line in data]` but this gives an error `IndexError: list index out of range` – Chipmunk_da May 14 '21 at 14:13
  • I ran the code on the sample data I posted in the question. And it doesn't seem to be working as expected. The 2 columns `LINE_ID` and `LINE_TEXT` are still concatenated together (with `¬` and `double quotes` still present). – Chipmunk_da May 14 '21 at 14:41
  • I'm basically trying to create a df with 2 clean columns that I can upload to SQL – Chipmunk_da May 14 '21 at 14:43
  • thanks for your help! I've used your answer now (just adding a line to split the columns). I'm still trying to solve the issue of running out of memory when reading the large file (I think your code reads the entire file into memory). So I've posted the question here - https://stackoverflow.com/q/67537970/6900402 – Chipmunk_da May 14 '21 at 16:50