
I've got multiple .txt files (c. 2.5 GB each) that I want to load into a SQL database. I'm trying to use pandas for this: first reading each file into a dataframe (possibly in chunks to avoid memory issues), then loading the dataframe into the SQL database.

Each file has only 2 columns (the column delimiter is ¬), but the values in the second column can be split across multiple lines. The column values are qualified/wrapped in double quotes. When I read the file, pandas breaks these values into multiple rows whenever it encounters a newline character, and the data then goes into the wrong columns. I want to read everything enclosed in double quotes as a single text stream/value. How can I correct this?

PS - sometimes the column delimiter also appears within the data, but inside double quotes.

Sample Data

"LINE_ID"¬"LINE_TEXT"
"C1111-G00-BC222"¬"this line is
split into
multiple lines
% All needs to be read into 1 line
% Currently that's not happening
"
"C22-f0-333"¬"2nd row. This line is
split into
multiple lines
% All needs to be read into 1 line
% Currently that's not happening
  *******************************************************************
  This line also includes the column delimiter within text qualifier
  *******************************************************************
  # !¬!¬!¬|
"

My Code

import pandas as pd
import os
from dbconnection import DBConnection

path = r'C:\Sample_load_file.txt'
df = pd.read_csv(path, quotechar='"', engine='python', sep = "\¬")


#Check
df

#Add meta data
df['loaded_by'] = 'xxxx'
df['file_line_number'] = range(2,len(df)+2)
df['load_date'] = pd.Timestamp.now()
df['source_file'] = path
df['loading_script'] = r'xxxx.ipynb'

db = DBConnection(server='XXXX', database='XXXX')
df.to_sql('table_name', db.engine, schema='src', index=False, if_exists='append')

df_from_sql = pd.read_sql('select count(*) from src.table_name', db.engine)
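
For reference, here is a minimal sketch of the chunked read-and-load flow I'm aiming for. It assumes (untested) that a single-character sep='¬' together with quotechar='"' would keep the quoted newlines together; the chunksize value is illustrative:

import pandas as pd
from dbconnection import DBConnection

path = r'C:\Sample_load_file.txt'
db = DBConnection(server='XXXX', database='XXXX')

# Assumption: a plain single-character separator (no backslash) lets the
# parser honour quotechar; chunksize bounds memory per iteration.
for chunk in pd.read_csv(path, sep='¬', quotechar='"', engine='python',
                         chunksize=100_000):
    chunk['loaded_by'] = 'xxxx'
    chunk.to_sql('table_name', db.engine, schema='src',
                 index=False, if_exists='append')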
Chipmunk_da

1 Answer

Here is a script you can use as a starting point to process your data:

Load the text file:

import pandas as pd

with open('data.txt') as file:
    data = file.read()

Build a list of records using '"\n"' (closing quote, newline, opening quote) as a custom separator:

data = data.split('"\n"')

Split each record into columns using '"¬"' as a custom separator:

data = [line.split('"¬"') for line in data]

Clean up the remaining double quotes:

data = [[e.replace('"', '') for e in line] for line in data]

Load the data into a dataframe:

df = pd.DataFrame(data)

Reformat the dataframe:

df.columns = df.iloc[0]  # Set first row as column header
df = df.iloc[1:].reset_index(drop=True)  # Drop the header row and reset the index
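
Note that splitting on the three-character sequence '"\n"' rather than on bare newlines is what keeps the multi-line values intact: a newline only terminates a record when it sits between a closing and an opening double quote.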

UPDATE: the same steps in one script:

import pandas as pd

data = """"LINE_ID"¬"LINE_TEXT"
"C1111-G00-BC222"¬"this line is
split into
multiple lines
% All needs to be read into 1 line
% Currently that's not happening
"
"C22-f0-333"¬"2nd row. This line is
split into
multiple lines
% All needs to be read into 1 line
% Currently that's not happening
  *******************************************************************
  This line also includes the column delimiter within text qualifier
  *******************************************************************
  # !¬!¬!¬|
"
"""

data = data.split('"\n"')
data = [line.split('"¬"') for line in data]
data = [[e.replace('"', '') for e in line] for line in data]
df = pd.DataFrame(data)
df.columns = df.iloc[0] # Set first row as column index
df = df.iloc[1:].reset_index(drop=True) # Drop first line and reset index
df
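
If memory is a concern with the full 2.5 GB files, the same idea can be applied incrementally instead of reading the whole file at once. Below is a minimal sketch, not part of the original approach: it assumes (as in the sample data) that every record ends with a line containing only the closing double quote, and iter_records plus the batch size of 10,000 are illustrative names/values:

import pandas as pd

def iter_records(file):
    # Accumulate physical lines until the record terminator:
    # a line containing only the closing double quote.
    buffer = []
    for raw in file:
        buffer.append(raw)
        if raw.rstrip('\n') == '"':
            record = ''.join(buffer).rstrip('\n')
            buffer = []
            yield [e.replace('"', '') for e in record.split('"¬"')]

with open('data.txt') as file:
    # The header sits on a single line with no lone-quote terminator,
    # so read it separately.
    header = [h.strip('"') for h in file.readline().rstrip('\n').split('"¬"')]
    batch = []
    for rec in iter_records(file):
        batch.append(rec)
        if len(batch) == 10_000:
            df = pd.DataFrame(batch, columns=header)
            # ... df.to_sql(..., if_exists='append') here ...
            batch = []
    if batch:
        df = pd.DataFrame(batch, columns=header)
        # ... final df.to_sql(...) here ...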
blondelg
  • Thanks! I tried `data = [[line[0].replace('"', '') , line[1].replace('"', '')] for line in data]` but this gives an error `IndexError: list index out of range` – Chipmunk_da May 14 '21 at 14:13
  • I ran the code on the sample data I posted in the question. And it doesn't seem to be working as expected. The 2 columns `LINE_ID` and `LINE_TEXT` are still concatenated together (with `¬` and `double quotes` still present). – Chipmunk_da May 14 '21 at 14:41
  • I'm basically trying to create a df with 2 clean columns that I can upload to SQL – Chipmunk_da May 14 '21 at 14:43
  • thanks for your help! I've used your answer now (just adding a line to split the columns). I'm still trying to solve the issue of running out of memory when reading the large file (I think your code reads the entire file into memory). So I've posted the question here - https://stackoverflow.com/q/67537970/6900402 – Chipmunk_da May 14 '21 at 16:50