
I want to split each field below into separate columns, without specifying column names, since I am processing many files one after another and each file has a different number of columns. The goal is to read each file into a dataframe, using its filename as the dataframe name. I am using a dictionary for this:

(sample data was shown as an image in the original post)

import pandas as pd
import zipfile
import re


def NewLineCorrection(content):
    # Remove line breaks that occur inside double-quoted fields,
    # then strip the quotes and replace the '||@@##' delimiter with commas.
    corrected_content = re.sub(
        r'"[^"]*"',
        lambda x: re.sub(r'[\r\n\x0B\x0C\u0085\u2028\u2029]', '', x.group()),
        content,
    )
    corrected_content = corrected_content.replace('"', '')
    corrected_content = corrected_content.replace('||@@##', ',')

    ContentList = list(corrected_content.splitlines())
    return ContentList


Tables = {}

# Note: zipfile.ZipFile expects the path of a single archive; it does not expand wildcards.
with zipfile.ZipFile('*.zip') as z:
    for filename in z.namelist():
        df_name = filename.split(".")[1]
        if df_name == 'hp':
            with z.open(filename) as f:
                content = f.read().decode('utf-8')
                content = NewLineCorrection(content)
                df = pd.DataFrame(content)    # a single column holding whole text lines
                cols = list(df[0][0])
                df[0] = list(map(lambda el: [el], df[0]))
                #df[0] = df[0].split(',')
                print(df.head())
                #df.columns = df.iloc[0]
                #df = df.drop(index=0).reset_index(drop=True)
                #Tables[df_name] = df

The .split() call is not working for me, and I am not sure how to find out why.
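For reference, a plain Python string has a .split() method, but a pandas column (a Series) does not; the element-wise version lives behind the .str accessor. A minimal sketch, assuming a single-column dataframe like the one built above (the sample lines here are hypothetical):

import pandas as pd

# hypothetical sample lines standing in for the output of NewLineCorrection
lines = ["a,b,c", "d,e,f"]
df = pd.DataFrame(lines)    # a single column labelled 0, holding whole lines

# Series.str.split with expand=True returns one column per field, with integer
# column labels (0, 1, 2, ...), so no column names need to be specified.
split_df = df[0].str.split(",", expand=True)
print(split_df)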

cyrus24
  • Please check [Handle unwanted line breaks with read_csv in Pandas](https://stackoverflow.com/questions/52366670/handle-unwanted-line-breaks-with-read-csv-in-pandas) – Wiktor Stribiżew Apr 15 '21 at 08:36
  • Thanks Wiktor, I checked out your suggestion. I modified my question to accommodate the some changes and progress so far. I do not think I have an option to read the file as csv once it is processed as text file using the `NewLineCorrection` function, the text file content has to be read to a DataFrame instead. – cyrus24 Apr 15 '21 at 12:45

1 Answer


This is probably not the most efficient way, but it is flexible because you can use the pandas read_csv reader for each csv file. That way you can, for instance, have the dates/times in the csv files parsed automatically.

In the example below I assume there are two csv files with the following contents:

$ cat csv_with_brackets1.csv
[col1, col2, col3, col4]
[1, 14, 2015-12-02 17:02:32, 2, 1 ]
[2, 14, 2016-12-02 17:02:32, 4, 3 ]
[3, 14, 2018-12-02 17:02:32, 8, 4 ]
[4, 14, 2019-12-02 17:02:32, 9, 2 ]

$ cat csv_with_brackets2.csv
[othercol1, othercol2, othercol3, othercol4]
[1, 16, 2005-12-02 13:02:32, 3, 1 ]
[2, 16, 2006-12-02 13:02:32, 9, 3 ]
[3, 16, 2008-12-02 13:02:32, 8, 4 ]
[4, 16, 2009-12-02 13:02:32, 1, 2 ]

The main problem is that your lines start with a [ and end with a ] character, which is not standard. The function read_csv_with_brackets below first removes the brackets by writing a cleaned temporary csv file, and then uses the read_csv function from pandas to actually read the data.

import re
from glob import glob
from pathlib import Path

import pandas as pd


def read_csv_with_brackets(filename):
    print(f"reading {filename}")
    tmp_file = "tmp_csv.txt"
    # Write a cleaned copy of the file with the leading '[' and trailing ']' stripped
    with open(filename, "r") as in_stream, open(tmp_file, "w") as out_stream:
        for line in in_stream:
            clean_line = re.sub(r"^\[|\]$", "", line.strip()) + "\n"
            out_stream.write(clean_line)
    # Read the cleaned copy with the regular csv reader, then remove the temporary file
    data_df = pd.read_csv(tmp_file, parse_dates=True)
    Path(tmp_file).unlink()

    return data_df


# Collect all csv files into a dictionary keyed by the file name without its extension
file_collection = dict()
for file_name in glob("*.csv"):
    file_base = Path(file_name).stem
    df = read_csv_with_brackets(filename=file_name)
    file_collection[file_base] = df

for file_name, df in file_collection.items():
    print(f"Contents file {file_name}")
    print(df)

The downside of this approach is that you have to read each data file twice. The plus is that you can deal with invalid columns because read_csv is very robust.
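If the temporary file and the extra pass bother you, one possible variation (a sketch, not tested against your data) is to clean the lines in memory and hand them to read_csv through io.StringIO, which read_csv accepts like any other file-like object:

import io
import re

import pandas as pd


def read_csv_with_brackets_in_memory(filename):
    # strip the leading '[' and trailing ']' from every line without writing a temp file
    with open(filename, "r") as in_stream:
        cleaned = "\n".join(re.sub(r"^\[|\]$", "", line.strip()) for line in in_stream)
    # pass the cleaned text to read_csv via an in-memory buffer
    return pd.read_csv(io.StringIO(cleaned), parse_dates=True)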

The output of the main script above looks like this:

Contents file csv_with_brackets1
   col1                  col2   col3   col4
1    14   2015-12-02 17:02:32      2      1
2    14   2016-12-02 17:02:32      4      3
3    14   2018-12-02 17:02:32      8      4
4    14   2019-12-02 17:02:32      9      2
Contents file csv_with_brackets2
   othercol1             othercol2   othercol3   othercol4
1         16   2005-12-02 13:02:32           3           1
2         16   2006-12-02 13:02:32           9           3
3         16   2008-12-02 13:02:32           8           4
4         16   2009-12-02 13:02:32           1           2
Eelco van Vliet