0

In one of my classes, we are required to load and append 6 CSV files together, using python. These files do NOT have the headers included. I have tried for hours, multiple methods for either directly combining the CSV files or reading them individually and appending them, using different guides online. For what should be a simple task, I am running into numerous issues. When I tried an approach that included joining the files right out of the gate, I was getting numerous error messages in response.

To confirm, the initial steps I need to perform are:

  • Load CSV files from local directory into dataframe
  • Add the provided headers, which are missing in the CSVs
  • Append the 6 CSV files together into one consolidated dataset

(Not necessarily in that order)

The data files are located here, for reproducibility. https://drive.google.com/drive/folders/1ZKBFbsUBNUhsWtVtsMqOtXKx4SL-pFnt?usp=sharing

Here are the files we are using retained_installers_com.foo.bar_201904_country

Dataset example

Dataset example: splitting CSV by comma delimiter in Excel

I tried using the following script I found online to append all the CSV's together from the get-go, but was receiving numerous errors

import pandas as pd
import glob
import os

# setting the path for joining multiple files
files = os.path.join("D:/User Data/Dropbox/2022-10-19 Semester/StudentFiles/StudentDataFiles/Data Files/", "*.csv")

# list of merged files returned
files = glob.glob(files)

print(files);

# joining files with concat and read_csv
df = pd.concat(map(pd.read_csv, files), ignore_index=True)
print(df)

The script works down to the files = glob.glob(files) section, as I can print the results. However, it seems the errors come in with the pd.concat statement... shown below

Traceback (most recent call last):
  File "D:/User Data/Dropbox/2022-10-19 Semester/StudentFiles/StudentDataFiles/IN498_M2_2.py", line 14, in <module>
    df = pd.concat(map(pd.read_csv, files), ignore_index=True)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\core\reshape\concat.py", line 347, in concat
    op = _Concatenator(
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\core\reshape\concat.py", line 401, in __init__
    objs = list(objs)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\io\parsers\readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\io\parsers\readers.py", line 575, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\io\parsers\readers.py", line 933, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\io\parsers\readers.py", line 1235, in _make_engine
    return mapping[engine](f, **self.options)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 75, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas\_libs\parsers.pyx", line 544, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas\_libs\parsers.pyx", line 633, in pandas._libs.parsers.TextReader._get_header
  File "pandas\_libs\parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas\_libs\parsers.pyx", line 1952, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

I thought perhaps these errors had to do with the layout of the data or structure of the CSV file, so instead, I tried starting off by simply uploading a singular CSV into a python dataframe and adding headers to it. It seems however that when I try to do this the data is being recognized as only one column, shown below, like the columns aren't being split by the delimiters, even though it is comma delimited, which should be natively readable within pandas. So, I thought maybe the issue was the missing headers, or missing values within the dataset or something, but I don't know what is causing the issues.... I have tried accomplishing this multiple ways, to no avail. I have tried to using various attributes within the read_csv function, including: names=headerslist, encoding, header=none, keep_default_na=False, sep=',', skiprows=[0], and a few others.

import pandas as pd
import glob
import os
import csv

headerslist = ['Date','Package_Name','Country','Store_Listing_Visitors','Installers','Visitor-to-Installer_conversion_rate','Installers_retained_for_1_day','Installer-to-1_day_retention_rate','Installers_retained_for_7_days','Installer-to-7_days_retention_rate','Installers_retained_for_15_days','Installer-to-15_days_retention_rate','Installers_retained_for_30_days','Installer-to-30_days_retention_rate']

df = pd.read_csv('D:/User Data/Dropbox/2022-10-19 Semester/StudentFiles/StudentDataFiles/Data Files/retained_installers_com.foo.bar_201904_country.csv', keep_default_na=False, sep=',',  skiprows=[0], delimiter=None,  header=None, encoding='cp1252')

df2 = headerslist.append(df)

print(df)

Output of individual script, showing no separation of columns

I have tried Adrian's solution, but get an odd error back

import csv
import glob
import os

files = os.path.join("D:/User Data/Dropbox/Kristophers Files/School (Purdue Global)/2022-10-19 Semester/IN498 - Capstone/StudentFiles/StudentDataFiles/Data Files/", "*.csv")

# list of merged files returned
files = glob.glob(files)

header = None
new_file = []
for f in (files):
    with open(f, newline='') as csv_file:
        reader = csv.reader(csv_file)
        if not header:
            new_file.append(next(reader))
            header = True
        else:
            next(reader)
        for row in reader:
            new_file.append(row)

with open('CombinedCSV.csv', 'w', newline='') as csv_out:
    writer = csv.writer(csv_out)
    writer.writerows(new_file)

Error/Traceback:

C:\Users\KDPen\anaconda3\python.exe "D:\User Data\Dropbox\2022-10-19 Semester\IN498_M2_3.py" 
Traceback (most recent call last):
  File "D:\User Data\Dropbox\2022-10-19 Semester\IN498_M2_3.py", line 20, in <module>
    next(reader)
_csv.Error: line contains NUL

Process finished with exit code 1
  • It looks like you have many independent issues, why don't you try to solve them one by one? For instance your `UnicodeDecodeError` has nothing to do with adding headers or concatenating DataFrames – mozway Oct 28 '22 at 04:04
  • See [ask]. You need to provide a [mre] and debugging details, including a sample data files. That said, the ```delimiter``` parameter of the ```read_csv()``` is an alias for the ```sep```. – relent95 Oct 28 '22 at 10:29
  • *"...we are required to load and append 6 CSV files together, using python."*. One could interpret that to use [csvstack](https://csvkit.readthedocs.io/en/latest/scripts/csvstack.html) from [csvkit](https://csvkit.readthedocs.io/en/latest/index.html) as it is Python [csvstack source](https://github.com/wireservice/csvkit/blob/master/csvkit/utilities/csvstack.py) Or you could look at my answer here [Merge csv](https://stackoverflow.com/questions/73753861/merge-multiple-csv-with-same-column-headers-python/73756253#73756253) – Adrian Klaver Oct 28 '22 at 15:55
  • @AdrianKlaver, I tried your solution in the linked Merge CSV link. I receive an odd error back from it. I will paste below how I changed the script. – Kristopher Penland Oct 28 '22 at 20:02
  • Add the code and traceback( **as text and properly formatted**) as update to your question. Putting them in the comments make them pretty close to unreadable. – Adrian Klaver Oct 28 '22 at 20:09
  • Apologies @AdrianKlaver, I have deleted the comments and edited my posting to include your attempted solution. It does appear like there is something wonky with the CSV files themselves that is affecting all of these solutions perhaps, based on the traceback I got to using your script. – Kristopher Penland Oct 29 '22 at 01:14
  • See [CSV NUL](https://stackoverflow.com/questions/4166070/python-csv-error-line-contains-null-byte#4169762) and/or search on *python Error: line contains NUL* for more information. – Adrian Klaver Oct 29 '22 at 04:15
  • @AdrianKlaver, thank you for your time and input regarding this! I will check out that posting and see if it helps to resolve this issue. Seems like the structure of these files was intentionally made faulty so that they are not easily importable. – Kristopher Penland Nov 02 '22 at 03:47

1 Answers1

0

It's because some of the files are encoded in UTF-16. You can find an encoding and specify the encoding when calling pandas.read_csv() like this.

import codecs
import pandas as pd

def find_encoding(path):
    with open(path, 'rb') as f:
        bytes = f.read(4)
    for bom, encoding in (
        (codecs.BOM_UTF8, 'utf-8-sig'),
        (codecs.BOM_UTF32_LE, 'utf-32'),
        (codecs.BOM_UTF32_BE, 'utf-32'),
        (codecs.BOM_UTF16_LE, 'utf-16'),
        (codecs.BOM_UTF16_BE, 'utf-16'),
    ):
        if bytes.startswith(bom):
            return encoding
    return None
...
def read_csv(path):
    return pd.read_csv(path, encoding=find_encoding(path), header=0)
df = pd.concat(map(read_csv, files), ignore_index=True)

And your files contain extra double quotes at the beginning and end of lines. To deal with them, do like this instead of the above read_csv().

def read_csv(path):
    lines = []
    with open(path, 'rt', encoding=find_encoding(path)) as f:
        for line in f:
            lines.append(line.rstrip()[1:-1])
    return pd.read_csv(io.StringIO('\n'.join(lines)), header=0)
relent95
  • 3,703
  • 1
  • 14
  • 17