In one of my classes, we are required to load and append 6 CSV files together, using python. These files do NOT have the headers included. I have tried for hours, multiple methods for either directly combining the CSV files or reading them individually and appending them, using different guides online. For what should be a simple task, I am running into numerous issues. When I tried an approach that included joining the files right out of the gate, I was getting numerous error messages in response.
To confirm, the initial steps I need to perform are:
- Load CSV files from local directory into dataframe
- Add the provided headers, which are missing in the CSVs
- Append the 6 CSV files together into one consolidated dataset
(Not necessarily in that order)
The data files are located here, for reproducibility. https://drive.google.com/drive/folders/1ZKBFbsUBNUhsWtVtsMqOtXKx4SL-pFnt?usp=sharing
Here are the files we are using
I tried using the following script I found online to append all the CSV's together from the get-go, but was receiving numerous errors
import pandas as pd
import glob
import os
# setting the path for joining multiple files
files = os.path.join("D:/User Data/Dropbox/2022-10-19 Semester/StudentFiles/StudentDataFiles/Data Files/", "*.csv")
# list of merged files returned
files = glob.glob(files)
print(files);
# joining files with concat and read_csv
df = pd.concat(map(pd.read_csv, files), ignore_index=True)
print(df)
The script works down to the files = glob.glob(files) section, as I can print the results. However, it seems the errors come in with the pd.concat statement... shown below
Traceback (most recent call last):
File "D:/User Data/Dropbox/2022-10-19 Semester/StudentFiles/StudentDataFiles/IN498_M2_2.py", line 14, in <module>
df = pd.concat(map(pd.read_csv, files), ignore_index=True)
File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\core\reshape\concat.py", line 347, in concat
op = _Concatenator(
File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\core\reshape\concat.py", line 401, in __init__
objs = list(objs)
File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\io\parsers\readers.py", line 680, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\io\parsers\readers.py", line 575, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\io\parsers\readers.py", line 933, in __init__
self._engine = self._make_engine(f, self.engine)
File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\io\parsers\readers.py", line 1235, in _make_engine
return mapping[engine](f, **self.options)
File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 75, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas\_libs\parsers.pyx", line 544, in pandas._libs.parsers.TextReader.__cinit__
File "pandas\_libs\parsers.pyx", line 633, in pandas._libs.parsers.TextReader._get_header
File "pandas\_libs\parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas\_libs\parsers.pyx", line 1952, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
I thought perhaps these errors had to do with the layout of the data or structure of the CSV file, so instead, I tried starting off by simply uploading a singular CSV into a python dataframe and adding headers to it. It seems however that when I try to do this the data is being recognized as only one column, shown below, like the columns aren't being split by the delimiters, even though it is comma delimited, which should be natively readable within pandas. So, I thought maybe the issue was the missing headers, or missing values within the dataset or something, but I don't know what is causing the issues.... I have tried accomplishing this multiple ways, to no avail. I have tried to using various attributes within the read_csv function, including: names=headerslist, encoding, header=none, keep_default_na=False, sep=',', skiprows=[0], and a few others.
import pandas as pd
import glob
import os
import csv
headerslist = ['Date','Package_Name','Country','Store_Listing_Visitors','Installers','Visitor-to-Installer_conversion_rate','Installers_retained_for_1_day','Installer-to-1_day_retention_rate','Installers_retained_for_7_days','Installer-to-7_days_retention_rate','Installers_retained_for_15_days','Installer-to-15_days_retention_rate','Installers_retained_for_30_days','Installer-to-30_days_retention_rate']
df = pd.read_csv('D:/User Data/Dropbox/2022-10-19 Semester/StudentFiles/StudentDataFiles/Data Files/retained_installers_com.foo.bar_201904_country.csv', keep_default_na=False, sep=',', skiprows=[0], delimiter=None, header=None, encoding='cp1252')
df2 = headerslist.append(df)
print(df)
I have tried Adrian's solution, but get an odd error back
import csv
import glob
import os
files = os.path.join("D:/User Data/Dropbox/Kristophers Files/School (Purdue Global)/2022-10-19 Semester/IN498 - Capstone/StudentFiles/StudentDataFiles/Data Files/", "*.csv")
# list of merged files returned
files = glob.glob(files)
header = None
new_file = []
for f in (files):
with open(f, newline='') as csv_file:
reader = csv.reader(csv_file)
if not header:
new_file.append(next(reader))
header = True
else:
next(reader)
for row in reader:
new_file.append(row)
with open('CombinedCSV.csv', 'w', newline='') as csv_out:
writer = csv.writer(csv_out)
writer.writerows(new_file)
Error/Traceback:
C:\Users\KDPen\anaconda3\python.exe "D:\User Data\Dropbox\2022-10-19 Semester\IN498_M2_3.py"
Traceback (most recent call last):
File "D:\User Data\Dropbox\2022-10-19 Semester\IN498_M2_3.py", line 20, in <module>
next(reader)
_csv.Error: line contains NUL
Process finished with exit code 1