I'm stuck on the last step of building a crawler. If I crawl a couple of files that share the same structure, it works perfectly. But if I try to grab ones with an older schema (missing columns), I get an exception:

ParserError: Expected 42 fields in line 273, saw 47

If I change the engine to python-fwf it works, but then I can no longer use index_col = "Div", which I need in order to deal with NA rows without producing errors.
    import time
    import pandas as pd

    # linkGenerator, debugging_mode and debugging_path are defined elsewhere
    def dataImport(df_raw):
        concat_i = []
        progress = len(linkGenerator())
        count = 1
        debugging_var = 1
        print("0 of " + str(progress))
        for i in list(linkGenerator()):
            if debugging_mode:
                df_debugger = pd.read_csv(i, sep=",", header=0, engine="python",
                                          encoding="ISO-8859-1", index_col="Div")
                df_debugger.to_csv(debugging_path + str(debugging_var)
                                   + "_of_" + str(progress) + ".csv")
                debugging_var += 1
            print("Appending: " + str(i))
            df_internal = pd.read_csv(i, sep=",", engine="python",
                                      encoding="ISO-8859-1", index_col=False)
            concat_i.append(df_internal)
            print("Appended.")
            time.sleep(2)
            print(str(count) + " of " + str(progress))
            count += 1
        df_raw = pd.concat(concat_i, ignore_index=True)
        df_raw.dropna(subset=["Div"], inplace=True)
        return df_raw
I tried using names = range(100) and approaches like the ones in import csv with different number of columns per row using Pandas.
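For context, this is roughly what the names=range(...) trick does, as far as I understand it: a deliberately wide dummy header lets every ragged row parse, and short rows are padded with NaN instead of raising ParserError. A minimal sketch with io.StringIO standing in for a crawled file (the column values are invented):

```python
import io
import pandas as pd

# A "file" whose data row carries more fields than the header row --
# the situation behind "Expected 42 fields in line 273, saw 47".
ragged = io.StringIO("Div,Date\nE0,01/01/19,1.5,2.0\n")

# names=range(6) supplies six column labels up front, so no row can
# exceed the expected field count; missing trailing fields become NaN.
df = pd.read_csv(ragged, header=None, names=range(6))
print(df.shape)  # (2, 6)
```

The downside is that the real header line ends up as data row 0 and the columns are numbered, which is presumably why it didn't slot into the function above directly.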
My guess is that df_raw = pd.concat(concat_i, ignore_index = True) is the problem.
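To check that guess in isolation: as far as I can tell, pd.concat aligns frames on their column names and fills missing columns with NaN, so frames from files with different schemas combine fine once each file has been read successfully. A minimal sketch (column names invented for illustration, io.StringIO standing in for the crawled URLs):

```python
import io
import pandas as pd

# Two "files" with different schemas -- the second (older) one is
# missing the "Extra" column.
new_schema = io.StringIO("Div,Date,Extra\nE0,01/01/20,1.5\n")
old_schema = io.StringIO("Div,Date\nE0,01/01/19\n")

frames = [pd.read_csv(f, engine="python", index_col=False)
          for f in (new_schema, old_schema)]

# concat aligns on column names; the old-schema row gets NaN in "Extra".
combined = pd.concat(frames, ignore_index=True)
print(list(combined.columns))  # ['Div', 'Date', 'Extra']
```

If that holds, the error would be raised by read_csv on the individual file, before concat is ever reached.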
I'd be glad for any help.
Cheers! :)