
I'm struggling with the last step of building a crawler. If I crawl a couple of files with the same structure it works perfectly, but if I try to grab ones with an older schema (missing columns) I get an exception:

ParserError: Expected 42 fields in line 273, saw 47

If I change the engine to python-fwf it works, but then I can't use `index_col = "Div"` any more, which I need in order to deal with NA rows without producing errors.
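For context, the NA-row filtering that `index_col = "Div"` provides can also be done after the read with `dropna`, so it does not depend on any particular engine. A minimal sketch with made-up data (the column names are just illustrative):

```python
import io
import pandas as pd

# Toy CSV standing in for one crawled file (hypothetical data):
# the middle row has an empty "Div" and should be dropped.
raw = io.StringIO("Div,HomeTeam,AwayTeam\nE0,Arsenal,Spurs\n,,\nE0,Leeds,Derby\n")

# Read without index_col, then drop rows where "Div" is missing --
# the same effect as indexing on "Div" and filtering NA labels.
df = pd.read_csv(raw, sep=",", engine="python")
df = df.dropna(subset=["Div"]).reset_index(drop=True)
print(df.shape)  # (2, 3)
```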

import time
import pandas as pd

def dataImport(df_raw):
    header_appended = False
    concat_i = []
    links = list(linkGenerator())  # generate the link list once instead of three times
    progress = len(links)
    count = 1
    debugging_var = 1
    print("0 of " + str(progress))
    for i in links:
        if debbuging_mode:  # module-level flag, defined elsewhere
            df_debugger = pd.read_csv(i, sep=",", header=0, engine="python", encoding="ISO-8859-1", index_col="Div")
            df_debugger.to_csv(debbuging_path + str(debugging_var) + "_of_" + str(progress) + ".csv")
            debugging_var += 1
        if not header_appended:
            print("Appending: " + str(i))
            df_raw = pd.read_csv(i, sep=",", engine="python", encoding="ISO-8859-1", index_col=False)
            header_appended = True
            print("Appended.")
            time.sleep(2)
        else:
            print("Appending: " + str(i))
            df_internal = pd.read_csv(i, sep=",", engine="python", encoding="ISO-8859-1", index_col=False)
            concat_i.append(df_internal)
            print("Appended.")
            time.sleep(2)
        print(str(count) + " of " + str(progress))
        count += 1
    # include the first file (read into df_raw) in the concatenation;
    # otherwise it is silently dropped
    df_raw = pd.concat([df_raw] + concat_i, ignore_index=True)
    df_raw.dropna(subset=["Div"], inplace=True)
    return df_raw

I tried using `names = range(100)` and the approaches from this question: import csv with different number of columns per row using Pandas

In my opinion, `df_raw = pd.concat(concat_i, ignore_index = True)` is the problem.
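It is worth noting that `pd.concat` itself copes with frames of different widths: it aligns on column names and fills the gaps with NaN, so a schema mismatch at the concat step would not raise a field-count error. A quick sanity check with toy data (not the real files):

```python
import pandas as pd

# Two toy frames with different schemas, like files from different seasons.
old = pd.DataFrame({"Div": ["E0"], "FTHG": [2]})
new = pd.DataFrame({"Div": ["E0"], "FTHG": [1], "BWH": [1.5]})

# concat aligns on column names; the column missing from `old`
# is simply filled with NaN.
merged = pd.concat([old, new], ignore_index=True)
print(merged.shape)                   # (2, 3)
print(merged["BWH"].isna().tolist())  # [True, False]
```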

Glad to receive help.

Cheers! :)

  • When you write a function, it should do 1 thing, not multiple things. As such there should be a single function to grab the files into a `list`. Once you have all the files in a list, they can be concatenated together. `df = pd.concat([pd.read_csv(file) for file in csv_file_list]).reset_index(drop=True)` – Trenton McKinney Jan 19 '21 at 18:39
  • `Expected 42 fields in line 273, saw 47` means the number of comma-separated values in a row does not match the length of the header. So, the csv file is not correctly formed at row 273. – Trenton McKinney Jan 19 '21 at 18:42
  • @TrentonMcKinney agree. But in this case it's a false friend. I've checked the files manually. Columns are correct and there are no extra commas. I did several import runs and the problem only appears when two files with different header information come together. Let's say the first batch has 50 columns, the second 30 and the last 10. Runs made up entirely of 50, 30 or 10 don't produce any errors. From 50 to 30 it crashes. From 30 to 10 as well. – HoneyCodeBadger Jan 20 '21 at 07:52
  • @TrentonMcKinney now I've checked your first comment as well. First of all, many thanks for trying to help me. That's exactly what I've done: in `for i in list(linkGenerator()): ` I use the list that I previously created in the linkGenerator method, which is stored in another place in the module. The entire code block serves only one purpose, the aggregation of the data. I tried your solution as a test as well: `csv_file_list = linkGenerator() df = pd.concat([pd.read_csv(file) for file in csv_file_list]).reset_index(drop=True)` C error: Expected 42 fields in line 273, saw 47 – HoneyCodeBadger Jan 20 '21 at 12:39
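The diagnosis in the comments can be reproduced in isolation: a single file whose rows are wider than its own header raises the error at `read_csv` time, before any concatenation is involved. A toy reproduction (synthetic two-column data):

```python
import io
import pandas as pd

# A file whose header has 2 fields but whose last row has 3 --
# this alone triggers "Expected N fields ... saw M",
# independent of how the frames are combined later.
bad = io.StringIO("Div,FTHG\nE0,2\nE0,1,extra\n")
try:
    pd.read_csv(bad, engine="python")
except pd.errors.ParserError as err:
    print("ParserError:", err)
```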

0 Answers