I am pretty sure I cobbled together the wrong parts of other people's code. When I run the code below, the output file keeps growing, when the original idea was to keep only the rows whose bbl numbers appear in both dataframes.
import pandas as pd
#reference dataframe that is only one column of numbers
dfr = pd.read_excel('C:/pythonstuff/CRMBBLS.xlsx')
reflist = dfr['bbl'].tolist()
dfr['bbl'] = dfr['bbl'].astype(str)
chunksize = 1000
for chunk in pd.read_csv('C:/pythonstuff/pluto_21v3.csv', chunksize=chunksize):
    #dictionary to remap borough to number
    di = {"MN": 1, "BX": 2, "BK": 3, "QN": 4, "SI": 5}
    chunk["borough"] = chunk["borough"].map(di)
    #add leading zeros to prepare concatenation of columns
    chunk['block'] = chunk['block'].apply(lambda x: '{0:0>5}'.format(x))
    chunk['lot'] = chunk['lot'].apply(lambda x: '{0:0>4}'.format(x))
    #create our bbl column to compare to our reference dataframe (bbl in dfr)
    chunk["bbl"] = chunk["borough"].astype(str) + chunk["block"].astype(str) + chunk["lot"].astype(str)
    mergedStuff = pd.merge(dfr, chunk, on=['bbl'], how='inner')
    chunk.to_csv("C:/pythonstuff/final.csv",
                 header=header, mode='a')
I'm guessing the merge step is keeping everything instead of throwing away the rows I don't want, and then all of it gets appended to a massive csv file.
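A minimal sketch of the intended behaviour, using made-up in-memory data instead of the real files. The sample rows, the names `reference`, `borough_codes` and `matched`, and the `io.StringIO` stand-ins for the input and output files are all assumptions for illustration. It differs from the code above in two ways: it writes the merge result rather than the whole chunk, and it writes the header only for the first chunk.

```python
import io

import pandas as pd

# Hypothetical stand-in for the one-column reference spreadsheet.
reference = pd.DataFrame({"bbl": [1000010001, 3000020002]})
reference["bbl"] = reference["bbl"].astype(str)

# Hypothetical stand-in for the large CSV; only the second row has no match.
pluto_csv = io.StringIO(
    "borough,block,lot,address\n"
    "MN,1,1,1 Example St\n"
    "BX,5,7,2 Example Ave\n"
    "BK,2,2,3 Example Rd\n"
)

borough_codes = {"MN": 1, "BX": 2, "BK": 3, "QN": 4, "SI": 5}

out = io.StringIO()  # stands in for the output file
first_chunk = True
for chunk in pd.read_csv(pluto_csv, chunksize=2):
    # Build the 10-character bbl: borough digit + 5-digit block + 4-digit lot.
    chunk["bbl"] = (
        chunk["borough"].map(borough_codes).astype(str)
        + chunk["block"].astype(str).str.zfill(5)
        + chunk["lot"].astype(str).str.zfill(4)
    )
    # The inner merge keeps only rows whose bbl is in the reference.
    matched = pd.merge(reference, chunk, on="bbl", how="inner")
    # Write the merge result (not the chunk), with the header only once.
    matched.to_csv(out, header=first_chunk, index=False)
    first_chunk = False

result = pd.read_csv(io.StringIO(out.getvalue()), dtype={"bbl": str})
```

With real files, the same loop would presumably call `to_csv` on a path with `mode='a'`. In that case the output file would need to be empty or deleted before each run, since appending to an old `final.csv` also makes it grow.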
I am truly a beginner at computer science. If this is completely off the mark, I would appreciate at least a cardinal direction to head in so I can start helping myself. I can't even parse the documentation half the time. Thank you.