
I am pretty sure I cobbled together the wrong parts of other people's code. When I run the code below, the output file just keeps growing, when the original idea was to keep only the rows with matching bbl numbers from the two dataframes.

import pandas as pd

# reference dataframe: a single column of bbl numbers
dfr = pd.read_excel('C:/pythonstuff/CRMBBLS.xlsx')
reflist = dfr['bbl'].tolist()
# cast to str so it matches the string bbl built from the csv chunks below
dfr['bbl'] = dfr['bbl'].astype(str)
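
Before merging, it may help to confirm the reference column is unique and string-typed; a quick sanity check (a sketch, assuming the file loads as above):

# merge keys should be unique on the reference side, or the merge can
# duplicate rows rather than just filter them
print(dfr['bbl'].is_unique)
# should be object (str) so it matches the constructed bbl column below
print(dfr['bbl'].dtype)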


chunksize = 1000
# dictionary to remap borough code to its digit
di = {"MN": 1, "BX": 2, "BK": 3, "QN": 4, "SI": 5}
for chunk in pd.read_csv('C:/pythonstuff/pluto_21v3.csv', chunksize=chunksize):
    chunk["borough"] = chunk["borough"].map(di)
    # zero-pad block (5 digits) and lot (4 digits) to prepare concatenation
    chunk['block'] = chunk['block'].apply(lambda x: '{0:0>5}'.format(x))
    chunk['lot'] = chunk['lot'].apply(lambda x: '{0:0>4}'.format(x))
    # build the bbl column to compare against the reference dataframe (bbl in dfr),
    # e.g. borough "MN" (1), block 47, lot 1 -> "1" + "00047" + "0001" = "1000470001"
    chunk["bbl"] = chunk["borough"].astype(str) + chunk["block"].astype(str) + chunk["lot"].astype(str)
    mergedStuff = pd.merge(dfr, chunk, on=['bbl'], how='inner')
    chunk.to_csv("C:/pythonstuff/final.csv",
        header=header, mode='a')

I'm guessing the merge part is merging everything and not throwing away the rows I don't want, and then appending it all to a massive csv file.

I am truly a beginner at computer science. If this is completely off the mark, I would appreciate at least a cardinal direction to point me toward helping myself. I can't even parse the documentation half the time. Thank you.
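
For context, `how='inner'` does keep only the matching rows; a minimal sketch (with made-up bbl values) shows the merge itself filtering as intended:

left = pd.DataFrame({'bbl': ['1000470001', '2000010002']})
right = pd.DataFrame({'bbl': ['1000470001', '3000050003'], 'val': [10, 20]})
# only the one bbl present in both frames survives the inner merge
print(pd.merge(left, right, on=['bbl'], how='inner'))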

  • Welcome to StackOverflow! I see you're using an inner merge, which is a reasonable approach, and I don't see any issues with the code. Could you post a short sample of each dataframe so we can test this code on our own systems? Here's a good guide on how to post a dataset: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – Nick ODell Nov 19 '21 at 00:15
  • Hey, thank you for reaching out. I'm using Jupyter notebooks to do stuff, so I need to look at that link better. But real quick: the reference file I'm using is ~8,700 rows by one column. The database file pluto.csv is 858,926 rows by 69 columns and is 360 MB, and the final output file is not much smaller. It should be a lot smaller, I would think. Not sure if that helps, but I will try to show some outputs if you suggest it. – oglebee Nov 19 '21 at 14:47
  • Are `dfr['bbl']` unique? – westandskif Nov 19 '21 at 15:52
  • @oglebee The most important step is to find a subset of your data which continues to show the problem. For example, if you take the first 6 rows of both input datasets and run your code, is the output of that smaller problem correct? Generally, if there's a problem when processing gigabytes of data, that problem will also show up at much smaller data sizes, which is easier for StackOverflow users to understand. – Nick ODell Nov 19 '21 at 16:26
  • Nick and Nikita, I apologize, there is a glaring error in my code: I needed to be appending the mergedStuff dataframe to the csv, not the chunk (see the corrected sketch below). The values in `dfr['bbl']` are all unique numbers. After going through the rest of the PLUTO database file, I found I could throw away so much of it that I could parse it using simple pandas and Python methods, which allowed me to check my work. IDK how you guys wake up in the morning, but it's incredible you found my post and reached out to help. – oglebee Nov 19 '21 at 18:56
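
For reference, a corrected version of the write step (a sketch; the first-chunk header handling and index=False are assumptions, since the original header variable isn't shown):

first_chunk = True
for chunk in pd.read_csv('C:/pythonstuff/pluto_21v3.csv', chunksize=chunksize):
    ...  # same borough/block/lot/bbl preparation as above
    mergedStuff = pd.merge(dfr, chunk, on=['bbl'], how='inner')
    # append the merged (filtered) rows, not the raw chunk; write the header
    # only for the first chunk, and skip the row index in the output
    mergedStuff.to_csv("C:/pythonstuff/final.csv", header=first_chunk,
                       index=False, mode='a')
    first_chunk = False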
