
I am relatively new to Python, coming from a Stata background, and am still struggling with some core Python concepts. For example, I'm currently working on a small program that hits the US Census Bureau API to geocode some addresses, and my instinct is to loop over my CSV files, feed them into the API call, and then name the output(s) sequentially using the iterator. E.g.

import censusgeocode
import json
import pandas as pd

cg = censusgeocode.CensusGeocode()
for i in range(1,3):
    k = cg.addressbatch('dta/batchfiles/split_test ' + str(i) + '.csv')
    json_string = json.dumps(k)
    test_{i} = pd.read_json(json_string)

I know the test_{i} syntax is incorrect and would return an error, but the above gives you a sense of what I am trying to do conceptually. However, I've read elsewhere (e.g. in this SO post) that this is not a good approach in Python. Can someone advise me on a better approach? Is it better to simply append all the ks together into one massive JSON file and then transform it in one go? If so, how do I go about doing that?

I have hundreds of CSV files that I want to loop over, and after calling the API for each I want to append them all together into a single dataframe. I'm not sure if that's useful context, but I'm just trying to communicate where I want to get to eventually.

Any help would be very much appreciated!

C.Robin
  • Does this answer your question? [How do I create variable variables?](https://stackoverflow.com/questions/1373164/how-do-i-create-variable-variables) – Mike Scotty Mar 10 '21 at 13:51

2 Answers


Maybe you could create one main DataFrame and add to it an extra field representing i. Then, while looping over your CSV files, you could load the data as a new DataFrame, add the i field to every row, and append the read data to your main DataFrame.
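A minimal sketch of that idea, with placeholder rows standing in for the real `cg.addressbatch(...)` output, and `pd.concat` used to append (`DataFrame.append` is deprecated in newer pandas):

```python
import pandas as pd

main_df = pd.DataFrame()
for i in range(1, 3):
    # In the real loop this would be:
    # new_df = pd.DataFrame(cg.addressbatch(...))
    new_df = pd.DataFrame({'address': [f'addr {i}a', f'addr {i}b']})  # placeholder data
    new_df['source_file'] = i  # extra field recording which CSV each row came from
    main_df = pd.concat([main_df, new_df], ignore_index=True)

print(main_df)
```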

P.S. in `k = cg.addressbatch('dta/batchfiles/split_test ' + str(i) + '.csv')` I would recommend using `os.path.join()` for this purpose.
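For example, with the directory and file names from the question:

```python
import os

# Builds the path with the separator appropriate for the current OS,
# instead of hard-coding '/' into a string concatenation.
for i in range(1, 3):
    path = os.path.join('dta', 'batchfiles', f'split_test {i}.csv')
    print(path)  # e.g. 'dta/batchfiles/split_test 1.csv' on Linux/macOS
```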

MarcoP
  • Hi Marco, thanks a lot for this. Can I ask what the benefit of using `os.path.join()` is? – C.Robin Mar 10 '21 at 15:12
  • 1
    one advantage is, for example, that if you ran your program on an OS that uses a different separator for path names (e.g. windows vs. linux) it would still work. Also os.path has a set of very useful tools for dealing with paths, it's worth giving it a look :) – MarcoP Mar 10 '21 at 15:14
  • Awesome. Thanks again! I will look into it – C.Robin Mar 10 '21 at 15:20

You really do not want to pass through JSON!

addressbatch returns a nice list of dicts that can be used directly to feed a pandas DataFrame.

So you have two ways here: build a large list of dicts and build the dataframe at the end:

data = []
for i in range(1,3):
    k = cg.addressbatch('dta/batchfiles/split_test ' + str(i) + '.csv')
    data.extend(k)         # add the new rows to the main list

df = pd.DataFrame(data)

The alternative is to build a dataframe per CSV file, store it in a list, and concat everything at the end:

dfs = []
for i in range(1,3):
    k = cg.addressbatch('dta/batchfiles/split_test ' + str(i) + '.csv')
    dfs.append(pd.DataFrame(k))

df = pd.concat(dfs)

Both methods are roughly equivalent. In my tests, the first one seems to be slightly more efficient, but within the same order of magnitude.
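Since the eventual goal is hundreds of CSV files, the same pattern scales directly. The sketch below simulates the list-of-dicts that `addressbatch` returns (the field names are illustrative, not the geocoder's exact schema) and uses `ignore_index=True` so the final frame gets a single clean 0..n-1 index instead of repeating each batch's own index:

```python
import pandas as pd

# Each API call returns a list of dicts, one dict per address.
batches = [
    [{'id': '1', 'address': '100 Main St', 'match': True},
     {'id': '2', 'address': '200 Oak Ave', 'match': False}],
    [{'id': '3', 'address': '300 Pine Rd', 'match': True}],
]

# One DataFrame per batch, concatenated once at the end.
dfs = [pd.DataFrame(batch) for batch in batches]
df = pd.concat(dfs, ignore_index=True)
print(df.shape)  # (3, 3)
```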

Serge Ballesta