
EDIT 2, 9/1: See my answer below!

Pretty new to Python and pandas here. I've got a script that uses a for loop to query my database with each line in my list. That part works great, but I can't figure out how to build a DataFrame from the results of that loop. Any and all pointers are welcome!

#Remove stuff
print "Cleaning list"
def multiple_replacer(key_values):
  replace_dict = dict(key_values)
  replacement_function = lambda match: replace_dict[match.group(0)]
  pattern = re.compile("|".join([re.escape(k) for k, v in key_values]), re.M)
  return lambda string: pattern.sub(replacement_function, string)

multi_line = multiple_replacer(key_values)

print "Querying Database..."
for line in source:
  brand_url = multi_line(line)
  #Run Query with cleaned list
  mysql_query = ("select ub.url as 'URL', b.name as 'Name', b.id as 'ID' from api.brand b join api.url_brand ub on b.id=ub.brand_id where ub.url like '%%%s%%' and b.deleted=0 group by 3;" % brand_url)
  list1 = []
  brands = my_query('prod', mysql_query)

print "Writing CSV..."
#Create DF and CSV
for row in brands:
  list1.append({"URL":row['URL'],"Name":['Name'],"ID":['ID']})
  if brands.shape == (3,0):
    df1 = pd.DataFrame(data = brands, columns=['URL','Name','ID'])
  output = df1.to_csv('ongoing.csv',index=False)

EDIT 8/30: Here is my edit, attempting to use zyxue's method:

#Remove stuff
print "Cleaning list"
def multiple_replacer(key_values):
  replace_dict = dict(key_values)
  replacement_function = lambda match: replace_dict[match.group(0)]
  pattern = re.compile("|".join([re.escape(k) for k, v in key_values]), re.M)
  return lambda string: pattern.sub(replacement_function, string)

multi_line = multiple_replacer(key_values)

print "Querying Database..."
for line in source:
  brand_url = multi_line(line)
  #Run Query with cleaned list
  mysql_query = ("select ub.url as 'URL', b.name as 'Name', b.id as 'ID' from api.brand b join api.url_brand ub on b.id=ub.brand_id where ub.url like '%%%s%%' and b.deleted=0 group by 3;" % brand_url)
  brands = my_query('prod', mysql_query)

print "Writing CSV..."
#Create DF and CSV
records = []
for row in brands:
  records.append({"URL":row['URL'],"Name":['Name'],"ID":['ID']})
  if brands.shape == (3,0):
    records.append(dict(zip(brands, ['URL', 'Name', 'ID'])))
df1 = pd.DataFrame.from_records(records)
output = df1.to_csv('ongoing.csv', index=False)

but this only returns a blank CSV. I'm sure I'm applying it wrong.

sstewart
  • you're overwriting your dataframe with each iteration of your loop. – Paul H Aug 28 '15 at 22:40
  • Yes, I figured that after I posted. Not sure on a solution, however. Still very new at this and have only worked with static lists, not a for loop, for something like this. – sstewart Aug 29 '15 at 01:19
  • http://stackoverflow.com/questions/17530542/how-to-add-pandas-data-to-an-existing-csv-file – JohnE Aug 29 '15 at 12:50
  • @JohnE, what specifically should I be looking at in that question? The answers there assume my CSV already exist. – sstewart Aug 30 '15 at 15:31
  • Well, the CSV does exist after you write to it once, right? You just want to append instead of over-writing. Specifically, I was looking at the answer with the most votes (it wasn't the accepted answer though) using the `mode='a'` option. I have not tested it, just thought it looked like a good approach. – JohnE Aug 30 '15 at 15:41
  • Ah, yes. That makes sense. Tried it, but to no avail. I suppose if I created a CSV file before I did the for loop it would work, although I would need to clear that file every time I used to script, no? – sstewart Aug 30 '15 at 19:25
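For anyone trying the `mode='a'` route from the comments above: it can be made safe to re-run by deleting the old file at the start of the script and writing the header row only when the file doesn't exist yet. A rough sketch (the file name and the stubbed per-line results are just for illustration):

```python
import os

import pandas as pd

path = 'ongoing_demo.csv'
if os.path.exists(path):
    os.remove(path)  # start fresh on each run of the script

def append_frame(df, path):
    # Only the first write emits the header row; later writes append data only.
    df.to_csv(path, mode='a', header=not os.path.exists(path), index=False)

# Two fake "per-line" query results standing in for the loop over source
for chunk in ([{'URL': 'a.com', 'Name': 'A', 'ID': 1}],
              [{'URL': 'b.com', 'Name': 'B', 'ID': 2}]):
    append_frame(pd.DataFrame(chunk, columns=['URL', 'Name', 'ID']), path)
```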

2 Answers

records = []
for row in brands:
    # if brands.shape == (3,0):
    #     records.append(dict(zip(brands, ['URL', 'Name', 'ID'])))
    # update bug fix:
    if row.shape == (3,0):
        records.append(dict(zip(row, ['URL', 'Name', 'ID'])))

df1 = pd.DataFrame.from_records(records)
output = df1.to_csv('ongoing.csv', index=False)
# ref:
# >>> pd.DataFrame.from_records([{'a': 1, 'b':2}, {'a': 11, 'b': 22}])
#     a   b
#     0   1   2
#     1  11  22
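For a runnable illustration of the same accumulate-then-construct pattern, here is a sketch with the database query stubbed out as lists of dict rows (the stub data is made up):

```python
import pandas as pd

# Stub for my_query: pretend each input line's query returned a list of dict rows
fake_query_results = [
    [{'URL': 'http://teespring.com', 'Name': 'Teespring', 'ID': 12345}],
    [{'URL': 'http://example.com', 'Name': 'Example', 'ID': 67890}],
]

records = []
for brands in fake_query_results:   # outer loop: one query per source line
    for row in brands:              # inner loop: accumulate every returned row
        records.append({'URL': row['URL'], 'Name': row['Name'], 'ID': row['ID']})

df1 = pd.DataFrame.from_records(records)
print(df1.shape)  # (2, 3)
```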
zyxue
  • I attempted to implement this solution and it simply returned an empty set. Perhaps I'm not setting it up properly in my code? – sstewart Aug 29 '15 at 01:20
  • I updated my question above with what it looks like after applying your code. – sstewart Aug 29 '15 at 15:44
  • First, take this line `records.append({"URL":row['URL'],"Name":['Name'],"ID":['ID']})` out of the for loop; you shouldn't append it for every element of brands. Second, please show a couple of examples of brands (e.g. what does `print brands[:2]` look like?). Also make sure `brands` does have elements of shape (3,0). – zyxue Aug 29 '15 at 15:49
  • Thanks for working through this with me. here is a sample output when using `print brands`: `[{'URL': 'http://teespring.com', 'Name': 'Teespring', 'ID': 12345L}]` – sstewart Aug 29 '15 at 18:59
  • Sorry, there was a bug in my code; it should've been `row.shape` and `zip(row, [...])`. I have updated my answer. **However**, if `brands` is already a list of dictionaries, simply try `df1 = pd.DataFrame.from_records(brands)` and you should get the proper `df1` you want, no for loop needed. – zyxue Aug 29 '15 at 19:16
  • Doesn't seem to work for me. I'm not sure what the issue is, because from what I can tell, your code **should** work. – sstewart Aug 30 '15 at 15:24

Okay, I figured it out, and I thought I should post the working script. @zyxue was pretty much right.

import re
import pandas as pd

source = open('urls.txt')
key_values = ("http://",""), ("https://",""), ("www.",""), ("\n","")

#Remove stuff
print "Cleaning list"
def multiple_replacer(key_values):
  replace_dict = dict(key_values)
  replacement_function = lambda match: replace_dict[match.group(0)]
  pattern = re.compile("|".join([re.escape(k) for k, v in key_values]), re.M)
  return lambda string: pattern.sub(replacement_function, string)

multi_line = multiple_replacer(key_values)

print "Querying Database..."
records = []
for line in source:
  brand_url = multi_line(line)
  #Run Query with cleaned list
  mysql_query = ("select ub.url as 'URL', b.name as 'Name', b.id as 'ID' from api.brand b join api.url_brand ub on b.id=ub.brand_id where ub.url like '%%%s%%' and b.deleted=0 group by 3;" % brand_url)
  brands = my_query('prod', mysql_query)
  #Append results to dict (records)
  for row in brands:
      records.append({"URL":row['URL'],"Name":row['Name'],"ID":row['ID']})

#Create DataFrame
df = pd.DataFrame.from_records(records)

#Create CSV    
output = df.to_csv('ongoing.csv',index=False)
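As a sanity check, the cleaning helper can be exercised on its own, without the database:

```python
import re

def multiple_replacer(key_values):
    # Build one regex that matches any key and replaces it with its value
    replace_dict = dict(key_values)
    replacement_function = lambda match: replace_dict[match.group(0)]
    pattern = re.compile("|".join([re.escape(k) for k, v in key_values]), re.M)
    return lambda string: pattern.sub(replacement_function, string)

key_values = ("http://", ""), ("https://", ""), ("www.", ""), ("\n", "")
multi_line = multiple_replacer(key_values)

print(multi_line("https://www.teespring.com\n"))  # teespring.com
```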

Essentially, I needed to nest the second for loop under the first and create the 'records' list before the looping began. That way a dict gets appended to the list for every row returned, for every line in 'source'. Seems like a pretty simple concept now!
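One more note for future readers: interpolating brand_url into the SQL with % formatting works, but it breaks if a URL ever contains a quote. If your query helper can pass parameters through to the driver, the LIKE pattern can go in as a bound parameter instead. A sketch using sqlite3 as a stand-in for MySQL (MySQLdb uses `%s` placeholders rather than `?`; the table here is made up):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("create table brand (id integer, name text, url text, deleted integer)")
conn.execute("insert into brand values (12345, 'Teespring', 'teespring.com', 0)")

brand_url = 'teespring.com'
# The % wildcards live in the bound parameter, not in the SQL string itself
rows = conn.execute(
    "select url, name, id from brand where url like ? and deleted = 0",
    ('%' + brand_url + '%',),
).fetchall()
print(rows)  # [('teespring.com', 'Teespring', 12345)]
```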

sstewart