0

I am having some trouble creating a pandas df from lists I generate while scraping data from the web. Here I am using beautifulsoup to pull a few pieces of information about local farms from localharvest.org (farm name, city, and description). I am able to scrape the data effectively, creating a list of objects on each pass. The trouble I'm having is outputting these lists into a tabular df.

My complete code is as follows:

import requests
from bs4 import BeautifulSoup
import pandas

url = "http://www.localharvest.org/search.jsp?jmp&lat=44.80798&lon=-69.22736&scale=8&ty=6"
r = requests.get(url)
soup = BeautifulSoup(r.content)


data = soup.find_all("div", {'class': 'membercell'})

fname = []
fcity = []
fdesc = []

for item in data:
    name = item.contents[1].text
    fname.append(name)
    city = item.contents[3].text
    fcity.append(city)
    desc = item.find_all("div", {'class': 'short-desc'})[0].text
    fdesc.append(desc)

df = pandas.DataFrame({'fname': fname, 'fcity': fcity, 'fdesc': fdesc})

print (df)

df.to_csv('farmdata.csv')

Interestingly, the print(df) function shows that all three lists have been passed to the dataframe. But the resultant .CSV output contains only a single column of values (fcity) with the fname and fdesc column labels present. Interstingly, If I do something crazy like try to force tab delineated output with df.to_csv('farmdata.csv', sep='\t'), I get a single column with jumbled output, but it appears to at least be passing the other elements of the dataframe.

Thanks in advance for any input.

JeremyD
  • 125
  • 1
  • 3
  • 8

3 Answers3

1

It works for me:

# Taking a few slices of each substring of a given string after stripping off whitespaces
df['fname'] = df['fname'].str.strip().str.slice(start=0, stop=20)
df['fdesc'] = df['fdesc'].str.strip().str.slice(start=0, stop=20)
df.to_csv('farmdata.csv')
df

                fcity                 fdesc                 fname
0  South Portland, ME  Gromaine Farm is pro         Gromaine Farm
1         Newport, ME  We are a diversified    Parker Family Farm
2           Unity, ME  The Buckle Farm is a       The Buckle Farm
3      Kenduskeag, ME  Visit wiseacresfarm.       Wise Acres Farm
4      Winterport, ME  Winter Cove Farm is       Winter Cove Farm
5          Albion, ME  MISTY BROOK FARM off      Misty Brook Farm
6  Dover-Foxcroft, ME  We want you to becom           Ripley Farm
7         Madison, ME  Hide and Go Peep Far  Hide and Go Peep Far
8            Etna, ME  Fail Better Farm is       Fail Better Farm
9      Pittsfield, ME  We are a family farm  Snakeroot Organic Fa

Maybe you had a lot of empty spaces which was misinterpreted by the default delimiter(,) and hence picked up fcity column as it contained(,) in it which led to the ordering getting affected.

Nickil Maveli
  • 29,155
  • 8
  • 82
  • 85
  • 1
    This also works. Stripping out excessive newline characters and spaces seems to be the key. Thank you for your help! – JeremyD Aug 19 '16 at 16:01
1

Consider, instead of using lists of the information for each farm entity that you scrape, to use a list of dictionaries, or a dict of dicts. eg:

[{name:farm1, city: San Jose... etc},
{name: farm2, city: Oakland...etc}]

Now you can call Pandas.DataFrame.from_dict() on the above defined list of dicts.

Pandas method: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_dict.html

An answer that might describe this solution in more detail: Convert Python dict into a dataframe

Community
  • 1
  • 1
Mansweet
  • 161
  • 7
1

Try stripping out the newline and space characters:

import requests
from bs4 import BeautifulSoup
import pandas

url = "http://www.localharvest.org/search.jsp?jmp&lat=44.80798&lon=-69.22736&scale=8&ty=6"
r = requests.get(url)
soup = BeautifulSoup(r.content)


data = soup.find_all("div", {'class': 'membercell'})

fname = []
fcity = []
fdesc = []

for item in data:
    name = item.contents[1].text.split()
    fname.append(' '.join(name))
    city = item.contents[3].text.split()
    fcity.append(' '.join(city))
    desc = item.find_all("div", {'class': 'short-desc'})[0].text.split()
    fdesc.append(' '.join(desc))

df = pandas.DataFrame({'fname': fname, 'fcity': fcity, 'fdesc': fdesc})

print (df)

df.to_csv('farmdata.csv')
zarak
  • 2,933
  • 3
  • 23
  • 29
  • 1
    This worked perfectly. I assumed the problem was coming from the desc field. I notice beautifulsoup tends to add lots of newline characters. Someone else had advocated using `.split()` to remove the newline characters. This is a great help. Thank you. – JeremyD Aug 19 '16 at 15:45
  • If you were using a spreadsheet program to view the csv file, my guess is that the newline characters made it look like the cell was empty when in fact it was only displaying the first (empty) line. Glad you got it working. Please consider upvoting and/or [accepting](http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work) the answer(s) you found helpful :) – zarak Aug 19 '16 at 15:55