I know how to hard-code heading names, but I need to generate them from my array. Is this possible?

My data is scraped dynamically, so I cannot hard-code my headings or columns.

results_headings contains strings such as Animal, Mineral, Vegetable.

results_columns contains strings such as Bear, Quartz, Brocolli.

My code

#Imports
from bs4 import BeautifulSoup
import requests
import pandas as pd 

#Specify URL & Assign to page object
url = 'http://www.example.com'
page = requests.get(url)

#Parse the page text with the HTML parser
soup = BeautifulSoup(page.text, 'html.parser')

#Find our information
boxinfo = soup.find("div", {"class": "box1"})   #The markup below uses class="box1", not an id
headings = boxinfo.find_all("td", {"class": "label"})
columns = boxinfo.find_all("td")

#Get the headings
results_headings = []
for result in headings:
    result_NoHTML = result.getText()
    results_headings.append(result_NoHTML)

#Get the columns
results_columns = []
for result2 in columns:
    result2_NoHTML = result2.getText()
    results_columns.append(result2_NoHTML)

df = pd.DataFrame(results_headings, results_columns)   
df.to_csv('index.csv', index=False, encoding='utf-8')

Table structure I am scraping from

<div class="box1">
<table class="table1">
<tr><td class="label">Item1</td><td>Value1</td></tr>
<tr><td class="label">Item2</td><td>Value2</td></tr>
<tr><td class="label">Item3</td><td>Value3</td></tr>
<tr><td class="label">Item4</td><td>Value4</td></tr>
</table>
</div>
Ninja2k
  • Are you sure you need to work with pandas? Can numpy be a solution if your data are numerical? – Guillaume Jacquenot Jul 03 '18 at 21:00
  • @Guillaume Jacquenot My data is text based, from a Beautiful Soup dataset. – Ninja2k Jul 03 '18 at 21:02
  • Please post a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve) so we can better understand your problem. In this case, can you give us a sample dataset or array that you are trying to pull from? What have you tried so far? – BenG Jul 03 '18 at 21:14

3 Answers

Suppose you've scraped your data and ended up with a dataframe like the one below. The columns are unnamed, and the intended column names sit in the first row alongside your data:

df = pd.DataFrame([['Animal', 'Mineral', 'Vegetable'],
                   ['Bear', 'Quartz', 'Brocolli'],
                   ['Turtle', 'Amethyst', 'Asparagus']])

print(df)

        0         1          2
0  Animal   Mineral  Vegetable
1    Bear    Quartz   Brocolli
2  Turtle  Amethyst  Asparagus

You can construct a new dataframe starting from the second row and assign the first row as columns:

df = pd.DataFrame(df.values[1:], columns=df.values[0])

print(df)

   Animal   Mineral  Vegetable
0    Bear    Quartz   Brocolli
1  Turtle  Amethyst  Asparagus
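
If the headings and values arrive as two separate lists rather than as rows of one dataframe, the same idea can be applied when the dataframe is constructed. This is a minimal sketch, assuming results_headings and results_columns are flat lists of equal length; the sample values are stand-ins taken from the question, not real scraped output.

```python
import pandas as pd

# Stand-in values for the scraped lists (assumed to be flat and equal in length)
results_headings = ['Animal', 'Mineral', 'Vegetable']
results_columns = ['Bear', 'Quartz', 'Brocolli']

# Wrap the values in an outer list so they form a single row,
# and let the headings list supply the column names
df = pd.DataFrame([results_columns], columns=results_headings)

print(df)
```

Passing the two lists positionally, as in pd.DataFrame(results_headings, results_columns), treats the second list as the row index rather than as column names, which is why the keyword argument is used here.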
jpp
  • It works fine for hard-coded values, but it fails when I use my own values from results_headings/results_columns. – Ninja2k Jul 03 '18 at 23:09
  • @Ninja2k, It would help if you were able to show us what you are starting with. Your last edit is good, but I still can't *see* what you're currently getting into your dataframe, what's missing and where you can extract column names. See [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for more advice. – jpp Jul 03 '18 at 23:10

You can create a dataframe from a dict built from results_headings and results_columns:

import pandas as pd
results_headings = ['col 1', 'col 2']
results_columns = [('a','bb'), ('ccc','dddd')]
data_dict = {h: c for h, c in zip(results_headings, results_columns)}
df = pd.DataFrame(data_dict)   
df.to_csv('index.csv', index=False, encoding='utf-8')
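
For the table layout shown in the question, the dict could be built row by row so that each label stays paired with its own value. The following is only a sketch under assumptions: the html string is a shortened copy of the question's markup, and it assumes every row holds exactly one label cell followed by one value cell.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Shortened copy of the markup from the question (assumption: real pages follow the same layout)
html = """
<div class="box1">
<table class="table1">
<tr><td class="label">Item1</td><td>Value1</td></tr>
<tr><td class="label">Item2</td><td>Value2</td></tr>
</table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
box = soup.find("div", {"class": "box1"})

data_dict = {}
for row in box.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) == 2:  # skip rows that do not match the label/value layout
        label, value = (cell.get_text(strip=True) for cell in cells)
        data_dict[label] = [value]  # one-element list, so each label becomes a column

df = pd.DataFrame(data_dict)
print(df)
```

Walking the rows avoids the mismatch in the question's code, where find_all("td") returns the label cells together with the value cells.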
Guillaume Jacquenot

You can also use the pandas read_html function and pass in your table's id. I have done this in combination with bs4: isolate the entire table, then send that HTML to the function.

The documentation describes it pretty well: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html
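
A rough sketch of that approach follows, assuming an HTML parser backend such as lxml is installed for read_html. The table in the question has a class rather than an id, so the attrs argument matches on class here; the html string is a copy of the question's markup, and the reshaping step at the end is one possible way to turn the labels into headings.

```python
from io import StringIO
import pandas as pd

# Copy of the markup from the question
html = """
<div class="box1">
<table class="table1">
<tr><td class="label">Item1</td><td>Value1</td></tr>
<tr><td class="label">Item2</td><td>Value2</td></tr>
<tr><td class="label">Item3</td><td>Value3</td></tr>
<tr><td class="label">Item4</td><td>Value4</td></tr>
</table>
</div>
"""

# read_html returns a list of dataframes, one per matching <table>
tables = pd.read_html(StringIO(html), attrs={"class": "table1"})
table = tables[0]  # two unnamed columns: 0 holds the labels, 1 holds the values

# Use the label column as the index, then transpose so the labels become headings
wide = table.set_index(0)[1].to_frame().T
print(wide)
```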

hpca01