I have a value that I have stored in a string. I would like to append that value only to rows that meet certain criteria, and not to any others.

The following image shows the tables I need to parse. I can easily parse the file with BeautifulSoup and turn it into a Pandas DataFrame, but for both of the tables below I'm struggling to capture and append the Package prices to the entire DataFrame. Ideally the Price values would go alongside every Fish-Weight pair; so a single column of the same Price value.

[Image: the tables to be parsed]

Here is the code I use to parse the tables:

import email

import pandas as pd
from bs4 import BeautifulSoup

with open(file_path) as in_f:
    msg = email.message_from_file(in_f) #type: <class 'email.message.Message'>

html_msg = msg.get_payload(1)   #type: <class 'email.message.Message'>

body = html_msg.get_payload(decode=True)    #type: <class 'bytes'> or type: 'int'

html = body.decode()    #type: <class 'str'>

# Name the parser explicitly so the result does not depend on which parsers are installed
tablez = BeautifulSoup(html, 'html.parser').find_all("table")  #type: <class 'bs4.element.ResultSet'>
data = []
for table in tablez:
    for row in table.find_all("tr"):
        data.append([cell.text.strip() for cell in row.find_all("td")])

fish_frame = pd.DataFrame(data)

This is what data is:

data: [['Species', 'Price', 'Weight'], ['GBW Cod', '.55', '8,059'], ['GBE Haddock', '.03', '14,628'], ['GBW Haddock', '.02', '87,451'], ['GB YT', '1.50', '1,818'], ['Witch', '1.25', '1,414'], ['GB Winter', '.40', '23,757'], ['Redfish', '.02', '123'], ['White Hake', '.40', '934'], ['Pollock', '.02', '7,900'], ['Package Price:', '', '$21,151.67'], ['Species', 'Weight'], ['GBE Cod', '820'], ['GBW Cod', '15,279'], ['GBE Haddock', '32,250'], ['GBW Haddock', '192,793'], ['GB YT', '6,239'], ['SNE YT', '2,018'], ['GOM YT', '1,511'], ['Plaice', '2,944'], ['Witch', '1,100'], ['GB Winter', '158,608'], ['White Hake', '31'], ['Pollock', '1,983'], ['SNE Winter', '7,257'], ['Price', '$58,500.00'], ['Species', 'Weight'], ['GBE Cod', '792'], ['GBW Cod', '14,767'], ['GBE Haddock', '29,199'], ['GBW Haddock', '174,556'], ['GB YT', '5,268'], ['SNE YT', '544'], ['GOM YT', '1,957'], ['Plaice', '2,452'], ['Witch', '896'], ['GB Winter', '163,980'], ['White Hake', '8'], ['Pollock', '1,743'], ['SNE Winter', '3,709'], ['Price', '$57,750.00']]

And then I use this bit of code to capture the Package price:

stew = BeautifulSoup(html, 'html.parser')
chunks = stew.find_all('p', {'class' : "MsoNormal"})        
for line in chunks:
    if 'Package' in line.text:
        package_price = line.text
        print("package_price:", package_price)
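
If only the dollar amount is needed rather than the whole captured line, a regular expression can pull it out. This is a minimal sketch, assuming the line looks like the example quoted in the comments; `extract_amount` is a hypothetical helper name, not part of BeautifulSoup or Pandas.

```python
import re

# Example line quoted in the comments on this question.
package_price = "#891-2: Package for $21,151.67 but willing to sell species individually"

def extract_amount(text):
    """Return the first dollar amount in text as a float, or None if there is none."""
    # Hypothetical helper: matches "$" followed by digits, commas and optional decimals.
    match = re.search(r"\$([\d,]+(?:\.\d+)?)", text)
    if match is None:
        return None
    # Drop the thousands separators before converting.
    return float(match.group(1).replace(",", ""))

amount = extract_amount(package_price)
print(amount)
```

Keeping the amount as a float makes later arithmetic possible, while the original line can still be stored in a separate column if the free text matters.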

But I'm now struggling to add that Price value to its own column in the dataframe. Doing a command such as fish_frame = pd.DataFrame(package_price) results in:

Traceback (most recent call last):
  File "Z:/Code/NEFS_stock_then_weight_attempt3.py", line 236, in <module>
    fish_frame = pd.DataFrame(package_price)
  File "C:\Users\stephen.mahala\AppData\Local\Programs\Python\Python35-32\lib\site-packages\pandas\core\frame.py", line 345, in __init__
    raise PandasError('DataFrame constructor not properly called!')
pandas.core.common.PandasError: DataFrame constructor not properly called!

due to reasons that are unknown to me. Turning it into a list, however, results in the string being broken up and each character becoming its own list, and therefore each of those becomes its own cell in the DataFrame.
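
The two symptoms described here have a common cause: the DataFrame constructor expects a collection of rows or columns, not a bare string, and `list()` on a string iterates over its characters. A short sketch of the difference, using the example line from the comments and an illustrative column name `package` that is not in the original code:

```python
import pandas as pd

package_price = "#891-2: Package for $21,151.67 but willing to sell species individually"

# list() iterates over the string, so every character becomes its own element.
as_chars = list(package_price)

# Square brackets build a one-element list that holds the whole string.
as_single = [package_price]

# A one-element list gives a frame with a single row and a single column.
# The column name "package" is an assumption made for this illustration.
price_frame = pd.DataFrame(as_single, columns=["package"])

print(len(as_chars), len(as_single))
print(price_frame.shape)
```

This still creates a separate one-cell frame, so it does not attach the value to `fish_frame`; it only shows why the string was split apart.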

Is there a method with Pandas or with BeautifulSoup that I'm unaware of that will simplify the process of adding this single value to my DataFrame?

theprowler
  • You should amend your question to show the full, specific traceback of the error you receive. – David Zemens Jun 27 '17 at 18:59
  • I create `fish_frame` right after parsing the table, in my first chunk of code – theprowler Jun 27 '17 at 19:02
  • Yes, I see how you create/initialize it. Can you show the *full* traceback? – David Zemens Jun 27 '17 at 19:03
  • Ok. Full error message added. – theprowler Jun 27 '17 at 19:05
  • And what exactly are you "turning in to a list"? – David Zemens Jun 27 '17 at 19:05
  • `fish_frame = pd.DataFrame(package_price)` doesn't exist in the code you provided. You do have `fish_frame = pd.DataFrame(data)`. Where is the code that actually raises this error? – David Zemens Jun 27 '17 at 19:06
  • I removed `fish_frame = pd.DataFrame(package_price)` because it causes the error. My bit of code that captures the `Package` price captures the entire line just above the data table; `#891-2: Package for $21,151.67 but willing to sell species individually`. I would like that entire captured line to be appended to each Fish-Weight pair that it corresponds to; so just that first table, not the second one as it doesn't apply to the second one. – theprowler Jun 27 '17 at 19:09
  • please remove the second table if it's not necessary. Also, please include markup html representation of the table, if you expect anyone to actually be able to debug this and help you. – David Zemens Jun 27 '17 at 19:20
  • What is the *exact* output of `data` used to construct the dataframe? – Alexander Jun 27 '17 at 19:24
  • You may also want to read this about asking pandas questions: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – Alexander Jun 27 '17 at 19:30
  • @DavidZemens the second table is necessary though to indicate that I cannot blindly assign the price value to the entire DataFrame, only to specific rows in the DataFrame. – theprowler Jun 28 '17 at 13:30
  • @DavidZemens thanks for zero help Davey – theprowler Jun 28 '17 at 13:34
  • @Alexander `data` has been added. It's a list of lists – theprowler Jun 28 '17 at 13:49
  • OK you've provided `data` now, so that is helpful. `fish_frame = pd.DataFrame(data)` creates a dataframe from the `data`, no problems with that? Now when you do `fish_frame = pd.DataFrame(package_price)` you're trying to create a new/separate data frame? (I suppose you're not, but by re-assigning to `fish_frame` that is what you're doing.) Currently, your *entire* data frame contains *both* tables. Are you expecting that? – David Zemens Jun 28 '17 at 14:09
  • Yes, the goal is to capture all the tables, I need all the data. I guess I'm not sure why I thought `fish_frame = pd.DataFrame(package_price)` would add or append that price value to the DataFrame. I did try doing `fish_frame = pd.DataFrame.add(package_price)` but a slew of errors occur when I try that. So I'm not sure if I'm unaware of a simple way to add a value to a DataFrame – theprowler Jun 28 '17 at 14:33

1 Answer

When I create your fish_frame from pd.DataFrame(data), I get the following which consists of both sets of tabular data:

                 0           1           2
0          Species       Price      Weight
1          GBW Cod         .55       8,059
2      GBE Haddock         .03      14,628
3      GBW Haddock         .02      87,451
4            GB YT        1.50       1,818
5            Witch        1.25       1,414
6        GB Winter         .40      23,757
7          Redfish         .02         123
8       White Hake         .40         934
9          Pollock         .02       7,900
10  Package Price:              $21,151.67
11         Species      Weight        None
12         GBE Cod         820        None
13         GBW Cod      15,279        None
14     GBE Haddock      32,250        None
15     GBW Haddock     192,793        None
16           GB YT       6,239        None
17          SNE YT       2,018        None
18          GOM YT       1,511        None
19          Plaice       2,944        None
20           Witch       1,100        None
21       GB Winter     158,608        None
22      White Hake          31        None
23         Pollock       1,983        None
24      SNE Winter       7,257        None
25           Price  $58,500.00        None
26         Species      Weight        None
27         GBE Cod         792        None
28         GBW Cod      14,767        None
29     GBE Haddock      29,199        None
30     GBW Haddock     174,556        None
31           GB YT       5,268        None
32          SNE YT         544        None
33          GOM YT       1,957        None
34          Plaice       2,452        None
35           Witch         896        None
36       GB Winter     163,980        None
37      White Hake           8        None
38         Pollock       1,743        None
39      SNE Winter       3,709        None
40           Price  $57,750.00        None
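
If all the tables are needed but must stay distinguishable, one option is to split the flat list of rows into one frame per table before adding anything. The following is a sketch under the assumption that every table begins with a row whose first cell is 'Species', which holds for the `data` shown in the question; `split_tables` is a hypothetical helper and the sample is a shortened copy of `data`.

```python
import pandas as pd

# Shortened copy of the `data` list from the question (two of the tables).
data = [
    ["Species", "Price", "Weight"],
    ["GBW Cod", ".55", "8,059"],
    ["GBE Haddock", ".03", "14,628"],
    ["Package Price:", "", "$21,151.67"],
    ["Species", "Weight"],
    ["GBE Cod", "820"],
    ["GBW Cod", "15,279"],
    ["Price", "$58,500.00"],
]

def split_tables(rows):
    """Group rows into tables, starting a new table at each 'Species' header row."""
    tables = []
    for row in rows:
        if row and row[0] == "Species":
            tables.append([])
        if tables:  # ignore any rows that appear before the first header
            tables[-1].append(row)
    return tables

# Build one DataFrame per table, using each table's first row as column names.
frames = []
for table in split_tables(data):
    header, *body = table
    frames.append(pd.DataFrame(body, columns=header))

print(len(frames))
print(frames[0])
```

With separate frames, a value that belongs to one table can be added to that frame alone, and the frames can be concatenated afterwards if a single frame is still wanted.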

If you get rid of the outer loop for table in tablez: and iterate only over the rows of the first table, for row in tablez[0].find_all("tr"):, I think you'll end up with:

data = [['Species', 'Price', 'Weight'], ['GBW Cod', '.55', '8,059'],
        ['GBE Haddock', '.03', '14,628'], ['GBW Haddock', '.02', '87,451'], 
        ['GB YT', '1.50', '1,818'], ['Witch', '1.25', '1,414'], 
        ['GB Winter', '.40', '23,757'], ['Redfish', '.02', '123'], 
        ['White Hake', '.40', '934'], ['Pollock', '.02', '7,900'], 
        ['Package Price:', '', '$21,151.67']]

And then fish_frame=pd.DataFrame(data) will result in:

                 0      1           2
0          Species  Price      Weight
1          GBW Cod    .55       8,059
2      GBE Haddock    .03      14,628
3      GBW Haddock    .02      87,451
4            GB YT   1.50       1,818
5            Witch   1.25       1,414
6        GB Winter    .40      23,757
7          Redfish    .02         123
8       White Hake    .40         934
9          Pollock    .02       7,900
10  Package Price:         $21,151.67

Whether you make that change or not, this will add a column to the fish_frame:

srs = pd.Series([package_price]*len(fish_frame))
fish_frame[3] = pd.Series(srs,index=fish_frame.index)

And you should end up then with:

                 0      1           2    3
0          Species  Price      Weight    #891-2: Package for $21,151.67 but willing to sell species individually
1          GBW Cod    .55       8,059    #891-2: Package for $21,151.67 but willing to sell species individually
2      GBE Haddock    .03      14,628    #891-2: Package for $21,151.67 but willing to sell species individually
3      GBW Haddock    .02      87,451    #891-2: Package for $21,151.67 but willing to sell species individually
4            GB YT   1.50       1,818    #891-2: Package for $21,151.67 but willing to sell species individually
...
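
Two variations may be useful here. Assigning a plain string to a new column broadcasts it to every row without building a Series first, and `.loc` can restrict the assignment to the rows of one table when the frame holds several. This is a sketch on a small sample; the column names `package_all` and `package_first` and the row range `0:2` are assumptions chosen for the sample, not values from the original code.

```python
import pandas as pd

# First rows of two tables from the question, combined as in the full frame.
fish_frame = pd.DataFrame([
    ["Species", "Price", "Weight"],
    ["GBW Cod", ".55", "8,059"],
    ["Package Price:", "", "$21,151.67"],
    ["Species", "Weight", None],
    ["GBE Cod", "820", None],
])
package_price = "#891-2: Package for $21,151.67 but willing to sell species individually"

# Assigning a scalar broadcasts it to every row of the frame.
fish_frame["package_all"] = package_price

# To fill only the first table, select its row labels with .loc.
# Label slices are inclusive, so 0:2 covers rows 0, 1 and 2 of this sample.
fish_frame["package_first"] = None
fish_frame.loc[0:2, "package_first"] = package_price

print(fish_frame)
```

In the full frame from the question the first table occupies rows 0 through 10, so the slice would need to be adjusted accordingly.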
David Zemens
  • Ok ok that appears like a perfect printout, I'll try it now in my code. But just so I understand correctly, instead of attacking all the tables at once and putting them all directly into a DataFrame, you changed it up so that Python will instead go row by row to capture the data, is that right or? Also, I never fully understood what `Series` had to do with `Pandas` (I am a novice if that wasn't obvious) but I'm seeing now that it is used for editing a DataFrame..? – theprowler Jun 28 '17 at 14:37
  • well I don't know about the first thing, is it necessary for all (2) tables to exist in your DataFrame? Only you can answer that. I would think to create TWO separate DataFrame objects if you need both, otherwise only create the DF from the table that you're operating against. As for `Series`, IDK, this is literally the first time I've ever used Pandas. – David Zemens Jun 28 '17 at 14:39
  • You're already going "row by row", but you're doing "for each table go row by row" and my suggestion is that you probably only need the data from the *first* table inside your `fish_frame`, so you would omit `for table in tablez` and simply iterate `for row in tablez[0].find_all("tr")`. – David Zemens Jun 28 '17 at 14:40