
I tried going over "How to build and fill pandas dataframe from for loop?" but can't seem to write my values to my columns.

Ultimately I am getting data from a webpage and want to put it into a dataframe.

My headers are predefined as:

d1 = pd.DataFrame(columns=['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8', 'col9',
        'col10', 'col11', 'col12', 'col13', 'col14', 'col15', 'col16', 'col17'])

Now I have values that I get in a for loop. How can I write them into columns 1 to 17 of one row, then start again at column 1 on the next row for the next 17 values?

row = soup.find_all('td', attrs = {'class': 'Table__TD'})
for data in row:
    print(data.get_text())

Sample output, row 1:

Mon 11/11
SA
100
31
3-5
60.0
1-3
33.3
1-2
50.0
10
4
0
1
1
2
8

Sample output, row 2:

Wed 11/13
@CHA
W119-117
32
1-5
20.0
1-5
20.0
0-0
0.0
3
1
0
1
3
3
3

Expected output: a dataframe with one row per group of 17 values, filled across col1 to col17 (screenshot not available).

Any help would be appreciated.

excelguy
  • What does row have? @excelguy – Vishnudev Krishnadas Nov 13 '19 at 03:55
  • check this https://stackoverflow.com/questions/51499385/how-to-add-values-to-a-new-column-in-pandas-dataframe – moys Nov 13 '19 at 04:09
  • I’ll second what @Vishnudev said. We need more information, about the code, where the data comes from, etc. See: [mcve]. – AMC Nov 13 '19 at 04:15
  • `d1.loc[len(d1), col_name] = value` but using a loop to put values in a `dataframe` sounds really bad. I suggest you post a new question with your bigger issue, of entering data into `dataframe`, so people can see if it can be done without a loop at all. – Aryerez Nov 13 '19 at 08:29
  • Added details, hopefully this helps. – excelguy Nov 14 '19 at 00:22

3 Answers


First, define a list of column names:

cols = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8', 'col9',
        'col10', 'col11', 'col12', 'col13', 'col14', 'col15', 'col16', 'col17']

Then build a list of values:

row = [x.get_text() for x in soup.find_all('td', attrs = {'class': 'Table__TD'})]
print(row)
# ['Mon 11/11', 'SA', '100', '31', '3-5', '60.0', '1-3', '33.3', '1-2', '50.0', '10', '4', '0', '1', '1', '2', '8']

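Then we can zip the columns and the values together and add the result to the dataframe as a new row. `DataFrame.append` was removed in pandas 2.0, so `pd.concat` is used here:

# zip pairs each column name with a value; the resulting dict becomes one row
d1 = pd.concat([d1, pd.DataFrame([dict(zip(cols, row))])], ignore_index=True)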
print(d1)
#         col1 col2 col3 col4 col5  col6 col7  col8 col9 col10 col11 col12  \
# 0  Mon 11/11   SA  100   31  3-5  60.0  1-3  33.3  1-2  50.0    10     4   
# 
#   col13 col14 col15 col16 col17  
# 0     0     1     1     2     8
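
One detail worth spelling out is that `zip` stops at the shorter of its inputs, so when `row` holds every cell on the page, only the first 17 values are ever paired with the headers. The following is a minimal sketch of that behaviour, assuming the two sample rows from the question arrive as one flat list; the `cells` list is a hypothetical stand-in for the scraped text, not real output from `soup`.

```python
import pandas as pd

cols = [f"col{i}" for i in range(1, 18)]  # col1 .. col17, as in the question

# Hypothetical stand-in for the text of every <td> cell:
# the two sample rows from the question, flattened into 34 strings.
cells = (
    "Mon 11/11,SA,100,31,3-5,60.0,1-3,33.3,1-2,50.0,10,4,0,1,1,2,8,"
    "Wed 11/13,@CHA,W119-117,32,1-5,20.0,1-5,20.0,0-0,0.0,3,1,0,1,3,3,3"
).split(",")

# zip stops after 17 pairs, so this always describes the first row only.
first = dict(zip(cols, cells))

# Skipping the first 17 cells pairs the headers with the second row.
second = dict(zip(cols, cells[17:]))

# A list of such dicts can be turned into a dataframe in one step.
d1 = pd.DataFrame([first, second], columns=cols)
print(d1[["col1", "col2", "col3"]])
```

Re-running the zip on the unchanged list would pair the same first 17 cells again, which may be why the same line keeps getting appended.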
Hongpei
  • This may be a good option as my headers will remain static. Issue: how can I iterate through my BeautifulSoup results? If I keep running my code, it appends the same line again and again. – excelguy Nov 16 '19 at 14:16

You can try this:

import pandas as pd

columns = [
    'col1',
    'col2',
    'col3',
    'col4',
    'col5',
    'col6',
    'col7',
    'col8',
    'col9',
    'col10',
    'col11',
    'col12',
    'col13',
    'col14',
    'col15',
    'col16',
    'col17',
]

# create dataframe
d1 = pd.DataFrame(columns=columns)

full = []

for data in soup.find_all('td', attrs={'class': 'Table__TD'}):
    full.append(data.get_text())

# separate the full list into sub-lists of 17 elements each
rows = [full[i: i+17] for i in range(0, len(full), 17)]

# add the list of lists to the dataframe
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used here)
d1 = pd.concat([d1, pd.DataFrame(rows, columns=d1.columns)], ignore_index=True)
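
The slicing step can be tried without a live page. Below is a minimal sketch, assuming every table row contributes exactly 17 cells; the `chunk` helper and the `full` list are hypothetical names for illustration, with `full` standing in for the scraped cell text. The helper raises an error when the cell count is not a multiple of the row width, because a short last row would otherwise be silently padded with missing values.

```python
import pandas as pd

columns = [f"col{i}" for i in range(1, 18)]  # col1 .. col17


def chunk(cells, width):
    """Split a flat list into consecutive sub-lists of `width` items."""
    if len(cells) % width:
        raise ValueError(
            f"{len(cells)} cells cannot be split evenly into rows of {width}"
        )
    return [cells[i:i + width] for i in range(0, len(cells), width)]


# Hypothetical stand-in for the list built from soup.find_all(...):
# the two sample rows from the question as one flat list.
full = (
    "Mon 11/11,SA,100,31,3-5,60.0,1-3,33.3,1-2,50.0,10,4,0,1,1,2,8,"
    "Wed 11/13,@CHA,W119-117,32,1-5,20.0,1-5,20.0,0-0,0.0,3,1,0,1,3,3,3"
).split(",")

# Build the dataframe directly from the list of rows.
d1 = pd.DataFrame(chunk(full, len(columns)), columns=columns)
print(d1.shape)
```

Building the dataframe from the finished list of rows avoids creating an empty dataframe first and then growing it.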
E. Zeytinci
  • I'm getting undefined name `soups` in my IDE, but when I run the code I get `ValueError: Length mismatch: Expected axis has 0 elements, new values have 17 elements` – excelguy Nov 16 '19 at 14:12
  • I edited my answer. Can you still share soup with me please? – E. Zeytinci Nov 16 '19 at 16:04
  • I wrote `soups` for the possibility that you could have more than one `soup`. That's what I mean by iteration. – E. Zeytinci Nov 16 '19 at 17:48
  • Hey, I don't have a `soups`, but I have this: `row = soup.find_all('td', attrs = {'class': 'Table__TD'})` `for data in row: print(data.get_text())`. This gets the data I want to append to each column. Does this help? – excelguy Nov 16 '19 at 18:48
  • I updated my answer again. Can you check this again please? – E. Zeytinci Nov 16 '19 at 19:30
  • This is what I want, but `row` should not be static. It should be derived from my comment above and repeated for the rows in this dataframe. Does that make sense? – excelguy Nov 16 '19 at 20:38
  • If there are 17 elements every time, it will be added to the dataframe as long as you use `append`. – E. Zeytinci Nov 16 '19 at 20:43
  • Hi sir, for some reason the other rows are not appending (see updated question). I should have a 2nd and 3rd row, etc., but only the first row is being appended to the columns. – excelguy Nov 16 '19 at 22:08
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/202490/discussion-between-e-zeytinci-and-excelguy). – E. Zeytinci Nov 16 '19 at 22:52

Appending data to an existing DataFrame is really slow.

It is better to collect the data from soup into a list, create a new dataframe from that list, and then concat the new dataframe to your old one.

This is a quick benchmark, using an empty df for each case. In your real code, df should be your existing dataframe:

import pandas as pd

# set up some sample data
headers = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 
           'col8', 'col9', 'col10', 'col11', 'col12', 'col13', 'col14',
           'col15', 'col16', 'col17']
raw_data = 'Mon 11/11,SA,100,31,3-5,60.0,1-3,33.3,1-2,50.0,10,4,0,1,1,2,8'.split(",")
row_dict_data = dict(zip(headers, raw_data))

%%time
# append (DataFrame.append was removed in pandas 2.0, so this cell needs pandas < 2.0)
df = pd.DataFrame(columns=headers)
for i in range(100):
    df = df.append([row_dict_data])

# CPU times: user 258 ms, sys: 4.82 ms, total: 263 ms
# Wall time: 261 ms


%%time
# new dataframe
df = pd.DataFrame(columns=headers)
df2 = pd.DataFrame([raw_data for i in range(100)], columns=headers)
df3 = pd.concat([df, df2], sort=False)

# CPU times: user 7.03 ms, sys: 1.16 ms, total: 8.2 ms
# Wall time: 7.19 ms
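
The same pattern can be written as a plain script, without the `%%time` cell magic. This is a minimal sketch, assuming the existing dataframe already holds one row; the names `existing`, `new_rows`, and `combined` are hypothetical, and no timing is claimed for it.

```python
import pandas as pd

headers = [f"col{i}" for i in range(1, 18)]  # col1 .. col17
raw_data = "Mon 11/11,SA,100,31,3-5,60.0,1-3,33.3,1-2,50.0,10,4,0,1,1,2,8".split(",")

# Hypothetical existing dataframe that already holds one row.
existing = pd.DataFrame([raw_data], columns=headers)

# Collect the new rows in a plain Python list inside the loop ...
new_rows = []
for _ in range(100):
    new_rows.append(raw_data)

# ... and touch pandas only once, after the loop has finished.
new_df = pd.DataFrame(new_rows, columns=headers)
combined = pd.concat([existing, new_df], ignore_index=True, sort=False)
print(combined.shape)
```

Passing `ignore_index=True` renumbers the rows from 0, so the combined dataframe does not carry duplicate index labels from its two parts.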
hunzter