
I have a link, and within that link there are some products. Each product page contains a table of specifications, where the first column holds the field names (the headers) and the second column holds the corresponding data. The first column differs from product to product, with some overlapping categories. I want to get one big table that has all these categories as columns and the different products as rows. I am able to get the data for one table (one product) as follows:

import requests
import csv
from bs4 import BeautifulSoup

def cpap_spider(max_pages):
    # Walk the category listing pages and follow each product link.
    page = 1
    while page <= max_pages:
        url = "https://www.1800cpap.com/cpap-masks/nasal?page=" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll("a", {"class": "facets-item-cell-grid-title"}):
            href = "https://www.1800cpap.com" + link.get("href")
            title = link.string
            each_item(href)
            print(href)
            #print(title)
        page += 1

data = []

def each_item(item_url):
    # Scrape the specification table of one product page and append
    # its rows to the module-level `data` list.
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    table = soup.find("table", {"class": "table"})
    table_rows = table.find_all('tr')
    for row in table_rows:
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele])  # get rid of empty values
    # Rewrite the csv file with everything collected so far.
    b = open('all_appended.csv', 'w')
    a = csv.writer(b)
    a.writerows(data)
    b.close()

cpap_spider(1)

This code appends all the tables one after the other. However, I want a single big table with the unique headers in the first row and each product's corresponding values in the rows below.
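To make the target layout concrete, here is a minimal sketch using pandas (pandas is not used in the code above, and the product names and spec fields here are invented purely to show the desired shape):

import pandas as pd

# One dict per product, keyed by the spec name found in the first column.
products = {
    "Product A": {"Mask Type": "Nasal", "Actual Weight": "2.4 oz"},
    "Product B": {"Mask Type": "Nasal", "Latex Free": "Yes"},
}

# orient="index" gives one row per product and one column per unique spec
# name; specs a product does not have are left as NaN.
df = pd.DataFrame.from_dict(products, orient="index")
df.to_csv("all_products.csv")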


2 Answers


Use xlsxwriter instead of csv, because if a text value itself contains a comma, a comma-delimited file can cause problems: since columns are separated by ",", a value like "aa,bb" looks like two columns, "aa" and "bb".
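For illustration, this is the failure mode described above, shown with a naive comma join rather than the csv module (the strings are invented):

text = "aa,bb"
row = ["product", text]
line = ",".join(row)       # 'product,aa,bb'
print(line.split(","))     # ['product', 'aa', 'bb'], three columns instead of two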

That's what you need:

import requests
import xlsxwriter
from bs4 import BeautifulSoup

def cpap_spider(max_pages):
    global row_i
    page = 1
    while page <= max_pages:
        url = "https://www.1800cpap.com/cpap-masks/nasal?page=" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll("a", {"class": "facets-item-cell-grid-title"}):
            href = "https://www.1800cpap.com" + link.get("href")
            title = link.string
            worksheet.write(row_i, 0, title)
            each_item(href)
            print(href)
            #print(title)
        page += 1

def each_item(item_url):
    global cols_names, row_i
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    table = soup.find("table", {"class": "table"})
    if table:
        table_rows = table.find_all('tr')
    else:
        # No spec table on this page: keep the title row and move on,
        # otherwise the next product would overwrite it (see comments below).
        row_i += 1
        return
    for row in table_rows:
        cols = row.find_all('td')
        for ele in range(0, len(cols)):
            temp = cols[ele].text.strip()
            if temp:
                # Strip a trailing ":" so e.g. "Actual Weight:" becomes "Actual Weight"
                if temp[-1:] == ":":
                    temp = temp[:-1]
                # The first cell of the row is the column name
                if ele == 0:
                    try:
                        cols_names_i = cols_names.index(temp)
                    except ValueError:
                        # New column: remember it and write its header in row 0
                        cols_names.append(temp)
                        cols_names_i = len(cols_names) - 1
                        worksheet.write(0, cols_names_i + 1, temp)
                        continue
                worksheet.write(row_i, cols_names_i + 1, temp)
    row_i += 1

cols_names = []
cols_names_i = 0
row_i = 1
workbook = xlsxwriter.Workbook('all_appended.xlsx')
worksheet = workbook.add_worksheet()
worksheet.write(0, 0, "Title")

cpap_spider(1)
#each_item("https://www.1800cpap.com/viva-nasal-cpap-mask-by-3b-medical")
workbook.close()
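With this layout, row 0 of the sheet holds "Title" followed by each unique spec name in the order it is first seen, column 0 holds one product title per row, and each product's values are written into the column matching the spec name, so specs a product does not have are simply left blank.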
Navpreet Devpuri
  • Would I have to add a try/except block? For some links this is not working. – huy Jun 26 '20 at 12:22
  • For some links there is no table at all to get the information from, and the error comes from the line `table_rows = table.find_all('tr')`: it says NoneType object. – huy Jun 26 '20 at 12:44
  • You should first check `if table` – Navpreet Devpuri Jun 26 '20 at 12:46
  • or try to check the value of `table` by using `console.log(table)` – Navpreet Devpuri Jun 26 '20 at 12:46
  • I think `table` is `None`; if you add `if table`, check for rows when it exists, otherwise just continue to the next link – Navpreet Devpuri Jun 26 '20 at 12:48
  • Yes, it gives me the required result; I just added a row each time there is no table. Thank you so much! – huy Jun 26 '20 at 13:09
  • Can you help me extend this code for table entries that have both `td` and `th` tags, where the latter should appear as headers? Thanks in advance – huy Jul 03 '20 at 16:46
  • You should try to google something like **"html table to csv javascript"** and you will find great answers like [this](https://stackoverflow.com/a/33807762/8099521) one. – Navpreet Devpuri Jul 03 '20 at 19:31
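Following up on the last two comments, here is a rough Python sketch of treating `th` cells as the column name; this is not part of the answer above, and the HTML string is invented purely for illustration:

from bs4 import BeautifulSoup

# Invented one-row table whose header cell is a <th> instead of a <td>.
html = "<table><tr><th>Mask Type:</th><td>Nasal</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

for row in soup.find_all("tr"):
    th = row.find("th")
    tds = row.find_all("td")
    if th and tds:
        # <th> carries the column name, the first <td> carries the value.
        name = th.text.strip().rstrip(":")
        value = tds[0].text.strip()
    elif len(tds) >= 2:
        # Fall back to the original layout: the first <td> is the name.
        name = tds[0].text.strip().rstrip(":")
        value = tds[1].text.strip()
    else:
        continue
    print(name, value)   # e.g. "Mask Type Nasal"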

Assuming that the headers are consistently the first row of each table, you just have to skip that row in every table except the first. A simple way to do that is to store the index of the first row to process in a variable initialized to 0, and set it to 1 after the first table has been processed. Possible code:

def cpap_spider(max_pages):
    page=1
    start_row = 0
    while page<=max_pages:
        ...
        for link in soup.findAll("a", {"class":"facets-item-cell-grid-title"}):
            ...
            each_item(href, start_row)
            start_row = 1        # only the first call to each_item gets start_row=0; later calls skip the header row
            print(href)
            #print(title)
        page+=1
...
def each_item(item_url, start_row):
    ...    
    table_rows= table.find_all('tr')
    for row in table_rows[start_row:]:        # skip first row if start_row==1
        ...
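A minimal, self-contained version of this idea, reusing the question's URL, CSS classes, and csv output; this is only a sketch, assuming as above that the header row is the first row of each table:

import csv
import requests
from bs4 import BeautifulSoup

data = []

def each_item(item_url, start_row):
    soup = BeautifulSoup(requests.get(item_url).text, 'html.parser')
    table = soup.find("table", {"class": "table"})
    if not table:
        return
    for row in table.find_all('tr')[start_row:]:   # skip the header row when start_row == 1
        cols = [td.text.strip() for td in row.find_all('td')]
        data.append([c for c in cols if c])

def cpap_spider(max_pages):
    start_row = 0
    for page in range(1, max_pages + 1):
        url = "https://www.1800cpap.com/cpap-masks/nasal?page=" + str(page)
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        for link in soup.findAll("a", {"class": "facets-item-cell-grid-title"}):
            each_item("https://www.1800cpap.com" + link.get("href"), start_row)
            start_row = 1   # only the first product keeps its header row

cpap_spider(1)
with open('all_appended.csv', 'w', newline='') as f:
    csv.writer(f).writerows(data)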
Serge Ballesta
  • this code returns a blank csv file. I don't understand the problem – huy Jun 26 '20 at 07:45
  • Could you please also refer to https://stackoverflow.com/questions/62588205/scraping-table-data-from-multiple-links-and-combine-this-together-in-one-excel-f/62589392#62589392 for the complete question. – huy Jun 26 '20 at 07:53