Using Pandas and xlrd together. Ignoring absence/presence of column headers

Question

I am hoping you can help me - I'm sure its likely a small thing to fix, when one knows how.

In my workshop, neither I nor my colleagues can make 'find and replace all' changes via the front-end of our database. The boss just denies us that level of access. If we need to make changes to dozens or perhaps hundreds of records it must all be done by copy-and-paste or similar means. Craziness.

I am trying to make a workaround to that with Python 2 and in particular libraries such as Pandas, pyautogui and xlrd.

I have researched serval StackOverflow threads and have managed thus far to write some code that works well at reading a given XL file .In production, this will be a file exported from a found data set in the database GUI front-end and will be just a single column of 'Article Numbers' for the items in the computer workshop. This will always have an Excel column header. E.g

ANR
51234
34567
12345
...

All the records numbers are 5 digit numbers. We also have the means of scanning items with an IR scanner to a 'Workflow' app on the iPad we have and automatically making an XL file out of that list of scanned items.

The XL file here could look something similar to this.

It differs in that there is no column header. All XL files have their data 'anchored' at cell A1 on 'Sheet1" and again just single column will be used. No unnecessary complications here!

Here is the script anyway. When it is fully working system arguments will be supplied to it. For now, let's pretend that we need to change records to have their 'RAM' value changed from
"2GB" to "2 GB".

import xlrd
import string
import re
import pandas as pd


field = "RAM"
value = "2 GB"

myFile = "/Users/me/folder/testArticles.xlsx"
df = pd.read_excel(myFile)
myRegex = "^[0-9]{5}$"


# data collection and putting into lists.
workbook = xlrd.open_workbook(myFile)
sheet = workbook.sheet_by_index(0)
data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in     range(sheet.nrows)]

formatted = []
deDuped = []

# removing any possible XL headers, setting all values to strings
# that look like five-digit ints, apply a regex to be sure.
for i in data:
    cellValue = str(i)
    cellValue = cellValue.translate(None, '\'[u]\'')


    # remove the decimal point
    # Searching for the header will cause a database front-end problem. 
    cellValue = cellValue[:-2]
    cellValue = cellValue.translate(None, string.letters)

    # making sure only valid article numbers get through
    # blank rows etc can take a hike
    if len(cellValue) != 0:
        if re.match(myRegex, cellValue):
            formatted.append(cellValue)

# weeding out any possilbe dupes.
for i in formatted:
    if i not in deDuped:
        deDuped.append(i)


#main code block
for i in deDuped:

    #lots going on here involving pyauotgui
    #making sure of no error running searches, checking for warnings, moving/tabbing around DB front-end etc

    #if all goes to plan
    #removing that record number from the excel file and saving the change
    #so that if we run the script again for the same XL file 
    #we don't needlessly update an already OK record again. 

        df = df[~df['ANR'].astype(str).str.startswith(i)]
        df.to_excel(myFile, index=False)

What I really would to like to find out is how can I run the script so that "doesn't care" about the presence or absence of the column header.

df = df[~df['ANR'].astype(str).str.startswith(i)]

Appears to be the line of code where this all hangs on. I've made several changes to the line in different combination but my script always crashes.

If a column header, ("ANR") in my case, is essential for this particular 'pandas' method is there a straight-forward way of inserting a column header into an XL file if it lacks one in the first place - i.e the XL files that come from the IR scanner and the 'Workflow' app on the iPad?

Thanks guys!

UPDATE

I've tried as suggested by Patrick implementing some code to check if cell "A1" has a header or not. Partial success. I can put "ANR" in cell A1 if its missing but I lose whatever was there in the first place.

import xlwt
from openpyxl import Workbook, load_workbook
from xlutils.copy import copy
import openpyxl

# data collection
workbook = xlrd.open_workbook(myFile)
sheet = workbook.sheet_by_index(0)
data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]



cell_a1 = sheet.cell_value(rowx=0, colx=0)

if cell_a1 == "ANR":
    print "has header"
else:
    wb = openpyxl.load_workbook(filename= myFile)
    ws = wb['Sheet1']
    ws['A1'] = "ANE"
    wb.save(myFile)
    #re-open XL file again etc etc.

I found this new block of code over at writing to existing workbook using xlwt. In this instance the contributor actually used openpyxl.

Why not create a prepare stept that opens the xlsx, checks `if not A1 == 'ANR':` and if so: insert a cell as `A1` with content `'ANR'` that pushes all other cells 1 down. Now you got the same format for both files and can do the rest without hassle? — Patrick Artner, Sep 02 '18 at 09:47
Thanks for that tip. You're right, that's perhaps the most straight-forward way of addressing the issue. — William Lombard, Sep 02 '18 at 09:48

William Lombard · Accepted Answer · 2018-09-02T12:57:29.840

I think I got it fixed for myself.

Still a tiny bit messy but seems to be working. Added an 'if/else' clause to check the value of cell A1 and to take action accordingly. Found most of the code for this at how to append data using openpyxl python to excel file from a specified row? - using the suggestion for openpyxl

import pyperclip
import xlrd
import pyautogui
import string
import re
import os
import pandas as pd
import xlwt
from openpyxl import Workbook, load_workbook
from xlutils.copy import copy


field = "RAM"
value = "2 GB"
myFile = "/Users/me/testSerials.xlsx"
df = pd.read_excel(myFile)


myRegex = "^[0-9]{5}$"

# data collection
workbook = xlrd.open_workbook(myFile)
sheet = workbook.sheet_by_index(0)
data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]

cell_a1 = sheet.cell_value(rowx=0, colx=0)

if cell_a1 == "ANR":
    print "has header"
else:
    headers = ['ANR']
    workbook_name = 'myFile'
    wb = Workbook()
    page = wb.active
    # page.title = 'companies'
    page.append(headers)  # write the headers to the first line

    workbook = xlrd.open_workbook(workbook_name)
    sheet = workbook.sheet_by_index(0)
    data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]

    for records in data:
        page.append(records)

        wb.save(filename=workbook_name)

        #then load the data all over again, this time with inserted header
        workbook = xlrd.open_workbook(myFile)
        sheet = workbook.sheet_by_index(0)
        data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]


formatted = []
deDuped = []

# removing any possible XL headers, setting all values to strings that look like five-digit ints, apply a regex to be sure.
for i in data:
    cellValue = str(i)
    cellValue = cellValue.translate(None, '\'[u]\'')

    # remove the decimal point
    cellValue = cellValue[:-2]
    # cellValue = cellValue.translate(None, ".0")
    cellValue = cellValue.translate(None, string.letters)

    # making sure any valid ANRs get through
    if len(cellValue) != 0:
        if re.match(myRegex, cellValue):
            formatted.append(cellValue)
# ------------------------------------------

# weeding out any possilbe dupes.
for i in formatted:
    if i not in deDuped:
        deDuped.append(i)


# ref - https://stackoverflow.com/questions/48942743/python-pandas-to-remove-rows-in-excel
df = pd.read_excel(myFile)

print df


for i in deDuped:
    #pyautogui code is run here...

    #if all goes to plan update the XL file
        df = df[~df['ANR'].astype(str).str.startswith(i)]

        df.to_excel(myFile, index=False)

Using Pandas and xlrd together. Ignoring absence/presence of column headers

1 Answers1