I have a large Excel-sheet with has one column that contains several different identifiers (e.g. ISBNs). I have converted the sheet to a pandas dataframe and transformed the column with the identifiers to a list. A list entry of one row of the original column looks like this:
'ISBN:978-9941-30-551-1 Broschur : GEL 14.90, IDN:1215507534'
However, they aren't all the same, there are some with ISBNs, some don't have one, some have more entries, some less (5 in the example above) and the different IDs are mostly, but not all, separated by a comma.
In the next step, I have build a function that runs through the various list-items (one long string like the one above) and then splits this into the different words (so I get something like
'ISBN:978-9941-30-551-1', 'Broschur :', 'GEL', '14.90', 'IDN:1215507534'
I am looking to extract the values for ISBN and IDN, where present, to then add a designated column for ISBN and one for IDN to my original dataframe (instead of the "identifier"-column that contains the mixed data).
I now have the following code, which kind of does what it's supposed to, only I end up with lists in my dictionary and therefore a list for each entry in the resulting dataframe. I am sure there must be a better way of doing this, but cannot seem to think of it...
def find_stuff(item):
list_of_words = item.split()
ISBN = list()
IDN = list()
for word in list_of_words:
if 'ISBN' in word:
var = word
var = var.replace("ISBN:", "")
ISBN.append(var)
if 'IDN' in word:
var2 = word
var2 = var2.replace("IDN:", "")
IDN.append(var2)
sum_dict = {"ISBN":ISBN, "IDN":IDN}
return sum_dict
output = [find_stuff(item) for item in id_lists]
print(output)
Any help very much appreciated :)