Extracting 2 values from list with for-loop

Question

I have a large Excel sheet which has one column containing several different identifiers (e.g. ISBNs). I have converted the sheet to a pandas dataframe and transformed the identifier column into a list. One list entry, corresponding to one row of the original column, looks like this:

'ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534'

However, they aren't all the same: some have an ISBN, some don't, some have more entries and some fewer (5 in the example above), and the different IDs are mostly, but not always, separated by a comma.

In the next step, I have built a function that runs through the list items (each one a long string like the one above) and splits each into its individual words, so I get something like:

'ISBN:978-9941-30-551-1', 'Broschur :', 'GEL', '14.90', 'IDN:1215507534'

I am looking to extract the values for ISBN and IDN, where present, to then add a designated column for ISBN and one for IDN to my original dataframe (instead of the "identifier"-column that contains the mixed data).

I now have the following code, which kind of does what it's supposed to, only I end up with lists in my dictionary and therefore a list for each entry in the resulting dataframe. I am sure there must be a better way of doing this, but cannot seem to think of it...

def find_stuff(item): 
        
    list_of_words = item.split()
    ISBN = list()
    IDN = list()
    
    for word in list_of_words:

        if 'ISBN' in word: 
            var = word
            var = var.replace("ISBN:", "")
            ISBN.append(var)
             
        if 'IDN' in word: 
            var2 = word
            var2 = var2.replace("IDN:", "")
            IDN.append(var2)

    
    sum_dict = {"ISBN":ISBN, "IDN":IDN}
    
    return sum_dict



output = [find_stuff(item) for item in id_lists]
print(output)

Any help very much appreciated :)

Can you check if [my answer](https://stackoverflow.com/a/68567674/16343464) works for you? It is much more efficient than using a custom function to loop manually over the text. If you would like a different output or advice on post-processing, please provide the expected output and use case. — mozway, Jul 29 '21 at 08:02

score 1 · Accepted Answer · answered Jul 28 '21 at 21:36

Since you are working in pandas, I suggest using pandas' string methods to extract the relevant information and assign it to new columns directly. The code below demonstrates a few possibilities:

import pandas as pd

df = pd.DataFrame(['ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534'], columns=['identifier'])

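def retrieve_text(lst, text):
    # return the first list item containing `text`, or None if nothing matches
    try:
        return [i for i in lst if text in i][0]
    except (IndexError, TypeError):  # no match, or the cell was empty (NaN)
        return None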

df['ISBN'] = df['identifier'].str.split().apply(lambda x: retrieve_text(x, 'ISBN')) #use a custom function to filter the list
df['IDN'] = df['identifier'].str.split().apply(lambda x: retrieve_text(x, 'IDN'))
df['name'] = df['identifier'].str.split().str[1] #get by index
df['price'] = df['identifier'].str.extract(r'(\d+\.\d+)').astype('float') #use regex, no need to split the string here

Output:

	identifier	ISBN	IDN	name	price
0	ISBN:978-9941-30-551-1 Broschur : GEL 14.90, IDN:1215507534	ISBN:978-9941-30-551-1	IDN:1215507534	Broschur	14.9
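The `lambda` in the `apply` calls above is just an unnamed one-line function that forwards each split list to `retrieve_text` together with the search text. As a minimal sketch of what it does, the same result can be obtained with a named wrapper (the name `retrieve_isbn` is hypothetical) or by letting `apply` forward the extra keyword argument itself:

```python
import pandas as pd

df = pd.DataFrame(
    ['ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534'],
    columns=['identifier'],
)

def retrieve_text(lst, text):
    # return the first list item containing `text`, or None if nothing matches
    try:
        return [i for i in lst if text in i][0]
    except (IndexError, TypeError):
        return None

# each cell becomes a list of whitespace-separated words
words = df['identifier'].str.split()

# 1) the lambda version from the answer
with_lambda = words.apply(lambda x: retrieve_text(x, 'ISBN'))

# 2) the same thing with a named function (hypothetical helper name)
def retrieve_isbn(x):
    return retrieve_text(x, 'ISBN')

with_named = words.apply(retrieve_isbn)

# 3) Series.apply passes extra keyword arguments on to the function
with_kwarg = words.apply(retrieve_text, text='ISBN')
```

All three Series hold the same values; which form to use is mostly a matter of taste.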

Thank you very much, this works brilliantly! Not sure yet what the lambda-part does exactly so far, will look into it to learn. Thank you! — ssp24, Jul 29 '21 at 08:05

mozway · Answer 2 · 2021-07-29T07:59:35.023

You don't need your function; just apply a regex with named groups to the original column containing the long string.

Let's imagine this example:

import pandas as pd

df = pd.DataFrame({'other_column': ['blah', 'blah'],
                   'identifier': ['ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534',
                                  'ISBN:123-4567-89-012-3 blah IDN:1234567890 other'
                                 ],
                  })

  other_column                                                    identifier
0         blah  ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534
1         blah              ISBN:123-4567-89-012-3 blah IDN:1234567890 other

If ISBN is always before IDN, you can use pandas.Series.str.extract:

df['identifier'].str.extract(r'(?P<ISBN>ISBN:[\d-]+).*(?P<IDN>IDN:\d+)')

output:

                     ISBN             IDN
0  ISBN:978-9941-30-551-1  IDN:1215507534
1  ISBN:123-4567-89-012-3  IDN:1234567890

If they might not always appear in this order, use pandas.Series.str.extractall and rework the output with groupby:

(df['identifier'].str.extractall(r'(?P<ISBN>ISBN:[\d-]+)|(?P<IDN>IDN:\d+)')
                 .groupby(level=0).first()
)
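To see why the groupby step is needed, it may help to look at what extractall returns before it is collapsed. This is a minimal sketch; the second row (with the order reversed) is made up for illustration, and the names `pattern`, `matches` and `combined` are not from the answer:

```python
import pandas as pd

df = pd.DataFrame({'identifier': [
    'ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534',
    'IDN:1234567890 other ISBN:123-4567-89-012-3',  # made-up row, reversed order
]})

pattern = r'(?P<ISBN>ISBN:[\d-]+)|(?P<IDN>IDN:\d+)'

# One row per match, indexed by (original row, match number).
# Each match fills only one of the two columns; the other one is NaN.
matches = df['identifier'].str.extractall(pattern)

# first() keeps the first non-null value of each column per original row,
# which folds the separate matches back into a single row.
combined = matches.groupby(level=0).first()

print(matches)
print(combined)
```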

Finally, if you don't want the identifier names in the extracted values, adjust the regex slightly to r'(?:ISBN:(?P<ISBN>[\d-]+))|(?:IDN:(?P<IDN>\d+))':

(df['identifier'].str.extractall(r'(?:ISBN:(?P<ISBN>[\d-]+))|(?:IDN:(?P<IDN>\d+))')
                 .groupby(level=0).first()
)

output:

                ISBN         IDN
0  978-9941-30-551-1  1215507534
1  123-4567-89-012-3  1234567890
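The question asks for ISBN and IDN columns in place of the mixed identifier column, and mentions that some rows have no ISBN. One possible way to get there, sketched under the assumption that the dataframe index is unique, is to join the extracted frame back on the index. The second and third sample rows and the names `ids` and `result` are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'other_column': ['blah', 'blah', 'blah'],
    'identifier': [
        'ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534',
        'IDN:1234567890 other',   # made-up row without an ISBN
        'no identifiers here',    # made-up row without any match
    ],
})

pattern = r'(?:ISBN:(?P<ISBN>[\d-]+))|(?:IDN:(?P<IDN>\d+))'

# Rows without any match do not appear in the extractall result at all.
ids = df['identifier'].str.extractall(pattern).groupby(level=0).first()

# join aligns on the index, so missing rows and missing values end up as NaN/None.
result = df.join(ids).drop(columns='identifier')

print(result)
```

Because the join is index-based, rows that contain no identifier are kept and simply have empty ISBN and IDN cells.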

NB. If you need a dictionary as output, you can append .to_dict('index') at the end of your command. This gives you

{0: {'ISBN': '978-9941-30-551-1', 'IDN': '1215507534'},
 1: {'ISBN': '123-4567-89-012-3', 'IDN': '1234567890'}}

Thanks a lot! I will have to get more comfortable with regular expressions, it seems. — ssp24, Jul 29 '21 at 08:08
Well, this is for sure an investment to make, but this is very powerful. For instance you could [check that the ISBN format is correct](https://stackoverflow.com/questions/41271613/use-regex-to-verify-an-isbn-number), or many other things… — mozway, Jul 29 '21 at 08:13
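As an illustration of the kind of check mentioned in the last comment, here is a minimal sketch that validates the check digit of an ISBN-13 (digits weighted alternately by 1 and 3, total divisible by 10). The helper name `is_valid_isbn13` is hypothetical, ISBN-10 is not handled, and this is not the approach from the linked question:

```python
import re

def is_valid_isbn13(isbn: str) -> bool:
    """Return True if `isbn` has 13 digits and a correct ISBN-13 check digit."""
    # drop hyphens, spaces and any 'ISBN:' prefix, keeping only the digits
    digits = re.sub(r'[^0-9]', '', isbn)
    if len(digits) != 13:
        return False
    # weights alternate 1, 3, 1, 3, ... starting with the first digit
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0

print(is_valid_isbn13('978-9941-30-551-1'))  # ISBN from the question
print(is_valid_isbn13('123-4567-89-012-3'))  # placeholder ISBN from the answer
```

The ISBN from the question passes this check, while the placeholder ISBN used in the answer's second example row does not.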
