I have a problem that drives me nuts currently. I have a list with a couple of million entries, and I need to extract product categories from them. Each entry looks like this: "[['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']]"
A type check did indeed give me string: print(type(item))
<class 'str'>
Now I searched online for a possible (and preferably fast - because of the million entries) regex solution to extract all the categories.
I found several questions here Match single quotes from python re: I tried re.findall(r"'(\w+)'", item)
but only got empty brackets []
.
Then I went on and searched for alternative methods like this one: Python Regex to find a string in double quotes within a string There someone tries the following matches=re.findall(r'\"(.+?)\"',item)
print(matches)
, but this failed in my case as well...
After that I tried some idiotic approach to get at least a workaround and solve this problem later: list_cat_split = item.split(',')
which gives me
e["[['Electronics'"," 'Computers & Accessories'"," 'Cables & Accessories'"," 'Memory Card Adapters']]"]
Then I tried string methods to get rid of the stuff and then apply a regex:
list_categories = []
for item in list_cat_split:
item.strip('\"')
item.strip(']')
item.strip('[')
item.strip()
category = re.findall(r"'(\w+)'", item)
if category not in list_categories:
list_categories.append(category)
however even this approach failed: [['Electronics'], []]
I searched further but did not find a proper solution. Sorry if this question is completly stupid, I am new to regex, and probably this is a no-brainer for regular regex users?
UPDATE:
Somehow I cannot answer my own question, thererfore here an update: thanks for the answers - sorry for incomplete information, I very rarely ask here and usually try to find solutions on my own.. I do not want to use a database, because this is only a small fraction of my preprocessing work for an ML-application that is written entirely in Python. Also this is for my MSc project, so no production environment. Therefore I am fine with a slower, but working, solution as I do it once and for all. However as far as I can see the solution of @FailSafe worked for me:screenshot of my jupyter notebook here the result with list
But yes I totally agree with @ Wiktor Stribiżew: in a production setup, I would for sure set up a database and let this run over night,.. Thanks for all the help anyways, great people here :-)