
I have a problem that is currently driving me nuts. I have a list with a couple of million entries, and I need to extract the product categories from them. Each entry looks like this: `"[['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']]"`. A type check confirms that each entry is a string: `print(type(item))` gives `<class 'str'>`. I searched online for a regex solution (preferably a fast one, given the millions of entries) to extract all the categories.

I found several questions here. Following "Match single quotes from python re", I tried `re.findall(r"'(\w+)'", item)` but only got an empty list, `[]`. Then I looked for alternative methods such as "Python Regex to find a string in double quotes within a string", where someone uses `matches = re.findall(r'\"(.+?)\"', item)` followed by `print(matches)`, but this failed in my case as well.
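For reference, here is a minimal sketch of where the `\w+` pattern falls short on the sample string exactly as posted: `\w` matches neither spaces nor `&`, so only single-word entries can sit between the two quotes. The alternative pattern `'([^']*)'` is an assumption of this sketch (it is not taken from the linked questions) and only holds if no category name contains an apostrophe.

```python
import re

item = "[['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']]"

# \w+ only matches letters, digits and underscores,
# so entries containing spaces or '&' cannot match.
word_only = re.findall(r"'(\w+)'", item)

# [^']* matches everything up to the next single quote
# (assumption: no apostrophes inside category names).
categories = re.findall(r"'([^']*)'", item)

print(word_only)
print(categories)
```

If the real data behaves differently from this sample (for example, different quote characters), the pattern would need adjusting.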

After that I tried a crude approach, just to get a workaround and solve this properly later: `list_cat_split = item.split(',')`, which gives me

["[['Electronics'"," 'Computers & Accessories'"," 'Cables & Accessories'"," 'Memory Card Adapters']]"]

Then I tried string methods to get rid of the stuff and then apply a regex:

list_categories = []
for item in list_cat_split:
    item.strip('\"')
    item.strip(']')
    item.strip('[')
    item.strip()
    category = re.findall(r"'(\w+)'", item)
    if category not in list_categories:
        list_categories.append(category)

However, even this approach failed; the result was `[['Electronics'], []]`. I searched further but did not find a proper solution. Sorry if this question is completely stupid; I am new to regex, and this is probably a no-brainer for regular regex users.
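One likely reason the loop above leaves the pieces unchanged is that Python strings are immutable: `str.strip` returns a new string, and the loop discards that return value. In addition, `re.findall` returns a list, so a list gets appended rather than a category name. A minimal sketch of the same split-and-strip idea with the results assigned, assuming no category name contains a comma:

```python
item = "[['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']]"

list_categories = []
for part in item.split(','):
    # strip() returns a new string, so the result must be kept.
    # Order: outer whitespace, then brackets, then the quotes.
    cleaned = part.strip().strip('[]').strip().strip("'\"")
    if cleaned and cleaned not in list_categories:
        list_categories.append(cleaned)

print(list_categories)
```

With millions of entries, the `not in list_categories` check scans the whole list each time; a `set` would probably be the better container for the uniqueness test.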

UPDATE:

Somehow I cannot answer my own question, therefore here is an update: thanks for the answers, and sorry for the incomplete information; I very rarely ask here and usually try to find solutions on my own. I do not want to use a database, because this is only a small fraction of my preprocessing work for an ML application that is written entirely in Python. Also, this is for my MSc project, so there is no production environment. Therefore I am fine with a slower, but working, solution, as I do it once and for all. However, as far as I can see, the solution of @FailSafe worked for me. [Screenshot: Jupyter notebook showing the resulting list]

But yes, I totally agree with @Wiktor Stribiżew: in a production setup, I would for sure set up a database and let this run overnight. Thanks for all the help anyway, great people here :-)

  • Try `ast.literal_eval`, [demo](https://ideone.com/cQX0lX) – Wiktor Stribiżew Apr 21 '19 at 21:31
  • You need to put these values into a database. – miken32 Apr 21 '19 at 21:36
  • could you show what is your desired output? – Rebin Apr 21 '19 at 21:40
  • sorry my desired output would be a list with all the categories like: ['Electronics', 'Computers & Accessories' ... and so on], but if I get something else that I can easily process this is also fine. @miken32 sadly I only get CSV files that come with this crap.. – BayerischerSchweitzer Apr 21 '19 at 21:43
  • It might be worthwhile taking the time to parse the CSV into a database, maybe it's something you could automate to happen during the night. I can't imagine any way that parsing a CSV with millions of rows on demand is going to be fast! – miken32 Apr 21 '19 at 21:47
  • @WiktorStribiżew thanks for the quick help, unfortunately your approach gives me an Error, both in the console (Ubuntu 16.04, with Anaconda 3.7.3 - but also with a venv with 3.5.0) it says something like this: ValueError: malformed node or string: <_ast.Subscript object at 0x7f472a0cb5f8> – BayerischerSchweitzer Apr 21 '19 at 21:50
  • Sooo, @WiktorStribiżew's post relates to what you posted as an example and the error you're receiving is likely because extra data appears in your item which is formatted in a way, or has characters, that ast is fussing about. Wiktor is also implying that with `a couple million entries` regex is likely a bad idea -- it will process those slowly. You might need to use Pandas somehow. With more data Wiktor can likely give you guidance regarding how to make ast handle it. The user Azat Ibrakov can as well. If you're ok with sluggishness though and have `re.findall('[\'\"]([\S\s]*?)[\'\"]', item)` – FailSafe Apr 22 '19 at 04:17
  • Please post your data samples or describe the exact format. Else, no one can help you. – Wiktor Stribiżew Apr 22 '19 at 10:07
  • I did update my question as somehow I cannot answer my own question. thanks for all the help. @WiktorStribiżew – BayerischerSchweitzer Apr 22 '19 at 12:47
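The two suggestions from the comments can be combined in a short sketch: try `ast.literal_eval` first and fall back to the regex from FailSafe's comment when an entry is not a valid Python literal. The function name `parse_categories` and the malformed sample entry are hypothetical, made up for illustration; the sketch also assumes that well-formed entries are always a list of lists of strings, as in the question.

```python
import ast
import re

# Regex taken from FailSafe's comment: lazily matches text between quotes.
QUOTED = re.compile(r"['\"]([\S\s]*?)['\"]")

def parse_categories(item):
    """Return a flat list of category strings from one raw entry."""
    try:
        parsed = ast.literal_eval(item)
    except (ValueError, SyntaxError):
        # Entry is not a plain literal (e.g. it contains a subscript),
        # so fall back to the slower regex extraction.
        return QUOTED.findall(item)
    # Assumption: the literal is a list of lists of strings.
    return [cat for group in parsed for cat in group]

good = "[['Electronics', 'Computers & Accessories']]"
bad = "[['Electronics', 'Cables & Accessories'][0]]"  # hypothetical malformed entry

print(parse_categories(good))
print(parse_categories(bad))
```

Catching the exception per entry keeps one bad row from aborting the whole run, which seems to fit a one-off preprocessing job.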

1 Answer


This may not be your final answer, but it creates a list of categories.

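x = "[['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']]"

# Drop the outer "[[" and "]]", then split on the commas.
y = x[2:-2]
# Remove the surrounding whitespace and single quotes from each piece.
z = [part.strip().strip("'") for part in y.split(',')]

for item in z:
    print(item)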
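Since the question asks for all categories across many entries, the slicing approach above could be extended with a `set` to collect each category only once. This is a rough sketch: the `entries` list is made-up sample data, and it assumes every entry has the exact `[[...]]` shape and no commas inside category names.

```python
# Hypothetical sample data in the same shape as the question's entries.
entries = [
    "[['Electronics', 'Computers & Accessories']]",
    "[['Electronics', 'Cables & Accessories']]",
]

unique = set()
for entry in entries:
    # Drop the outer brackets, split on commas, clean each piece.
    for part in entry[2:-2].split(','):
        unique.add(part.strip().strip("'"))

# Sorted only to make the output order stable.
result = sorted(unique)
print(result)
```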
Rebin