Python - Split Text by character group

Question

I'm trying to parse some text into pieces at every occurrence of a group of characters. In my case the character groups would be "* ((" and ")) ".

import re
file = "Name* ((Bla Bla Bla (Bla Bla) A40 & A41)) Name2* ((Bla Bla Bla (Bla Bla) A42 & A43)) Name3* ((Bla Bla Bla (Bla Bla) A44 & A45)) Name4* ((Bla Bla Bla (Bla Bla) A46 & A47)) Name5* ((Bla Bla Bla (Bla Bla) A48 & A49)) Name6* ((Bla Bla Bla (Bla Bla) A50 & A51)) Name7* ((Bla Bla Bla (Bla Bla) A452 & A53)) Name8* ((Bla Bla Bla (Bla Bla) A54 & A55)) Name9* ((Bla Bla Bla (Bla Bla) A56 & A57)) Name10* ((Bla Bla Bla (Bla Bla) A58 & A59)) Name11* ((Bla Bla Bla (Bla Bla) A60 & A61)) Name12* ((Bla Bla Bla (Bla Bla) A62 & A63)) Name13* ((Bla Bla Bla (Bla Bla) A64 & A65)) Name14* ((Bla Bla Bla (Bla Bla) A66 & A67)) Name14* ((Bla Bla Bla (Bla Bla) A68 & A69))"
parse = re.split('[* ((][)) ]', file)
print parse

My results come back as:

['Name', '((Bla Bla Bla (Bla Bla) A40 & A41)) Name2', '((Bla Bla Bla (Bla Bla) A42 & A43)) Name3', '((Bla Bla Bla (Bla Bla) A44 & A45)) Name4', '((Bla Bla Bla (Bla Bla) A46 & A47)) Name5', '((Bla Bla Bla (Bla Bla) A48 & A49)) Name6', '((Bla Bla Bla (Bla Bla) A50 & A51)) Name7', '((Bla Bla Bla (Bla Bla) A452 & A53)) Name8', '((Bla Bla Bla (Bla Bla) A54 & A55)) Name9', '((Bla Bla Bla (Bla Bla) A56 & A57)) Name10', '((Bla Bla Bla (Bla Bla) A58 & A59)) Name11', '((Bla Bla Bla (Bla Bla) A60 & A61)) Name12', '((Bla Bla Bla (Bla Bla) A62 & A63)) Name13', '((Bla Bla Bla (Bla Bla) A64 & A65)) Name14', '((Bla Bla Bla (Bla Bla) A66 & A67)) Name14', '((Bla Bla Bla (Bla Bla) A68 & A69))']

It only seems to be splitting the text at the "*". I can't figure out how to set up more than one multi-character separator. Does anyone have any suggestions? Thanks.

what are you actually trying to split on? Also are you sure you don't want findall? — Padraic Cunningham, Mar 04 '16 at 20:18
I'm trying to split at every "* ((" and ")) ". I honestly don't know if findall is what I want. The text is essentially in one cell of a table, and I'm trying to parse it out into separate cells, and fields. — user1457123, Mar 04 '16 at 20:22
Do you want to keep the (( and )) or not? Do you want something like `re.split('\*\s+\(\(|\)\)', file)` — Padraic Cunningham, Mar 04 '16 at 20:23
That worked perfectly. It removes the (( and )). Thanks Padraic :) There's quite a bit more I have to do with the text, but this gets me past the parsing part. — user1457123, Mar 04 '16 at 20:35
No worries, you can add it as an answer if it worked and get yourself some rep. You might also need to filter what it returns, something like `list(filter(None, (x.strip() for x in re.split('\*\s+\(\(|\)\)', file))))`. Also, if this is coming from a file you could actually do it without a regex quite easily — Padraic Cunningham, Mar 04 '16 at 20:51
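To make the comment thread concrete, here is a minimal sketch of why the original pattern only split near the "*" and how the escaped pattern from the comments behaves. The shortened `text` string and the variable names are assumptions made for illustration; only the two patterns come from the thread.

```python
import re

# Shortened sample in the same shape as the question's string (an assumption).
text = "Name* ((Bla (Bla) A40 & A41)) Name2* ((Bla (Bla) A42 & A43))"

# Square brackets define a character class: ONE character from the set.
# '[* ((][)) ]' therefore matches any two-character sequence whose first
# character is '*', ' ' or '(' and whose second is ')' or ' '.
# In this text only "* " fits, so the "((" and "))" are left in place.
broken = re.split(r'[* ((][)) ]', text)

# Escaped literals joined with '|' match the whole separators instead.
pieces = re.split(r'\*\s+\(\(|\)\)', text)

# Strip surrounding whitespace and drop the empty strings left at the edges.
cleaned = [p.strip() for p in pieces if p.strip()]

print(broken)
print(cleaned)
```

The list comprehension does the same job as the `list(filter(None, ...))` call in the comment; which one reads better is a matter of taste.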

score 0 · Answer 1 · answered Mar 04 '16 at 20:50

I'd try the following regex:

import re
file = "your....string.... content" #your string goes here.

parse = re.split(r"\*|\)\)|\(\(", file)

OUTPUT:

['Name', ' ', 'Bla Bla Bla (Bla Bla) A40 & A41', ' Name2', ' ', 'Bla Bla Bla (Bla Bla) A42 & A43', ' Name3', ' ', 'Bla Bla Bla (Bla Bla) A44 & A45', ' Name4', ' ', 'Bla Bla Bla (Bla Bla) A46 & A47', ' Name5', ' ', 'Bla Bla Bla (Bla Bla) A48 & A49', ' Name6', ' ', 'Bla Bla Bla (Bla Bla) A50 & A51', ' Name7', ' ', 'Bla Bla Bla (Bla Bla) A452 & A53', ' Name8', ' ', 'Bla Bla Bla (Bla Bla) A54 & A55', ' Name9', ' ', 'Bla Bla Bla (Bla Bla) A56 & A57', ' Name10', ' ', 'Bla Bla Bla (Bla Bla) A58 & A59', ' Name11', ' ', 'Bla Bla Bla (Bla Bla) A60 & A61', ' Name12', ' ', 'Bla Bla Bla (Bla Bla) A62 & A63', ' Name13', ' ', 'Bla Bla Bla (Bla Bla) A64 & A65', ' Name14', ' ', 'Bla Bla Bla (Bla Bla) A66 & A67', ' Name14', ' ', 'Bla Bla Bla (Bla Bla) A68 & A69', '']

This really just adds more whitespace-only strings than what I suggested in the comments; otherwise it is the same — Padraic Cunningham, Mar 04 '16 at 20:52
Oh, I might have missed your comment. I agree there are a lot more spaces, and that's natural given the OP's requirement — Saleem, Mar 04 '16 at 21:35

score 0 · Answer 2 · answered Apr 19 '16 at 20:42

I wanted to share the solution I ended up using in case anyone else could benefit. There's a mixture of regex in there, but I used findall instead of split. Now that I've got this far I have to look into controlling the output more. The data gets dumped into 3 fields (From_Node, To_Node, Link). I need the value from the first "To_Node" to become the value of the "From_Node" on the next row, and so on. Imagine points along a line: point A to B, then point B to C, then point C to D, etc. With my limited knowledge I don't even know where to begin looking up this solution. Any ideas?

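import re
import arcpy

# Local variables (raw strings keep the backslashes in the paths literal):
Table1 = r"D:\Database1.mdb\Table1"
RAW_Data = r"D:\Database1.mdb\RAW_Data"

# Create cursors and insert rows
insertcursor = arcpy.da.InsertCursor(Table1, ["From_Node", "To_Node", "Link"])
# The same source field is read three times, once per output field.
with arcpy.da.SearchCursor(RAW_Data, ["Field1", "Field1", "Field1"]) as searchcursor:
    try:
        for row in searchcursor:
            # Word characters immediately before a "*" are the node names.
            listFrom_Node = re.findall(r'\w+(?=\*\s*)', row[0])  # From Node
            print(listFrom_Node)
            print("From Node List Success")
            # Same pattern as above, so To_Node currently repeats From_Node.
            listTo_Node = re.findall(r'\w+(?=\*\s*)', row[1])  # To Node
            print(listTo_Node)
            print("To Node List Success")
            # Non-greedy capture of everything between "((" and "))".
            listLink = re.findall(r'\(\((.*?)\)\)', row[2])  # Link descriptions
            print(listLink)
            print("Link List Success")
            for from_node, to_node, link in zip(listFrom_Node, listTo_Node, listLink):
                insertcursor.insertRow((from_node, to_node, link))
    except Exception as err:
        # Report the real error instead of assuming the cursor was empty.
        print("Cursor processing failed: %s" % err)
del insertcursor  # release the insert cursor's lock on Table1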
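One possible way to approach the chaining described above (each row's To_Node becoming the next row's From_Node) is to pair the list of names with a copy of itself shifted by one. This is only a sketch without arcpy: the sample `text`, the variable names, and the choice to attach each link to the node that precedes it are all assumptions, and the last node has no successor, so it gets `None` here.

```python
import re

# Hypothetical sample in the same "Name* ((description))" shape.
text = "A* ((first link)) B* ((second link)) C* ((third link))"

names = re.findall(r'\w+(?=\*)', text)       # node names before each "*"
links = re.findall(r'\(\((.*?)\)\)', text)   # text between "((" and "))"

# Shift the names by one so each node is paired with its successor.
# The final node has nothing after it, so pad with None (an assumption).
to_nodes = names[1:] + [None]

# Each tuple is (From_Node, To_Node, Link).
rows = list(zip(names, to_nodes, links))
for row in rows:
    print(row)
```

Each tuple in `rows` has the three-value shape that the `insertRow` call in the script above expects. Whether the last row should be kept or dropped depends on the data.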

score -1 · Answer 3 · answered Mar 04 '16 at 20:28

Can you use the string split method? That and some list comprehensions would do the job.

In[31]: [i for s in [s.split(')) ') for s in file.split('* ((')] for i in s]
Out[31]: 
['Name',
 'Bla Bla Bla (Bla Bla) A40 & A41',
 'Name2',
 'Bla Bla Bla (Bla Bla) A42 & A43',
 'Name3',
 'Bla Bla Bla (Bla Bla) A44 & A45',
 'Name4',
 'Bla Bla Bla (Bla Bla) A46 & A47',
 'Name5',
 'Bla Bla Bla (Bla Bla) A48 & A49',
 'Name6',
 'Bla Bla Bla (Bla Bla) A50 & A51',
 'Name7',
 'Bla Bla Bla (Bla Bla) A452 & A53',
 'Name8',
 'Bla Bla Bla (Bla Bla) A54 & A55',
 'Name9',
 'Bla Bla Bla (Bla Bla) A56 & A57',
 'Name10',
 'Bla Bla Bla (Bla Bla) A58 & A59',
 'Name11',
 'Bla Bla Bla (Bla Bla) A60 & A61',
 'Name12',
 'Bla Bla Bla (Bla Bla) A62 & A63',
 'Name13',
 'Bla Bla Bla (Bla Bla) A64 & A65',
 'Name14',
 'Bla Bla Bla (Bla Bla) A66 & A67',
 'Name14',
 'Bla Bla Bla (Bla Bla) A68 & A69))']
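For readers unfamiliar with nested comprehensions, the one-liner above can be unrolled into ordinary loops. This is a sketch on a shortened sample string; the names `chunk`, `sublist`, and `item` are chosen for illustration and stand in for the `s` and `i` of the original. It also trims the "))" that stays on the last item, which happens because the string ends with "))" and no trailing space.

```python
# Shortened sample in the same shape as the question's string (an assumption).
text = "Name* ((Bla (Bla) A40)) Name2* ((Bla (Bla) A42))"

flat = []
for chunk in text.split('* (('):     # outer split on the first separator
    sublist = chunk.split(')) ')     # inner split on the second separator
    for item in sublist:             # flatten: add each item individually
        flat.append(item)

# The original one-liner, for comparison.
one_liner = [i for s in [s.split(')) ') for s in text.split('* ((')] for i in s]
same = (flat == one_liner)

# The text ends with "))" and no space, so the last item still carries it.
if flat and flat[-1].endswith('))'):
    flat[-1] = flat[-1][:-2]

print(same)
print(flat)
```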

Raf

Thanks for the suggestion Raf. I think Padraic's answer gets me what I need. I appreciate all the help – user1457123 Mar 04 '16 at 20:39
Raf, it looks like I'm finally getting a chance to get back to this. Please pardon my lack of experience when I ask this question. What do "i" and "s" represent? Is "s" equivalent to "string"? Like I said, I'm pretty green with regard to variables, etc. – user1457123 Mar 21 '16 at 20:30
@user1457123 Hi, no problem. That's just a way of flattening a list of lists. Have a look here. [http://stackoverflow.com/a/952952/5069105] – Raf Mar 22 '16 at 10:37
So "i" = "item" and "s" = "sublist"? OK that makes sense. – user1457123 Mar 22 '16 at 13:46
So now I have to figure out how to insert the parsed data into certain fields within a table, i.e. "Name..." in one field and "Bla Bla Bla (Bla..." in another. I'm currently dumping all of the output values into one field. Any suggestions on how I can direct the data traffic to go where I want? – user1457123 Mar 22 '16 at 19:06
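One way to route the two kinds of values to different fields, assuming the flattened list strictly alternates name, description, name, description, is to slice it by position and zip the halves back together. The `pieces` list and variable names below are made up for illustration.

```python
# Hypothetical flattened output: names at even indexes, descriptions at odd.
pieces = ['Name', 'Bla A40 & A41', 'Name2', 'Bla A42 & A43']

names = pieces[0::2]          # every second item starting at index 0
descriptions = pieces[1::2]   # every second item starting at index 1

# One (name, description) tuple per output row.
rows = list(zip(names, descriptions))
for name, description in rows:
    print("%s -> %s" % (name, description))
```

Each tuple could then be written to a two-field table one row at a time. If the alternation ever breaks (for example an empty string left in the list), the slices will drift out of step, so it is worth filtering the list first.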
