0

My goal is to separate data stored in cells to multiple columns in the same row.

For example, I would like to take data that looks like this:

Row 1: [<1><2>][<3><4>][][]

Row 2: [<1><2>][<3><4>][][]

Into data that looks like this:

Row 1: [1][2][3][4]

Row 2: [1][2][3][4]

I tried using the code below to pull the csv and separate each line at the ">"

df = pd.read_csv('file.csv', engine='python', sep="\*>", header=None)

However, the code did not function as anticipated. Instead, the separation occurred at seemingly random and unpredictable points (I'm sure there's a pattern but I don't see it.) And each break created another row as opposed to another column. For example:

Row 1: [<1>][<2>]

Row 2: [<3>]

Row 3: [<4>]

I thought the issue might lie with reading the CSV file so I tried just re-scraping the site with the separator included but it produced the same results so I'm assuming its an issue with the separator call. However, I found that call after trying many others that caused various errors. For example, when I tried using sep = '>' I got the following error: ParserError: '>' expected after '"' and when I tried sep = '\>' , I got the following error: ParserError: Expected 36 fields in line 1106, saw 120. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.

These errors sent me looking though multiple resources including this and this among others.

However, I have find no resources that have successfully demonstrated how I can separate each column within a row following the use of a '>' delimiter. If anyone knows how to do this, please let me know. Your help is much appreciated!

Update:

Here is an actual screenshot of the CSV file for a better understanding of what I was trying to demonstrate above. My end goal is to have all the data is columns I+ have data on one descriptive factor as opposed to many as they do now.

enter image description here

Display name
  • 753
  • 10
  • 28
  • 1
    The data is literally separated by square and angle brackets? Can you post a few lines of input and expected output? – tdelaney Jun 06 '20 at 04:00
  • @tdelaney I have added new information to my post to help elaborate on your question. Please let me know If you need any more information from me. Thank you. – Display name Jun 06 '20 at 04:24
  • Looks like it started out as an html page with a tool converting a table to excel. It may be easier to grab data in the original XML. Exporting this table to CSV and then importing into python should give you html fragments. Perhaps you could use the `csv` module, not pandas, and then BeautifulSoup to parse and extract data. Which columns are you interested in? – tdelaney Jun 06 '20 at 04:39

2 Answers2

0

Would this work:

string="[<1><2>][<3><4>][][]"
string=string.replace("[","")
string=string.replace("]","")
string=string.replace("<","[")
string=string.replace(">","]")
print(string)

Result:

[1][2][3][4]
tfv
  • 6,016
  • 4
  • 36
  • 67
0

I ended up using Google Sheets. Once you upload the csv there is a header titled "data" and then a sub-section titled "split text to columns."

If you want a faster way to do this with code, you can also do the following with pandas:

# new data frame with split value columns 
new = data["Name"].str.split(" ", n = 1, expand = True) 

# making separate first name column from new data frame 
data["First Name"]= new[0] 

# making separate last name column from new data frame 
data["Last Name"]= new[1] 

# Dropping old Name columns 
data.drop(columns =["Name"], inplace = True) 

# df display 
data 
Display name
  • 753
  • 10
  • 28