0

I have a dataset about gun violence for a project. One of the columns includes the participant types, either victim or subject/suspect. The participant column has multiple values within it for each participant in the incident.

import pandas as pd
data = pd.read_csv('Gun violence Shortened version.csv')
data.head()

Output:

 incident_id    date    state   participant_type    
0   461105  1/1/2013    Pennsylvania    0::Victim||1::Victim||2::Victim||3::Victim||4:...   
1   460726  1/1/2013    California  0::Victim||1::Victim||2::Victim||3::Victim||4:...   
2   478855  1/1/2013    Ohio    0::Subject-Suspect||1::Subject-Suspect||2::Vic...   
3   478925  1/5/2013    Colorado    0::Victim||1::Victim||2::Victim||3::Subject-Su...   
4   478959  1/7/2013    North Carolina  0::Victim||1::Victim||2::Victim||3::Subject-Su...   

I want to take each participant and give them their own row while keeping incident_id and date the same:

incident_id date    state   participant_type    
0   461105  1/1/2013    Pennsylvania    Victim
1   461105  1/1/2013    Pennsylvania    Victim
2   461105  1/1/2013    Pennsylvania    Victim
3   461105  1/1/2013    Pennsylvania    Subject-Suspect *this was the 4:: instance that was cut off earlier*

I'm not sure how to accomplish this. I've seen example of splitting a column into two but not how to take from a column into a row.

David Ferenczy Rogožan
  • 23,966
  • 9
  • 79
  • 68
MariR
  • 3
  • 1
  • Welcome to Stack Overflow. Check the Stack Overflow's [help on asking questions](http://stackoverflow.com/help/asking) first, please. Focus on [What topics can I ask about here](http://stackoverflow.com/help/on-topic), [What types of questions should I avoid asking?](http://stackoverflow.com/help/dont-ask), [How to ask a good question](http://stackoverflow.com/help/how-to-ask), [How to create a Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve) and [Stack Overflow question checklist](http://meta.stackoverflow.com/questions/260648/stack-overflow-question-checklist). – David Ferenczy Rogožan Nov 20 '19 at 15:21
  • Thank you for accepting my answer. I would also appreciate if you can up-vote it in case it helped you as a compensation of the time I have invested into the answer. Thank you in advance. – David Ferenczy Rogožan Nov 21 '19 at 14:02

2 Answers2

0

I would rather prepare the data in advance using regular Python's data structures and then create a Pandas DataFrame out of it. The reason is that Pandas is not primarily designed for such operations like individual row manipulation and even though there're ways to do it, it's considered an anti-pattern and it's much slower.

The following piece of code parses the CSV data into a regular list using the CSV module from Python's Standard library, while adding multiple rows for each CSV row containing multiple items in the last column. In the last step, Pandas DataFrame is simply created from the preprocessed list:

import pandas as pd
import csv

data = []
with open('Gun violence Shortened versio.csv') as file:
    reader = csv.reader(file, delimiter=',')

    # iterate over all rows in the CSV
    for row in reader:
        # split the content of the last column by the || delimiter into a list
        # if there's no delimiter, it will produce a single-item list
        items = row[3].split('||')

        # append each item from the last column together with other columns
        # as an individual row to the data list, N items will produce N rows
        for item in items:
            data.append([row[0], row[1], row[2], item])

df = pd.DataFrame(data)

It's not the final solution, you'll need to skip the first row, clean the individual items from the last column, etc., but that should be trivial.

There are some benchmarks, where the manipulation with rows in Pandas was about 1000× slower than preparing data using Python's data structures and creating a DatFrame from them.

David Ferenczy Rogožan
  • 23,966
  • 9
  • 79
  • 68
  • Need to add - item.split(':')[-1] in your solution to clean the unwanted data in there. – Infinite Nov 20 '19 at 15:57
  • 1
    Sure, that's one of the things I mentioned in the last sentence. I want to help to OP, not to write code for him. – David Ferenczy Rogožan Nov 20 '19 at 16:00
  • Thank you for this solution it's been very helpful! I've been trying to make some adjustments for multiple columns that I want to do the same thing to(participant_status,participant_age, etc.) how would that work? I tried to write items_type = row[3].split('||') items_status = row[4].split('||') items = [items_type,items_status] but that puts everything into the same column and I want type and status to be their own columns. – MariR Nov 20 '19 at 18:10
  • Could you rather update your question and describe what you need there, please? I'm not sure I understand. Ideally, put there example data and how the result should look like. And let me know in a comment when it's done, so I won't miss it. – David Ferenczy Rogožan Nov 21 '19 at 14:06
0

Below is another script that gets you the output , though solution shared by Dawid looks more faster

import pandas as pd
import numpy as np


if __name__ == '__main__':
    df_new = pd.DataFrame()
    pd.set_option('display.width', 100000)
    pd.set_option('display.max_columns', 500)
    dict = {'col1':[461105,460726]
            ,'col2':['0::Victim||1::Victim||2::Victim||3::Victim||4:Subject-Suspect','0::Victim||1::Victim||2::Victim||3::Victim||4:Subject-Suspect']}
    df = pd.DataFrame(dict)
    #print(df)
    col1_list=[]
    col2_list = []
    for index,row in df.iterrows():
        for rec in row.col2.split('||'):
            col1_list.append(row.col1)
            col2_list.append(rec.split(':')[-1])
    dict_new = {'col1':col1_list,'col2':col2_list}
    df_new= pd.DataFrame(dict_new)
    #df.append(df_new,ignore_index=True)
    print(df_new)
Infinite
  • 704
  • 8
  • 27
  • I would rather prepare the data before feeding it into Pandas. Pandas has a different purpose than such manipulations with individual rows. Iteration over rows is [considered an anti-pattern in Pandas](https://stackoverflow.com/a/55557758/971141). – David Ferenczy Rogožan Nov 20 '19 at 16:09