1

Using Pandas, I have a data frame with a column containing a string that I am splitting when a ; or , is seen:

import re
re.split(';|,',x)

I want to iterate through the column in the whole data frame and create a copy of the current data frame with the new splits.

This is what I was trying based off of other answers here.

for row in x:
    if pd.notnull(x):
        SplitIDs = re.split(';|,',x)
        df.iloc[0, df.columns.get_loc('x')] = SplitIDs

I don't know how to access the particular cell that the "for loop" is currently looking at in order to change it to the split format (for the new copy of the data frame).

If I could also get instruction on how to save these changes into a new copy of the data frame, that would be great.

I apologize if my question is not clear. I am very new to scripting in general - the more detailed your explanation is, the better. Thanks!


Alternatively, what if I wanted to create new columns every time the string is split? For example, let's say the string was split into 3 parts now - instead of having the 3 strings under the same existing column, I would like the 2 new pieces are placed into new, adjacent columns.

If we went with this route, if the next row (in the same column) could split into 2 (based on the same parameters we started with), it would take up space of the existing column plus one of the new columns that we just created (and the 3rd would be blank). OR if this row had MORE than the columns we just made (and all the pieces couldn't fit), how do I keep making new columns to fit the pieces?

Eric Coy
  • 95
  • 1
  • 1
  • 6

1 Answers1

0

Let me first describe how indexing works for pandas dataframe. Assuming you have the following daframe:

df = DataFrame(randn(5,2),index=range(0,10,2),columns=list('AB'))
In [12]: df
Out[12]:
    A           B
0   0.767612    0.322622
2   0.875476    2.819955
4   1.876320    -1.591170
6   0.645850    -0.492359
8   0.148593    0.721617

Now for example in order to access a whole row you can use:

df.iloc[[2]]
    A           B   
4   1.876320    -1.591170

You can find more examples here: Pandas Slicing and Indexing. Now let say I want a new column where C where it is A+B. I can basically do the following:

df['C'] = df['A'] + df['B']

Out[23]: df
    A           B           C
0   0.767612    0.322622    1.090235
2   0.875476    2.819955    3.695431
4   1.876320    -1.591170   0.285151
6   0.645850    -0.492359   0.153490
8   0.148593    0.721617    0.870210

As you can see you do not need to access your data cell by cell, you can apply a function to a whole column at the same time. Now, say your column where strings are at is called myStrings, to create a new column based on results of applying a regular expression to that, you can do the following:

df['new_string'] = df['myStrings'].str.replace(r'(\b\S)', r'+\1')

You can apply your own regular expression here. For more on .str function you can check here. To be more specific about what you want:

data = {'raw': ['Arizona 1',
                'Iowa 1',
                'Oregon 0']}
df = pd.DataFrame(data, columns = ['raw'])
df
Out[31]:
    raw
0   Arizona 1
1   Iowa 1
2   Oregon 0

And you want to split this based on space and save the two in two new columns (or even a new dataframe):

df['firstSplit'] = df['raw'].str.split(' ').str.get(0)

This will result the following which I believe is what you are looking for:

df
Out[30]:
    raw         firstSplit
0   Arizona 1   Arizona
1   Iowa 1      Iowa
2   Oregon 0    Oregon
ahajib
  • 12,838
  • 29
  • 79
  • 120
  • `column_name.str.split(',', expand=True)` This not only splits my strings under the selected column, but also widens the data frame by adding ("expanding") more columns after splitting so each split string has its own column. Courtesy of [Working with Text Data](http://pandas.pydata.org/pandas-docs/stable/indexing.html) - "Splitting and Replacing Strings" – Eric Coy Jun 23 '16 at 18:28
  • @EricCoy True but you can still get what you want out of it like what I did in the example. I edited my answer, check out the last example. – ahajib Jun 23 '16 at 18:43
  • If you use ```get(1)``` instead of ```get(0)``` it will give you the second element and so on... – ahajib Jun 23 '16 at 18:49
  • If the data frame had existing additional headers, how would I apply the changes to the data frame we made so I can then manipulate that further? For example, let's say in the example with the "raw" header, there were existing headers, "1", "2", "3" (each with their own data) - let's call this df1 where this has "raw", "1", "2", "3" as headers. Then, after parsing "raw", we now have "raw" and "first split", but I would like to save this into a new data frame with ALL the headers. So df2 should be "raw" "firstsplit", "1", "2", "3" as headers with their respective values. How do I do this? – Eric Coy Jun 27 '16 at 14:13
  • The closest I can get is `df2 = pd.concat([df1, firstSplit], axis=1)` But this places firstSplit on the end instead of replacing where it existed in df1. – Eric Coy Jun 27 '16 at 14:28
  • @EricCoy That is a different question then. You can just make a copy of the existing dataframe. Take a look at this: http://stackoverflow.com/questions/27673231/why-should-i-make-a-copy-of-a-data-frame-in-pandas – ahajib Jun 27 '16 at 15:02