1

I have a data frame where the error has crept, where data composed of two different words was split into two separate cells. Example:frame where the error has crept. Example:

    col1    col2    col3    col4   col5
0   A       1       2       3   
1   B       C       3       4      5
2   D       6       7       8   
3   B       E       9       10     11

I would like connect the letters from cells from the same line combine into one cell, while simultaneously moving all the rest of the row to the left. To get the result in the form:

    col1    col2    col3    col4   col5
0   A       1       2       3   
1   B C     3       4       5
2   D       6       7       8   
3   B E     9       10      11

I'm using now df.replace(['C', 'E'], [np.nan, np.nan], regex=True), and later df.iloc[[n]].dropna(axis=1, how="any") by the columns. But this is not ideal, because in the example above I get:

    col1    col2    col3    col4   col5
0   A       1       2       3   
1   B       3       4       5
2   D       6       7       8   
3   B       9       10      11

So two lines of the same name, which in fact is not true. Because they should be called 'B C' and 'B E', respectively. In addition, each time I have to manually enter the cell names to switch to NaN, depending on the data. Do you have any some ideas?

Tomasz Przemski
  • 1,127
  • 9
  • 29

1 Answers1

1

The solution I provided is overkill ...But still work

df1=df.stack().to_frame()

m=df1[0].groupby(level=0).apply(lambda x :x.str.isalpha().diff().ne(0).cumsum())
df1.groupby([df1.index.get_level_values(0),m]).sum().unstack().add_prefix('col1_')
Out[116]: 
  col1_0                     
0 col1_1 col1_2 col1_3 col1_4
0      A      1      2      3
1     BC      3      4      5
2      D      6      7      8
3     BE      9     10     11

Just change here

m=df1[0].groupby(level=0).apply(lambda x :x.str.isalpha().ne(1).cumsum())
df1.groupby([df1.index.get_level_values(0),m]).sum().unstack().add_prefix('col1_')
Out[164]: 
  col1_0                     
0 col1_0 col1_1 col1_2 col1_3
0      A    1,5    2,5    3,5
1     BC    3,5    4,5    5,5
2      D    6,5    7,5    8,5
3     BE    9,5   10,5   11,5
BENY
  • 317,841
  • 20
  • 164
  • 234
  • A little twisted, but it works :) now I'm trying to get rid of the col1_0 line. – Tomasz Przemski Nov 09 '17 at 19:48
  • @TomaszPrzemski https://stackoverflow.com/questions/22233488/pandas-drop-a-level-from-a-multi-level-column-index – BENY Nov 09 '17 at 19:48
  • I just found :) – Tomasz Przemski Nov 09 '17 at 19:50
  • @TomaszPrzemski :-) , yeah, I have habits always keep the original out I get from the 1st place which is not good :-) – BENY Nov 09 '17 at 19:51
  • @TomaszPrzemski is that work for your real data set ? If not , please let me know – BENY Nov 09 '17 at 19:55
  • My data has a very similar structure. but in cells I have decimal numbers with a comma. and the code above puts all the values into one column. And I can't apply `df1.to_string (header = False, index = False) .replace (',', '.')` Because it is a problem with 'str' object. – Tomasz Przemski Nov 09 '17 at 20:15
  • @TomaszPrzemski We can replace it by using `df=df.replace({',':''},regex=True)` or `df=df.replace({',':'.'},regex=True)` – BENY Nov 09 '17 at 20:17
  • I thought it would solve the problem, but all the values are dropped into one column... It follows that the problem lies somewhere else. – Tomasz Przemski Nov 09 '17 at 20:25
  • @TomaszPrzemski I am update my answer , I thought other should be numeric not str ... – BENY Nov 09 '17 at 20:34
  • @You sure , the data you provided is the data you are testing ? – BENY Nov 09 '17 at 21:26
  • Everything is clear. I have in the data extra column with indexes 0,1,2,3: / After clear everything works as it should :) Thanks great Wen! – Tomasz Przemski Nov 09 '17 at 21:38
  • Can some separators be used when using `cumsum()`? – Tomasz Przemski Nov 09 '17 at 22:23
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/158637/discussion-between-wen-and-tomasz-przemski). – BENY Nov 09 '17 at 22:24
  • @TomaszPrzemski https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.cumsum.html you can check here – BENY Nov 09 '17 at 22:25