Split dataframe columns value prefix and make headers in python dataframe

Question

I have a traffic data that looks like this. Here, each column have data in format meters:seconds. Like in row 1 column 2, 57:9 represents 57 meters and 9 seconds.

0	1	2	3	4	5	6	7	8	9
0:0	57:9	166:34	178:37	203:44	328:63	344:65	436:77	737:108	None
0:0	166:34	178:37	203:43	328:61	436:74	596:51	737:106	None	None
0:0	57:6	166:30	178:33	203:40	328:62	344:64	436:74	596:91	None
0:0	203:43	328:61	None	None	None	None	None	None	None
0:0	57:7	166:20	178:43	203:10	328:61	None	None	None	None

I want to extract meters values from the dataframe and store them in a list in ascending order. Then create a new dataframe in which the the column header will be the meters value (present in the list). Then it will match the meter value in the parent dataframe and add the corresponding second value beneath. The missing meters:second pair should be replaced by NaN and the current pair at the position would move to next column within same row.

The desired outcome is: list = [0,57,166,178,203,328,344,436,596,737]

dataframe:

57	166	178	203	328	344	436	596	737
9	34	37	44	63	65	77	NaN	108
NaN	34	37	43	61	NaN	74	51	106
6	30	33	40	62	64	74	91	None
NaN	NaN	NaN	43	61	None	None	None	None
7	20	43	10	61	None	None	None	None

I know I must use a loop to iterate over whole dataframe. I am new to python so I am unable to solve this. I tried using str.split() but it work only on 1 column. I have 98 columns and 290 rows. This is just one month data. I will be having 12 month data. So, need suggestions and help.

Andrej Kesely · Accepted Answer · 2022-08-30T22:02:22.887

3

Try:

tmp = df1.apply(
    lambda x: dict(
        map(int, val.split(":"))
        for val in x
        if isinstance(val, str) and ":" in val
    ),
    axis=1,
).to_list()
out = pd.DataFrame(tmp)
print(out[sorted(out.columns)])

Prints:

   0    57    166   178  203  328   344   436   596    737
0    0  9.0  34.0  37.0   44   63  65.0  77.0   NaN  108.0
1    0  NaN  34.0  37.0   43   61   NaN  74.0  51.0  106.0
2    0  6.0  30.0  33.0   40   62  64.0  74.0  91.0    NaN
3    0  NaN   NaN   NaN   43   61   NaN   NaN   NaN    NaN
4    0  7.0  20.0  43.0   10   61   NaN   NaN   NaN    NaN

edited Aug 30 '22 at 22:02

answered Aug 30 '22 at 21:57

Andrej Kesely

168,389
15
48
91

My original dataframe name is `df1`. I am getting this error: `AttributeError: 'NoneType' object has no attribute 'split'` – Omicron Aug 30 '22 at 22:01
1

@Omicron See my edit. – Andrej Kesely Aug 30 '22 at 22:02
the `out` creates the dataframe. But column names are not sorted. The print statement does sort, but I want them as a sorted dataframe. – Omicron Aug 30 '22 at 22:08
1

@Omicron Try `out = out[sorted(out.columns)]`. Then `print(out)` – Andrej Kesely Aug 30 '22 at 22:10
Cool. Only problem is the `floating numbers`. I wanted `int`. When I tried `out = out.astype('int64')`, it gave me this error: `IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer' – Omicron Aug 30 '22 at 22:22
1

@Omicron For that question I suggest to read these answers: https://stackoverflow.com/questions/11548005/numpy-or-pandas-keeping-array-type-as-integer-while-having-a-nan-value (in short, try `out = out[sorted(out.columns)].round().astype("Int64")`) – Andrej Kesely Aug 30 '22 at 22:24
1

Thank you :) It did worked. However it changed `NaN` to ``. But that's not a big deal. Also, this code also worked. `out = out.astype('Int64')`. – Omicron Aug 30 '22 at 22:34
Can you please briefly explain what the code do? I am trying to figure it out but not getting the idea. – Omicron Aug 31 '22 at 07:37
1

@Omicron I'm going through the dataframe row by row. In every cell I check if the data inside it is string and it contains `:`. If yes, I split the string to two and create a dictionary from every row. The collect these dictionaries to list and create new dataframe. Then sort the columns afterwards. – Andrej Kesely Aug 31 '22 at 07:42

Split dataframe columns value prefix and make headers in python dataframe

dataframe:

1 Answers1