Struggling to clean the column in pandas

Question

I need help with cleaning a column of my data. So, basically in the column, in each separate cell there are dates, time, letter, floating points so many other type of data. The datatype of this column is 'Object'.

What I want to do is, remove all the dates and replace it with empty cells and keep only the time in the entire column. And then I want to insert the average time into the empty cells.

I'm using pycharm and using PANDAS to clean the column.

[enter image description here][1]

Could you add a small portion of the column? Or are you looking for general advice? — Chrysophylaxs, Jan 02 '23 at 00:00
11:00:00 16:00:00 12:30:00 11:30:00 -0.2798558 15/08/2017 13:00:00 15/08/2017 13:40:00 14:00:00 12:00:00 13:30:00 14:30:00 12:10:00 09:00:00 11:00:00 07:30:00 18:30:00 07:30:00 10:45:00 09:00:00 11:00:00 07/03/2007 16/05/2007 08:00:00 09:30:00 07:30:00 10:00:00 09:00:00 09:00:00 07:30:00 08:00:00 15:00:00 10:00:00 07:40:00 11:10:00 10:00:00 07:40:00 07:40:00 10/10/2009 14:45:00 07:30:00 14:20:00 07:30:00 07:45:00 11:00:00 12:29:00 -0.570358 07:32:00 14:34:00 P 09:00:00 12:26:00 12:40:00 — ham007, Jan 02 '23 at 00:05
Each of the value is in the separate cell (like excel, I'm working on .csv file). I want to get rid letter, float values, dates. And just leave the time . — ham007, Jan 02 '23 at 00:06
Please provide enough code so others can better understand or reproduce the problem. — Community, Jan 02 '23 at 00:08
You can use `df["your_col"].str.match(time_regex)` to match every cell against a regex pattern. That will return a series of True and False in case a match was found. You can use that series to either remove cells or overwrite them. — Chrysophylaxs, Jan 02 '23 at 00:11
That is assuming that all entries are strings. You can yoink a regex from e.g. here: https://stackoverflow.com/questions/8318236/regex-pattern-for-hhmmss-time-string — Chrysophylaxs, Jan 02 '23 at 00:11
Are you able to send me a youtube video where I can see how can I get rid to values of I dont need from the column. — ham007, Jan 02 '23 at 00:22

Jisong Xiao · Answer 1 · 2023-01-02T01:05:41.967

0

I would imagine you can achieve this with something along the lines of below. For time format, it seems like for your data column just checking if string contains 2 semi colons is enough. You can also specify something more robust:

def string_splitter (x):
   x=x.split()
   y=[]
   for stuff in x:
      if stuff.index(":")>1: #<you can also replace with a more robust pattern for time>
          y.append(stuff)
      else:
          y.append("")#<add your string for indicating empty space>
return " ".join(y)

df['column_name'].apply(string_splitter)

edited Jan 02 '23 at 01:05

answered Jan 02 '23 at 00:57

Jisong Xiao

91
2

As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – user11717481 Jan 02 '23 at 01:29
This what I'm getting "AttributeError: 'float' object has no attribute 'split'" – ham007 Jan 02 '23 at 01:35
I assumed your column is a string to be able to store that many different values in one column, but it seems like I misunderstood the question...is there a way for you to post outputs of df.info() and df.head() into the question (if the data is not sensitive)? – Jisong Xiao Jan 02 '23 at 17:25

Struggling to clean the column in pandas

1 Answers1