1

I'm looking for help in splitting my data. My data has spaces as thousand separators but there's also spaces between my timestamps.

This is an example of what the data looks like (this is currently 1 column):

Date/Time Var1 Var2 Var3 Var4 Var5 Var6
17/04/2020 00:00:00 133 579.20 31 978.90 377 952.81 179 412.41 203 595.20 70 447.40 
17/04/2020 01:00:00 133 583.70 31 980.40 377 963.41 179 412.41 203 595.20 70 448.20

I would need it to look like this:

Date/Time           Var1      Var2     Var3      Var4      Var5      Var6
17/04/2020 00:00:00 133579.20 31978.90 377952.81 179412.41 203595.20 70447.40 
17/04/2020 01:00:00 133583.70 31980.40 377963.41 179412.41 203595.20 70448.20

I'm trying to go around this an odd and over-complicated way I believe. In one instance, I removed all whitespaces and then to split the dates I did:

df.iloc[:,3] = df.iloc[:,3].str.replace('/2020', '/2020 ').str.replace(':00:00', ':00:00 ')

And for then splitting all the numbers, I was trying to do something such as reading every character in a string and once it found a dot, add a whitespace 2 strings ahead but I didn't manage to get that to work.

for i in range(len(df)):
    for j in df.iloc[i,:]:
        for k in j:
            if k == '.':
               #to be continued

Surely there's a much quicker way to get around this. Could anyone give me a hand?

Tom
  • 85
  • 1
  • 7
  • So you want to remove or add white-spaces? Title and question are contradictory. – Pedro Lobito May 01 '20 at 12:59
  • I'd like to remove the thousand separator white-spaces. Removing every white-space and then attempting to add one 2 characters after a period is what I've tried myself as a workaround (same outcome). – Tom May 01 '20 at 13:11

3 Answers3

2

Assuming df is your present dataframe and it has one column named simply 'D' (if it is not 'D', change accordingly):

tmplist = df['D'].str.findall(r'(.+?[:.]\S+\s+)').to_list()
tmplist = [ [ e.replace(' ','') if i>0 else e.rstrip() for i, e in enumerate(row) ] for row in tmplist ]
col = ['Date/Time'] + [ 'Var{}'.format(i) for i in range(1,len(tmplist[0])) ]
df = pandas.DataFrame(tmplist, columns=col)

The first line converts the dataframe into a list of lists, splitting your strings as needed. A certain trick is used: a dot or colon followed by digits is the last of space-separated items that belong to one column (colon for timestamp, dot for floating-point numbers).

The second one removes all spaces from all columns except the first (timestamps), where it only removes trailing spaces.

Next one creates columns names according to your wish.

The last one rebuilds the dataframe from the list.

Błotosmętek
  • 12,717
  • 19
  • 29
  • This is it Błotosmętek. The columns aren't actually called var1, var2 etc but I can just df = pd.DataFrame(tmplist, columns=col_names). Thank you so much! – Tom May 01 '20 at 14:33
1

You can try:

out = df[df.columns[0]].str.split(r"\s+")\
    .apply(lambda x: pd.Series([" ".join(x[:2])] + ["{}{}".format(a, b) for a, b in zip(x[2::2], x[3::2])])) \
    .rename(columns={old:new for old, new in enumerate(cols)})

Explanations:

  1. After selecting the only column using df[df.columns[0]], split this column according all spaces using pandas.Series.str.split. The regex is simple \s+: df[df.columns[0]].str.split(r"\s+")
  2. Apply a custom function on each rows using apply.
    1. First recreate the date by merging the 2 first elements with a space, convert to a date and wrap it in a list: [pd.Timestamp(" ".join(x[:2]))]
    2. Merge all the others values 2-by2 using zip. This discussion provides more details.

[float("{}{}".format(a, b)) for a, b in zip(x[2::2], x[3::2])]

  1. Convert this list in pd.Series wrapping result from step 2.1 and 2.2.

  2. Rename the columns using rename. A dict-comprehension let us perform the expected result: .rename(columns={old:new for old, new in enumerate(cols)})


Full code + illustration:

print(df)
#              Date/Time Var1 Var2 Var3 Var4 Var5 Var6
# 0  17/04/2020 00:00:00 133 579.20 31 978.90 377 9...
# 1  17/04/2020 01:00:00 133 583.70 31 980.40 377 9...

# Step 1
print(df[df.columns[0]].str.split(r"\s+"))
# 0    [17/04/2020, 00:00:00, 133, 579.20, 31, 978.90...
# 1    [17/04/2020, 01:00:00, 133, 583.70, 31, 980.40...
# Name: Date/Time Var1 Var2 Var3 Var4 Var5 Var6, dtype: object

# Step 2.1
print(df[df.columns[0]].str.split(r"\s+")
      .apply(lambda x: [pd.Timestamp(" ".join(x[:2]))]))
# 0    [2020-04-17 00:00:00]
# 1    [2020-04-17 01:00:00]
# Name: Date/Time Var1 Var2 Var3 Var4 Var5 Var6, dtype: object

# Step 2.2
print(df[df.columns[0]].str.split(r"\s+")
      .apply(lambda x: [float("{}{}".format(a, b)) for a, b in zip(x[2::2], x[3::2])]))
# 0    [133579.2, 31978.9, 377952.81, 179412.41, 2035...
# 1    [133583.7, 31980.4, 377963.41, 179412.41, 2035...
# Name: Date/Time Var1 Var2 Var3 Var4 Var5 Var6, dtype: object

# Step 2.3
print(df[df.columns[0]].str.split(r"\s+")
      .apply(lambda x: pd.Series([pd.Timestamp(" ".join(x[:2]))] + [float("{}{}".format(a, b)) for a, b in zip(x[2::2], x[3::2])])))
#                     0         1        2          3          4         5        6
# 0 2020-04-17 00:00:00  133579.2  31978.9  377952.81  179412.41  203595.2  70447.4
# 1 2020-04-17 01:00:00  133583.7  31980.4  377963.41  179412.41  203595.2  70448.2


# Step 3
print(df.columns[0].split(" "))
# ['Date/Time', 'Var1', 'Var2', 'Var3', 'Var4', 'Var5', 'Var6']


out = df[df.columns[0]].str.split(r"\s+")\
    .apply(lambda x: pd.Series([pd.Timestamp(" ".join(x[:2]))] + [float("{}{}".format(a, b)) for a, b in zip(x[2::2], x[3::2])])) \
    .rename(columns={old: new for old, new in enumerate(df.columns[0].split(" "))})
print(out)
#             Date/Time      Var1     Var2       Var3       Var4      Var5     Var6
# 0 2020-04-17 00:00:00  133579.2  31978.9  377952.81  179412.41  203595.2  70447.4
# 1 2020-04-17 01:00:00  133583.7  31980.4  377963.41  179412.41  203595.2  70448.2


print(out.dtypes)
# Date/Time    datetime64[ns]
# Var1                float64
# Var2                float64
# Var3                float64
# Var4                float64
# Var5                float64
# Var6                float64
# dtype: object
marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Alexandre B.
  • 5,387
  • 2
  • 17
  • 40
  • I believe this works as long as all numbers contain exactly one thousand separator, so it fails for numbers below 1000 and above 1000000. Which might be acceptable to the OP, or not. – Błotosmętek May 01 '20 at 14:25
  • Yes you're right, if number format changes, `regex` is the solution as you answered :-) – Alexandre B. May 01 '20 at 14:48
0
string = """Date/Time Var1 Var2 Var3 Var4 Var5 Var6
17/04/2020 00:00:00 133 579.20 31 978.90 377 952.81 179 412.41 203 595.20 70 447.40 
17/04/2020 01:00:00 133 583.70 31 980.40 377 963.41 179 412.41 203 595.20 70 448.20"""

head = string.split('\n')[0].split(' ')
values = []
import re
value_regex = re.compile(' \d\d\d\.\d*')
timestamp_regex = re.compile('\d\d:\d\d:\d\d')
for line in string.split('\n')[1:]:
    for value in value_regex.findall(line):
        line = line.replace(value, value.replace(' ','')+',')
    for timestamp in timestamp_regex.findall(line):
        line = line.replace(timestamp, timestamp+',')
    value_cur_line =  [sep[1:] if sep.startswith(' ') else sep for sep in line.split(',') if sep.replace(' ','')!='']
    values.append(value_cur_line)
lucky6qi
  • 965
  • 7
  • 10
  • lucky6qi, thank you so much for that. The data I have is a dataset (Pandas DataFrame), rather than a string. The first line is the header and the other 2 are just the 1st 2 rows of it. Would you be able to adjust that code with that in mind? Apologies if I wasn't as clear, I'm just not fluent enough to get that sorted I think. – Tom May 01 '20 at 13:57