I have a list of strings like the following:
orig = ["a1 2.3 ABC 4 DEFG 567 b890",
"a2 3.0 HI 4 5 JKL 67 c65",
"b1 1.2 MNOP 3 45 67 89 QR 987 d64 e112"]
For context: this comes from a CSV file in which certain columns are omitted, and I don't think the pandas CSV reader can handle such cases. The idea is to insert na for the missing values, so the output becomes
corr = ["a1 2.3 ABC 4 na na na DEFG 567 b890",
"a2 3.0 HI 4 5 na na JKL 67 c65",
"b1 1.2 MNOP 3 45 67 89 QR 987 d64 e112"]
This aligns the second upper-case column when the data is later imported into pandas.
The structure is as follows: columns are delimited by two or more whitespace characters, and there must be four values between the two upper-case columns. In the original file there are always exactly two upper-case columns, with at least one and at most four numbers between them, and only numeric values appear between these upper-case words.
I can write a script for this in plain Python without any problem, so please no suggestions along those lines. But I thought this might be a case for regex. As a regex beginner, I have only managed to extract the string between the two upper-case columns with
import re
for line in orig:
    # capture: upper-case word, then digits/whitespace, then upper-case word
    a = re.findall(r"([A-Z]+[\s\d]+[A-Z]+)", line)
    print(a)
>>> ['ABC 4 DEFG']  # etc.
Is there an easy way in regex to determine how many numbers are between the upper-case words and insert 'na' values so that there are always four values in between? Or should I do it in native Python?
Of course, if there is a way to do this with the pandas CSV reader, that would be even better. But I have studied the pandas read_csv docs and haven't found anything useful.
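For illustration, here is a rough sketch of how the padding could be expressed with `re.sub` and a replacement function. It is not a tested solution for the real file. The helper name `pad_line` and its `width`, `filler` and `sep` parameters are made up for this example. It assumes the values between the two upper-case words are plain integers, and it rejoins the matched part with a single space, as in the sample above.

```python
import re

# Upper-case word, one or more whitespace-separated integers, upper-case word.
# Assumption: the values between the two words are plain integers.
PATTERN = re.compile(r"\b([A-Z]+)((?:\s+\d+)+)\s+([A-Z]+)\b")


def pad_line(line, width=4, filler="na", sep=" "):
    """Pad the numbers between the two upper-case words up to `width` values."""

    def repl(match):
        numbers = match.group(2).split()
        # A negative count yields an empty list, so longer runs are left as-is.
        padding = [filler] * (width - len(numbers))
        return sep.join([match.group(1), *numbers, *padding, match.group(3)])

    # Only the first upper-case ... upper-case span in the line is rewritten.
    return PATTERN.sub(repl, line, count=1)


orig = ["a1 2.3 ABC 4 DEFG 567 b890",
        "a2 3.0 HI 4 5 JKL 67 c65",
        "b1 1.2 MNOP 3 45 67 89 QR 987 d64 e112"]

corr = [pad_line(line) for line in orig]
for line in corr:
    print(line)
```

Strictly speaking this is regex plus a small Python callback, not a pure regex substitution. The callback is what counts the numbers and decides how many fillers to add. For a file that really uses two or more spaces as the delimiter, `sep` would presumably need to be set to two spaces so that the padded part stays parseable.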