
I have a list of strings like the following:

orig = ["a1  2.3  ABC  4  DEFG  567  b890",
        "a2  3.0  HI  4  5  JKL  67  c65",
        "b1  1.2  MNOP  3  45  67  89  QR  987  d64  e112"]

The context: these lines come from a CSV file in which certain columns are omitted, and I don't think the pandas CSV reader can handle such cases. The idea is now to inject na for the missing values, so that the output becomes

corr = ["a1  2.3  ABC  4  na  na  na  DEFG  567  b890",
        "a2  3.0  HI  4  5  na  na  JKL  67  c65",
        "b1  1.2  MNOP  3  45  67  89  QR  987  d64  e112"]

so that the second column of capitalised words is aligned later on, when the data is imported into pandas.

The structure is the following: delimiters between columns are two or more whitespace characters, and between the two upper case columns there have to be four values. In the original file, there are always exactly two upper case columns, there are at least one and at most four numbers between them, and only number values appear between these upper case words.
I can write a script in native Python without problem, so please no suggestions for that. But I thought this might be a case for regex. As a regex beginner, I only managed to extract the string between the two upper case columns with

import re

for line in orig:
    a = re.findall(r"([A-Z]+[\s\d]+[A-Z]+)", line)
    print(a)
>>> ['ABC  4  DEFG'] #etc. pp.

Is there now an easy way in regex to determine how many numbers are between the upper case words and to insert 'na' values so that there are always four values in between? Or should I do it in native Python?

Of course, if there is a way to do this with the pandas CSV reader, that would be even better. But I have studied the pandas read_csv docs and haven't found anything useful.

  • Use simple Python, that's not easy to do with regex - you would need to do something with backtracking and grouping and maybe inserting dummies before replacing - if possible at all, which I doubt. Regex is the wrong tool here. – Patrick Artner Jan 27 '18 at 14:03
  • I had this impression, too, after reading the re module manual and looking at regex questions here on SO. But I am often surprised by what people do, so I thought I'd ask. – Mr. T Jan 27 '18 at 14:14
  • What happens if there are already more than 4 numbers between capitalised words, or just text like "ABC DEF GHI JKL MNO"? Or if there's a mix of numbers and codes that aren't all uppercase between the uppercase words? Are there any criteria except that there should be "4 words" between the uppercase delimiters? – Jon Clements Jan 27 '18 at 14:29
  • 1
    @Piinthesky okay - thanks. So if you're not adverse to identify the portions between the string using `re.findall` - you can capture some groups and use a callable replacement function in `re.sub`? off the top of my head I've just tried: `re.sub(r'\s{2,}([A-Z]+)\s{2,}(.*?)\s{2,}([A-Z]+)\b', lambda m: ' '.join((m.group(1), 'na' * 4, m.group(3))), orig[1])` - which needs work but might be something useful... – Jon Clements Jan 27 '18 at 14:48
  • 1
    @Piinthesky just always trying to help people learn... do you have an example of *I can write without problem a script in native Python*? At this point - I'm thinking you can actually use some of that already and just let regex handle the calling the replace. – Jon Clements Jan 27 '18 at 14:57
  • Nothing fancy. Split on delimiter, count chunks between capitalised words, insert `na` as appropriate, stitch with join. Rinse and repeat (a sketch of this follows the comments below). – Mr. T Jan 27 '18 at 15:03
  • @Piinthesky okay... all that regex will do is identify chunks and call your function anyway... – Jon Clements Jan 27 '18 at 15:26
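A minimal sketch of that split/count/join approach (the helper name fill_missing and the str.isupper test for spotting the capitalised columns are assumptions for illustration, not taken from the thread):

import re

def fill_missing(line, target=4):
    #split the line on runs of two or more spaces
    chunks = re.split(r"\s{2,}", line)
    #indices of the two all upper case words; note "2.3".isupper() is False
    first, second = [i for i, c in enumerate(chunks) if c.isupper()][:2]
    #how many values are missing between the two capitalised columns
    missing = target - (second - first - 1)
    #inject the placeholders just before the second capitalised word
    chunks[second:second] = ["na"] * missing
    return "  ".join(chunks)

corr = [fill_missing(line) for line in orig]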

2 Answers


For a complete pandas approach, split and concat might help, i.e.

import pandas as pd

ndf = pd.Series(orig).str.split(expand=True)

#   0    1     2  3     4    5     6     7     8     9     10
#0  a1  2.3   ABC  4  DEFG  567  b890  None  None  None  None
#1  a2  3.0    HI  4     5  JKL    67   c65  None  None  None
#2  b1  1.2  MNOP  3    45   67    89    QR   987   d64  e112

#move the None values to the front of the trailing columns, keep the first four columns as-is
df = pd.concat([ndf.iloc[:, :4], ndf.iloc[:, 4:].apply(sorted, key=pd.notnull, axis=1)], axis=1)

df.astype(str).apply(' '.join,axis=1).tolist()

['a1 2.3 ABC 4 None None None None DEFG 567 b890',
 'a2 3.0 HI 4 None None None 5 JKL 67 c65',
 'b1 1.2 MNOP 3 45 67 89 QR 987 d64 e112']
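The sorted call works because pd.notnull returns False for None and False sorts before True, so Python's stable sort moves the None values to the front of the trailing columns. A minimal illustration:

import pandas as pd

row = ["DEFG", 567, "b890", None, None, None, None]
print(sorted(row, key=pd.notnull))
#[None, None, None, None, 'DEFG', 567, 'b890']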
  • It seems your code fills the columns between the first and the second number with None values so that all rows have the same length. Unfortunately, the rows have different lengths after the second upper case column, so this is not the desired output. But I think I will follow your lead and try to do this in pandas. Thanks for the suggestion. – Mr. T Jan 27 '18 at 14:45

Though the consensus seems to be that regex is not the best tool for such a dynamic string substitution, I found the re module quite comfortable to use in this context. The capturing pattern is based on a comment by Jon Clements.

import re

orig = ["a1  2.3  ABC  4  DEFG  567  b890",
        "a2  3.0  HI  4  5  JKL  67  c65",
        "b1  1.2  MNOP  3  45  67  89  QR  987  d64  e112"]

corr = []

for item in orig:
    #capture group starting with first capitalised word and stopping before the second
    col_betw = re.search(r"\s{2,}([A-Z]+.*)\s{2,}[A-Z]+\s{2,}", item).group(1)
    #determine how many elements we have in this segment
    nr_col_betw = len(re.split(r"\s{2,}", col_betw))
    #substitute, if not enough numbers
    if nr_col_betw <= 4:
        #fill with NA, which is interpreted by pandas csv reader as NaN
        subst = col_betw + "   NA" * (5 - nr_col_betw) 
        item = item.replace(col_betw, subst, 1)
    corr.append(item)
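For completeness, the corrected lines can then be read back with pandas; a minimal sketch, assuming the real file has no header row and keeps the two-or-more-spaces delimiter:

import io
import pandas as pd

#a regex separator needs the python engine; the injected NA tokens are parsed as NaN
df = pd.read_csv(io.StringIO("\n".join(corr)), sep=r"\s{2,}",
                 engine="python", header=None)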