
I have tokenised sentences, each stored as a list of tokens, for example:

text = ['Selegiline',
 '-',
 'induced',
 'postural',
 'hypotension',
 'in',
 'Parkinson',
 "'",
 's',
 'disease',
 ':',
 'a',
 'longitudinal',
 'study',
 'on',
 'the',
 'effects',
 'of',
 'drug',
 'withdrawal',
 '.']

I want to convert this list into a string, but where punctuation such as - or : appears, I want to remove the extra space, so the final output would look something like this:

Selegiline-induced postural hypotension in Parkinson's disease: a longitudinal study on the effects of drug withdrawal

I tried splitting the list into chunks of two tokens and checking each pair: if both tokens are words, I join them with a single space; otherwise, I join them with no space:

def chunks(xs, n):
    n = max(1, n)
    return (xs[i:i+n] for i in range(0, len(xs), n))
data_first = list(chunks(text, 2))

def check(data):
  second_order = []
  for words in data:
    if all(c.isalpha() for c in words[0]) and all(c.isalpha() for c in words[1]):
      second_order.append(" ".join(words))
    else:
      second_order.append("".join(words))
  return second_order

check(data_first)

But I would have to repeat this on the result until the last word has been joined (a recursive solution). Is there a better way to do this?

Aaditya Ura
  • I notice there are two types of spacing requirements. Things like dashes and apostrophes don't need a space on the left or right, but things like commas and colons need no space on the left and one space on the right. – Steven Rumbalski Sep 20 '22 at 18:24
  • I would argue that you should instead look at the tokenizer. It's unlikely that the correct output of `Parkinson's` should be `Parkinson ' s` – Adam Smith Sep 20 '22 at 18:33

3 Answers

One option might be to create a dictionary that maps each spaced punctuation pattern to its replacement string, since each punctuation mark seems to follow different rules (a colon should keep the space after itself, whereas a dash should not).

Something like:

# spaced punctuation pattern -> replacement
punctdict = {' - ': '-', ' : ': ': ', " ' ": "'"}
sentence = ' '.join(text)
for k, v in punctdict.items():
    sentence = sentence.replace(k, v)
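
The same mapping idea can also be applied in a single pass with a regular expression. This is only a sketch: the names `PUNCT`, `PUNCT_RE` and `detokenise` are made up for illustration, and the `" ."` entry for a sentence-final period is an added assumption, since the mapping above has no rule for it.

```python
import re

# Spaced punctuation pattern -> replacement.
# The " ." entry (final period) is an assumption added for this sketch.
PUNCT = {" - ": "-", " : ": ": ", " ' ": "'", " .": "."}

# One alternation that matches any of the patterns literally.
PUNCT_RE = re.compile("|".join(re.escape(key) for key in PUNCT))


def detokenise(tokens):
    """Join tokens with spaces, then rewrite every punctuation pattern in one pass."""
    joined = " ".join(tokens)
    return PUNCT_RE.sub(lambda match: PUNCT[match.group(0)], joined)


tokens = ["Selegiline", "-", "induced", "in", "Parkinson", "'", "s",
          "disease", ":", "a", "study", "."]
print(detokenise(tokens))  # Selegiline-induced in Parkinson's disease: a study.
```

Two punctuation tokens in a row share a single space after joining, so that case may need extra patterns.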
JNevill
text = ['Selegiline',
 '-',
 'induced',
 'postural',
 'hypotension',
 'in',
 'Parkinson',
 "'",
 's',
 'disease',
 ':',
 'a',
 'longitudinal',
 'study',
 'on',
 'the',
 'effects',
 'of',
 'drug',
 'withdrawal',
 '.']
 
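def txt_join(txt):
    ans = ""
    for s in txt:
        if s in (".", ":"):
            # no space before, one space after
            ans = ans.rstrip() + s + " "
        elif s in ("'", "-"):
            # no space on either side
            ans = ans.rstrip() + s
        else:
            ans = ans + s + " "
    return ans.rstrip()  # drop the trailing space

print(txt_join(text))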

As I understand it, this gives the expected result. The function loops through the text list and adds or removes spaces depending on the punctuation token. To handle other punctuation marks, add further if/elif/else conditions.

YJR

What you're looking for is a list comprehension; you can read more about it here. You could build the pieces with a list comprehension and then use the string replace method to replace the unwanted spaces with nothing, much as you've done with append in your solution. You may find this solution useful; it uses .strip instead of replace. I would avoid explicit for loops over lists where a list comprehension does the job, as it is usually more concise and often a little faster. Also, this is my first answer, so sorry if it's a bit confusing.
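
A minimal sketch of what a comprehension-based version might look like, assuming the two spacing classes described in the comments under the question (no space on either side, versus no space on the left only). The names `NO_SPACE_BEFORE`, `NO_SPACE_AFTER` and `join_tokens` are hypothetical, and the sets would need extending for other punctuation.

```python
# Tokens that attach to the previous token (no space on their left).
NO_SPACE_BEFORE = {".", ",", ":", ";", "-", "'"}
# Tokens that also attach to the next token (no space on their right).
NO_SPACE_AFTER = {"-", "'"}


def join_tokens(tokens):
    """Prefix each token with a space unless a spacing rule forbids it."""
    pieces = [
        tok
        if i == 0 or tok in NO_SPACE_BEFORE or tokens[i - 1] in NO_SPACE_AFTER
        else " " + tok
        for i, tok in enumerate(tokens)
    ]
    return "".join(pieces)


print(join_tokens(["Parkinson", "'", "s", "disease", ":", "a", "study", "."]))
# Parkinson's disease: a study.
```

Each token only looks at itself and its predecessor, so the list is processed once and no repeated passes are needed.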

  • Consider adding code examples to better explain your logic. It's often easier and clearer to understand than plain text. – Jakob Sep 24 '22 at 16:52