
In Python, when I use readlines() to read a text file, something that was originally a space shows up as a literal Unicode escape, as shown below, where \u2009 is a space in the original text file.

So, I'm using re.sub() to replace these Unicode literal spaces with a normal space.

My code is as follows:

x = "Significant increases in all the lipoprotein fractions were observed in infected untreated mice compared with normal control mice. Treatment with 100 and 250\u2009mg/kg G. lucidum extract produced significant reduction in serum total cholesterol (TC) and low-density cholesterol (LDL-C) contents compared with 500\u2009mg/kg G. lucidum and CQ."

import re

x = re.sub(r'[\x0b\x0c\x1c\x1d\x1e\x1f\x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000]', " ", x)

Is this correct?

The program seems to work, but I'm not sure, because I don't understand regular expressions well enough.

umläute

2 Answers


re.sub(r"[^\S \t\n\r\f\v]", " ", x)

should do the trick (based on docs.python.org: re — Regular expression operations):

You know that [] is used to indicate a set of characters, and characters that are not within a range can be matched by complementing the set. If the first character of the set is '^', all the characters that are not in the set will be matched.

The regex pattern [^\S \t\n\r\f\v] reads as

  • ^ (U+005E, Circumflex Accent) Not
  • \S (not a whitespace) or
  • (Space) or
  • \t (Character Tabulation) or
  • \n (Line Feed (LF)) or
  • \r (Carriage Return (CR)) or
  • \f (Form Feed (FF)) or
  • \v (Line Tabulation)

Distributing the outer not (i.e., the complementing ^ in the character class) with De Morgan's law, this is equivalent to “whitespace except any of [ \t\n\r\f\v].”
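The equivalence can be checked empirically. The following is a minimal sketch, not part of the original answer: it scans the Basic Multilingual Plane (an arbitrary limit chosen to keep the check quick) and compares the characters matched by the negated class with those matched by \s minus the six excluded characters. The names `by_class`, `by_rule`, and `KEPT` are made up for illustration.

```python
import re

NEGATED_CLASS = re.compile(r"[^\S \t\n\r\f\v]")
ANY_WHITESPACE = re.compile(r"\s")
KEPT = set(" \t\n\r\f\v")  # the characters the pattern deliberately leaves alone

# Characters matched by the negated class from the answer.
by_class = {chr(cp) for cp in range(0x10000) if NEGATED_CLASS.fullmatch(chr(cp))}

# Characters that are whitespace (\s) but not one of the six kept characters.
by_rule = {
    chr(cp)
    for cp in range(0x10000)
    if ANY_WHITESPACE.fullmatch(chr(cp)) and chr(cp) not in KEPT
}

print(by_class == by_rule)
print(sorted(f"U+{ord(c):04X}" for c in by_class))
```

If the two sets compare equal, the pattern matches exactly the whitespace characters other than space, tab, LF, CR, FF, and VT within the scanned range.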

Including both \r and \n in the pattern correctly handles all of Unix (LF), classic Mac OS (CR), and Windows-ish (CR+LF) newline conventions.
The plain space itself is also included in the set, since there is no need to replace a space with a space…
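As a small illustration of this behaviour, the sketch below applies the pattern to a made-up two-line sample with a Windows-style line ending. The helper name `normalize_spaces` and the sample text are assumptions for the example, not something from the question.

```python
import re

def normalize_spaces(text: str) -> str:
    # Replace every whitespace character except space, tab, LF, CR, FF and VT
    # with a plain U+0020 space.
    return re.sub(r"[^\S \t\n\r\f\v]", " ", text)

# U+2009 THIN SPACE and U+00A0 NO-BREAK SPACE, plus CR+LF, a tab and a final LF.
sample = "100 and 250\u2009mg/kg\r\n500\xa0mg/kg\tCQ\n"
result = normalize_spaces(sample)

print(repr(sample))
print(repr(result))
```

The thin space and the no-break space become ordinary spaces, while the line endings and the tab pass through unchanged.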

\s
For Unicode (str) patterns:
Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages)…

This partially applies the following answer: Regex – Match whitespace but not newlines
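On the premise of the question: as far as I can tell, readlines() does not turn a space into an escape sequence. The file most likely already contains U+2009 (THIN SPACE), and the \u2009 text only appears when the string is displayed via repr(), for example when printing the list that readlines() returns. A minimal sketch, assuming a UTF-8 file and using a temporary file as a stand-in for the real one:

```python
import os
import tempfile

# Write a sample line containing U+2009 THIN SPACE to a temporary file.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as f:
    f.write("250\u2009mg/kg\n")
    path = f.name

try:
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
finally:
    os.remove(path)

# Printing the list shows each element's repr(), where the thin space is escaped.
print(lines)

# The string itself still holds the single real character, not a backslash sequence.
has_thin_space = "\u2009" in lines[0]
has_backslash = "\\" in lines[0]
print(has_thin_space, has_backslash)
```

So the replacement operates on a real whitespace character, and either approach to matching it (an explicit list or the negated class) applies.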

General Grievance
JosefZ

quick solution:

x = " ".join(x.split())
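One caveat worth knowing: str.split() with no arguments splits on every run of whitespace, so this approach also collapses repeated spaces, tabs, and newlines into single spaces and drops leading and trailing whitespace. A short sketch with a made-up sample string:

```python
# Thin space, a double space, a newline, a tab, and trailing whitespace.
sample = "250\u2009mg/kg  G. lucidum\nand\tCQ \n"

collapsed = " ".join(sample.split())

print(repr(sample))
print(repr(collapsed))
```

That is fine for a single sentence like the one in the question, but it would flatten a multi-line text into one line.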
umläute