1

I have a TON of data in a .txt file formatted like so...

WOODY, Harlan Fred                 S2c        USN
WOOD, Earl A.                      PVT        USAR
WOOD, Frank                        S2c        USN
WOOD, Harold Baker                 BM2c       USN
WOOD, Horace Van                   S1c        USN
WOOD, Roy Eugene                   F1c        USN
WOOLF, Norman Bragg                CWTP       USN
WORKMAN, Creighton Hale            F1c        USN

I would like to get it into a csv format like so...

WOODY,Harlan Fred,S2c,USN

I see in Python I can use regex and/or split, but I need to preserve the spaces between the first and last names. As you can see, the space count is different between most of the entries and the occasional tab may appear as well (I think).

Layne
  • 642
  • 1
  • 13
  • 32
  • Possible duplicate of [How to strip all whitespace from string](http://stackoverflow.com/questions/3739909/how-to-strip-all-whitespace-from-string) – PseudoAj Nov 11 '16 at 23:11
  • 1
    Are these actual service members names? You probably shouldn't be posting actual names if they are. – kylieCatt Nov 11 '16 at 23:43
  • @PseudoAj note that solutions from that page will remove whitespace from `Harlan Fred`. – vaultah Nov 13 '16 at 05:24

4 Answers4

2

Here's one way to do it - splitting on a comma first and then splitting on multiple spaces to avoid splitting the singly spaced names. Then joining all the items with a comma with str.join:

with open(textfile) as f, open(csvfile, 'w') as fc:
    for line in f:
        first, others = line.split(',')
        row = [first] + [i.strip() for i in others.split('   ') if i]
        fc.write(','.join(row) + '\n')
Moses Koledoye
  • 77,341
  • 8
  • 133
  • 139
1

Use pandas read_csv with a regex seperator. Pandas will be faster than a solution written in pure python.

import pandas as pd
pd.read_csv('./s.dat',header=None, delimiter=r"\s+") 
          0          1       2     3     4
0    WOODY,     Harlan    Fred   S2c   USN
1     WOOD,       Earl      A.   PVT  USAR
2     WOOD,      Frank     S2c   USN   NaN
3     WOOD,     Harold   Baker  BM2c   USN
4     WOOD,     Horace     Van   S1c   USN
5     WOOD,        Roy  Eugene   F1c   USN
6    WOOLF,     Norman   Bragg  CWTP   USN
7  WORKMAN,  Creighton    Hale   F1c   USN
kilojoules
  • 9,768
  • 18
  • 77
  • 149
0
line = 'WOODY, Harlan Fred                 S2c        USN'
print(line[:35].rstrip().split(', ') + line[35:].split())

Output:

['WOODY', 'Harlan Fred', 'S2c', 'USN']
Alex Hall
  • 34,833
  • 5
  • 57
  • 89
0

Another way of doing so would be to replace each double space with a comma, then split on the commas, then stripping the non-null values, finally joining with a comma. Apply the following to each line in the textfile:

','.join([x.strip() for x in line.replace('  ',',').split(',') if x])