
The problem is: I have a data set to clean. I am currently using Python 3.6 as the interpreter in PyCharm (Community Edition) to work on this.

I need to:

  1. Find a line where the word "Code" appears,
  2. join all the following lines into a single line, and
  3. stop when the next line containing "Code" appears.

This would essentially break the data into two fields, namely the code and the details of the company.

The final output needs to be a table in a text file or CSV written from PyCharm itself, and this format is critical.

The following is the input (an extract from the actual text file):

345- Code # 98882 +
"Ms, ABDUL RAFAY & COMPANY, +"
"907, 2nd Floor, tradeway Centre,33, Block-6, PECHS, Karach +"
Ph:345598 1334 558106 +
Mr. Abdul rafay Siddiqui +
347 Code # 96663 +
"Ms. BILAL & BROTHERS Plot No.F-8, Estate #2, Lalazar, Karachi Ph:322575.84 +"
Mr. Mubarak Shahid +
A23 - Code : BO229 +
"Ms. RAHMAN & SONS 303, 3rd Floor, Square One, Dundas street, Karachi P:36268947 +"
"Mr, Saleem Mughal +"
"349- Code # 93369 Ms, ALIAPPAREL +"
"Office No. 491/307, 1st  Floor, Blessings Tower near Tipu Burger , P?:34990456 +"
"Mr, Nasir Wali  +"

The output should look like this:

Code  -  Company details 
345- Code # 98882 + -"Ms, ABDUL RAFAY & COMPANY, +""907, 2nd Floor, tradeway Centre,33, Block-6, PECHS, Karach +"Ph:345598 1334 558106 +Mr. Abdul rafay Siddiqui +
347 Code # 96663 +  - "Ms. BILAL & BROTHERS Plot No.F-8, Estate #2, Lalazar, Karachi Ph:322575.84 +"Mr. Mubarak Shahid +

The key to the data is that the company details are sometimes on one line and sometimes on two or three, so I need a way to iterate over these lines until the next "Code" appears. I had tried this before in R but couldn't come up with anything concrete, except for adding a "+" at the end of each line, which could be stripped off here.

Vamshi
H.Y
  • Possible duplicate of [finding a pattern match and concatenating the rest of lines in python](https://stackoverflow.com/questions/45187059/finding-a-pattern-match-and-concatenating-the-rest-of-lines-in-python) – nv_wu Jul 23 '17 at 11:14
  • @nv_wu - I had to format my question again since the input in my former question wasn't proper, so in that sense it is. I wanted to make the question minimal, verifiable and reproducible. – H.Y Jul 23 '17 at 11:47

1 Answer


All you need to do is iterate through the file, looking for a line that signals the start of a new data block.

This does more or less what you want:

def emit(lines, dest):
    # Write one company's lines as a single output line; skip empty blocks.
    if lines:
        print("".join(lines), file=dest)

company_data = []

with open('details.txt') as data_in, open('fixed_details.txt', 'w') as data_out:
    for line in data_in:
        if "Code " in line:  # start of a new company: output the previous one
            emit(company_data, data_out)
            company_data = []
        company_data.append(line.strip())
    emit(company_data, data_out)  # output the last company

It doesn't do exactly what you want because your sample output sometimes specifies a hyphen between the company code and the rest of the data, and sometimes a hyphen and a space.

345- Code # 98882 + -"Ms, ABDUL RAFAY ...(etc)
347 Code # 96663 +  - "Ms. BILAL & BROTHERS ...(etc)
                     ^ this is the space

In the line starting with 345 there is no space after the hyphen, but in the line starting with 347 there is. There is no corresponding space in your sample input data, so it isn't clear what you want the program to do. I just left the hyphen out. I'll leave sorting that out (and supplying the headings) up to you. You will probably want to change the print() call to distinguish between the first line of data and the rest:

print(lines[0], "-", "".join(lines[1:]), file=dest)

This is the output:

345- Code # 98882 +"Ms, ABDUL RAFAY & COMPANY, +""907, 2nd Floor, tradeway Centre,33, Block-6, PECHS, Karach +"Ph:345598 1334 558106 +Mr. Abdul rafay Siddiqui +
347 Code # 96663 +"Ms. BILAL & BROTHERS Plot No.F-8, Estate #2, Lalazar, Karachi Ph:322575.84 +"Mr. Mubarak Shahid +
A23 - Code : BO229 +"Ms. RAHMAN & SONS 303, 3rd Floor, Square One, Dundas street, Karachi P:36268947 +""Mr, Saleem Mughal +"
"349- Code # 93369 Ms, ALIAPPAREL +""Office No. 491/307, 1st  Floor, Blessings Tower near Tipu Burger , P?:34990456 +""Mr, Nasir Wali  +"
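
If a CSV with a heading row is preferred, the same grouping idea can be sketched with the standard `csv` module. This is only a sketch under assumptions: the names `group_records` and `write_table` and the two column headings are made up for illustration, and it treats the whole first line of each block as the code field, so a line like the last sample record (where the code and the company name share a line) is not split any further.

```python
import csv
import io


def group_records(lines, marker="Code "):
    """Yield (code_line, details) pairs; a line containing `marker` starts a new record."""
    current = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if marker in line and current:
            # A new record begins: hand back the one collected so far.
            yield current[0], "".join(current[1:])
            current = []
        current.append(line)
    if current:
        yield current[0], "".join(current[1:])  # last record


def write_table(lines, dest):
    """Write a heading row plus one row per record to an open file-like object."""
    writer = csv.writer(dest)
    writer.writerow(["Code", "Company details"])  # hypothetical headings
    for code, details in group_records(lines):
        writer.writerow([code, details])


# Small in-memory demonstration using a few lines from the sample input.
sample = [
    "345- Code # 98882 +\n",
    "Ph:345598 1334 558106 +\n",
    "Mr. Abdul rafay Siddiqui +\n",
    "347 Code # 96663 +\n",
    "Mr. Mubarak Shahid +\n",
]
records = list(group_records(sample))
buffer = io.StringIO()
write_table(sample, buffer)
print(buffer.getvalue())
```

To write a real file instead of the in-memory buffer, pass a file opened with `open('fixed_details.csv', 'w', newline='')` as `dest`; the `newline=''` argument is what the `csv` module expects so that it controls the line endings itself.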
BoarGules
  • @BoarGules - The hyphen in the output just depicts field separation. In the input it is sometimes there and sometimes not. I will try it as well. – H.Y Jul 23 '17 at 13:06
  • I added an alternative `print()` call that puts the hyphen in, always preceded and followed by a space. – BoarGules Jul 23 '17 at 13:10
  • @BoarGules - I've tried the code. I can't seem to find the file it wrote, neither in the working directory nor elsewhere. Just to confirm: 'dest' is the destination where the file needs to be written (I'm putting the PyCharm directory there)? Yes, I added the ".txt", but it's still not there, though it is reading input.txt. – H.Y Jul 23 '17 at 13:37
  • You may be looking for a filename ending in `.txt`. I omitted the extension by accident. I'm also using PyCharm so I know the output ends up in the same folder as the input. – BoarGules Jul 23 '17 at 13:38
  • @BoarGules - It's reading the input file as data_in, but I think it still doesn't write anywhere. I changed the dest and put the current working directory in there too. Is that correct? – H.Y Jul 23 '17 at 13:59
  • It definitely does write the file. I did test it, you know. Just temporarily, remove `, file=dest` from the `print()` call. That will send the output to the screen so that you can convince yourself the program produces output. Next, also temporarily, put the full path to where you want the output in the `open()` call. For example, `open(r'c:\users\foo\bar\fixed_details.txt','w')`. – BoarGules Jul 23 '17 at 14:04
  • @BoarGules - I had to open another project and do everything again, and it works out just fine! :) – H.Y Jul 23 '17 at 18:49