0

I have an issue where extracting data from database it sometimes (quite often) adds spaces in between strings of texts that should not be there.

What I'm trying to do is create a small script that will look at these strings and remove the spaces.

The problem is that the spaces can be in any position in the string, and the string is a variable that changes.

Example:

"StaffID": "0000 25" <- The space in the number should not be there. Is there a way to have the script look at this particular line, and if it finds spaces, to remove them.
Or:"DateOfBirth": "23-10-199 0" <-It would also need to look at these spaces and remove them.

The problem is that the same data also has lines such as:
"Address": " 91 Broad street" <- The spaces should be here obviously.

I've tried using TRIM, but that only removes spaces from start/end.

Worth mentioning that the data extracted is in json format and is then imported using API into the new system.

mhu
  • 17,720
  • 10
  • 62
  • 93
B. Baxter
  • 133
  • 1
  • 10
  • for the props that you KNOW you want to remove all spaces, you can use `-replace '\s{1,}'` to replace 1-or-more spaces with nothing. – Lee_Dailey Jun 07 '19 at 15:46
  • Possible duplicate of [Removing more than one white space in powershell](https://stackoverflow.com/questions/8686816/removing-more-than-one-white-space-in-powershell) – mhu Jun 07 '19 at 17:29
  • 2
    This looks like JSON. A good question, I think, to ask is - where is this 'bad' data coming from? Is it in the original database like this `23-10-199 0`? If so, you've got data issues that are VERY complicated to overcome. This may be a "fix your data" answer first. Otherwise, is the extract process causing the problem? – gravity Jun 07 '19 at 17:29
  • It is a JSON. I've gone over the database and its recorder properly in the database i.e. no spaces. Obviously something to do with how its extracted and then formatted, but I cant seem to find the issue, so thought it might be easier to do a script to remove the spaces. Looks like I was wrong.. – B. Baxter Jun 08 '19 at 09:54

1 Answers1

2

You should think about the logic of what you want to do, and whether or not it's programmatically possible to determine if you can teach your script where it is or is not appropriate to put spaces. As it is, this is one of the biggest problems facing AI research right now, so unfortunately you're probably going to have to do this by hand.

If it were me, I'd specify the kind of data format that I expect from each column, and try my best to attempt to parse those strings. For example, if you know that StaffID doesn't contain spaces, you can have a rule that just deletes them:

$staffid = $staffid.replace("\s+",'')

There are some more complicated things that you can do with forced formatting (.replace) that have already been covered in this answer, but again, that requires some expectation of exactly what data is going to come out of what column.

You might want to look more closely at where those spaces are coming from, rather than process the output like this. Is the retrieval script doing it? Maybe you can optimize the database that you're drawing from?

ncfx1099
  • 367
  • 1
  • 11