-4

I am working on a PDF file conversion. I need to replace the original amount values with other values for safety purposes. When I parse the PDF into a plain text file, all the content ends up on a single line, so during parsing I append a marker, "~~", at every line end (\n). The problem is that "~~" can now appear anywhere in the page content after parsing. So I need a regex that runs on this content but ignores the "~~" marker.

E.g., the string "12" might appear as any of the following:
12
12~~
1~~2
1~~2~~
~~12
~~12~~
~~1~~2
~~1~~2~~

These are just permutations of the string with this marker. So I want a regex that matches the string "12" regardless of the permutation, or that simply ignores the "~~" marker.

I want to know how to ignore that character, not remove it.

Hi people, what I gave above was only an example, using the string "12" to illustrate the situation. The answer below certainly fulfils the requirement for that example, but that's not what I actually meant. The content varies for every PDF, and the content of each PDF is huge. Even if I knew the content of the whole PDF, imagine how many places I would have to insert (?:~~)? into!

Legendary Genius
  • Are you asking for help or outsourcing work to us? – anubhava Jul 05 '13 at 05:35
  • how are you extracting the data from pdf – Anirudha Jul 05 '13 at 05:36
  • Hi @Anirudh: When a PDF file is parsed, the spaces, newlines and other non-printable characters are lost in the converted text. So I mark them with the symbol "~~". If I used \n instead, even the spaces between words would turn into new lines when the text is finally compiled back into a PDF. – Legendary Genius Jul 05 '13 at 05:37
  • [how to read pdf file using java](http://stackoverflow.com/questions/4784825/how-to-read-pdf-files-using-java) – Anirudha Jul 05 '13 at 05:40
  • [converting pdf to text code](http://www.prasannatech.net/2009/01/convert-pdf-text-parser-java-api-pdfbox.html) – Anirudha Jul 05 '13 at 05:41
  • Hi people, what I gave above was only an example, using the string "12" to illustrate the situation. The answer below certainly fulfils the requirement for that example, but that's not what I actually meant. The content varies for every PDF, and the content of each PDF is huge. Even if I knew the content of the whole PDF, imagine how many places I would have to insert (?:~~)? into! – Legendary Genius Jul 05 '13 at 05:42
  • @LegendaryGenius What library are you using to extract text from the PDF? Have you looked at the links above? Use the library recommended there. – Anirudha Jul 05 '13 at 05:46
  • In that case, please show the regex you would have used if you didn't have to account for `~~`s, and then we can tell you how to fix that. – Tim Pietzcker Jul 05 '13 at 05:46
  • That's exactly the library we are using, "PDF Box". It parses the content as described, but the problem is that its output is a single paragraph. I need to preserve the structure so that I can compile the text back into a PDF at the end. So that's the problem. – Legendary Genius Jul 05 '13 at 05:50

2 Answers

0
(?:~~)?1(?:~~)?2(?:~~)?

matches all your example strings. Is that what you meant?

Explanation:

  • (?:~~) combines two tildes into a single (non-capturing) group.
  • ? makes that group optional.
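
As a quick check, here is a minimal Python sketch that runs the pattern above against the eight example strings from the question. Python is only an illustration choice here (the thread does not name a language for the regex), and the variable names are made up for the example.

```python
import re

# Pattern from the answer: an optional "~~" before, between, and after the digits.
pattern = re.compile(r"(?:~~)?1(?:~~)?2(?:~~)?")

# The eight example strings listed in the question.
examples = ["12", "12~~", "1~~2", "1~~2~~", "~~12", "~~12~~", "~~1~~2", "~~1~~2~~"]

# fullmatch succeeds only if the pattern consumes the whole string.
results = {s: bool(pattern.fullmatch(s)) for s in examples}

for s, ok in results.items():
    print(f"{s!r}: {ok}")
```

Every example should report True, while a string such as "1~2" (a single tilde) does not match, because the group only accepts the two-character marker.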
Tim Pietzcker
  • Thanks, but the string "12" was only an example, not the whole point. Please read the comments under the question, where I have explained the situation in more depth. – Legendary Genius Jul 05 '13 at 05:45
  • @LegendaryGenius: It's not a good idea to dumb down a problem if the dumbed-down version doesn't actually reflect your problem accurately any more. Also, if you need to clarify your question, please do so in the question itself (using the [edit](http://stackoverflow.com/posts/17481546/edit) link), not in the comments where not everyone sees that. – Tim Pietzcker Jul 05 '13 at 05:49
  • Thanks for the tips :) I am new to this site. Anyway, now that you've got the point of my question, do you have any other answer? Is there a way to make the regex engine ignore a specific character? – Legendary Genius Jul 05 '13 at 06:18
  • @LegendaryGenius: You need to show the regex you're currently using (which is not handling `~~`s yet), then I can try to figure out how to make it ignore `~~`. Otherwise, my next guess will be as useless to you as the first one :) – Tim Pietzcker Jul 05 '13 at 06:27
  • I am currently using only the approach from the first answer. What I want to do is, for example, match a year in a tax PDF, which usually has the year 1098 or 1099. So I wrote (?:~~)?1(?:~~)?0(?:~~)?9(?:~~)?9(?:~~)? just to match the permutations of this one year, "1099". But I need help making the engine ignore this marker during matching. At the very least, I need to know whether it is possible to make the engine ignore something like this or not. – Legendary Genius Jul 05 '13 at 06:35
  • I saw the [link](http://stackoverflow.com/questions/14513595/how-do-i-write-a-regular-expression-that-will-match-characters-in-any-order) where they show a way to match the string "act" in any order, so I was wondering whether there could be a way to permute "~~" and "1099" in the same manner. Is that possible, Tim? – Legendary Genius Jul 05 '13 at 06:37
  • Ah, so you want the match result to be `1099` as if there weren't any tildes in your input? That's not possible with a regex. First, use the regex you have that matches with the tildes, then remove all `~~`s from the string in a second step. – Tim Pietzcker Jul 05 '13 at 06:37
  • Hmm :( OK Tim, I understand what you are saying, because that is exactly what I am doing now. Again, matching the year was just one example. I have to do this for account numbers of 12 to 16 characters as well, and so on. So I was hoping to find a solution like the one on the [Matching char in any order](http://stackoverflow.com/questions/14513595/how-do-i-write-a-regular-expression-that-will-match-characters-in-any-order) page, because writing (?:~~)? between every character reduces the readability of the code. Did you check this link? Would it help with the idea of permuting my "1099" and "~~"? – Legendary Genius Jul 05 '13 at 06:46
  • @LegendaryGenius: Well, that link's solution would also make your regex match `9019`, and you don't want that, do you? There really is no other way. – Tim Pietzcker Jul 05 '13 at 06:54
  • Hmm, yes it will, and that's why I posted this, hoping there might be some possibility using lookarounds. So there really is no way other than inserting (?:~~)? in all the possible places? I guess I have to live with it :( – Legendary Genius Jul 05 '13 at 07:09
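
The two-step approach discussed in the comments above (match with the markers allowed, then strip them from the result) can be sketched as follows. This is a minimal Python sketch, not code from the thread: `marker_tolerant` and `find_clean` are hypothetical helper names, and generating the pattern from the plain literal is just one way to avoid hand-writing (?:~~)? between every character.

```python
import re

MARKER = "~~"  # the line-end marker described in the question


def marker_tolerant(literal, marker=MARKER):
    """Hypothetical helper: compile a pattern that matches `literal` with an
    optional marker before, between, and after its characters."""
    optional = f"(?:{re.escape(marker)})?"
    body = optional.join(re.escape(ch) for ch in literal)
    return re.compile(optional + body + optional)


def find_clean(literal, text, marker=MARKER):
    """Step 1: find matches with markers allowed.
    Step 2: strip the markers from each matched substring."""
    pattern = marker_tolerant(literal, marker)
    return [m.group(0).replace(marker, "") for m in pattern.finditer(text)]


# Made-up sample text for illustration only.
text = "Form ~~10~~99 for account 1~~2"
print(find_clean("1099", text))  # ['1099']
```

Unlike the any-order approach linked in the comments, the generated pattern keeps the characters in sequence, so `marker_tolerant("1099")` does not match `9019`.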
0

(~*)? matches all your example strings like 12 12~~ 1~~2 1~~2~~ ~~12 ~~12~~ ~~1~~2 ~~~1~~2~~ ab ab~~ a~~b ~~ ~.(a~9

Prashant Dubey