-2

A client sometimes sends CSV data in a .txt file and at other times sends plain text made up of paragraphs and sentences. How can I determine in a C# program whether the file sent by the client is CSV or plain text?

CSV Example:

1,33838,20181217,GTR,5,1587,S,"STT PPP USA, SA.",N,2,58.00,3,58.00
2,1,0,0,LHG,1000000007,,6,0,1000000006,
2,2,0,0,LHG,1000000007,,6,0,1000000003,

Plain Text Example:

ASCII Converter enables you to easily convert ASCII characters to their hex, decimal, and binary representations. In addition, base64 encode/decode binary data. As you type in one of the text boxes above, the other boxes are converted on the fly.

The ASCII converter doesn't automatically add spaces between the converted values. You can use the add spaces button to separate the ASCII characters so that the converted values will also be separated from one another.

vs97
  • 1
    Why not agree to have one standard file convention so there are no surprises? – Ňɏssa Pøngjǣrdenlarp Jul 05 '19 at 22:13
  • 1
    So, if you received a txt file containing something like `Hello, Levi \n Hopefully, you're feeling better today!`, would that be plain text or CSV? Please advise your client to stick to the standards and use the CSV file format when sending CSV data. – 41686d6564 stands w. Palestine Jul 05 '19 at 22:19
  • The best answer would be using AI. – nAviD Jul 05 '19 at 22:24
  • 1
    By far the best answer is to avoid any situation where two entirely different forms of input can be mixed up. You can have different types of data go to different endpoints. There are different file extensions. Don't solve the problem - make it go away. If there's nothing else you can do, you could try to process it as a CSV. Plain text will never accidentally resemble consistent, readable rows of CSV data. If that fails, try it as text. That's still not great because now a malformed CSV could get processed as plain text. – Scott Hannen Jul 05 '19 at 22:25
  • 1
    Assuming you have no choice, the most straightforward solution I can think of is to attempt to import the data as a CSV, validating that there are the same number of columns in every line and that each column is in the expected format (date, text, number, etc). If validation fails, assume it is in paragraph format instead. – John Wu Jul 05 '19 at 22:46
  • Unfortunately, this question (determine the file format) is too broad, as there are many ways/approaches to do it, and many different assumptions can be made (and likely that none of them are 100% reliable). Interesting question, but... off-topic for Stack Overflow. And really, this is a file-naming issue with completely different types of file content. Seems like you're solving the wrong problem. – David Makogon Jul 05 '19 at 23:06
  • You really need to reach an agreement. For any text file (including CSV) you have to read it with the same character encoding that they wrote it with. And, to consume CSV, you have to agree on the field separator, the line terminator, the quoting format and escaping mechanism, the number of header rows, the types of the columns, and maybe even the decimal character. (Did I miss [anything](https://chriswarrick.com/blog/2017/04/07/csv-is-not-a-standard/)?) Text files are for experts. CSV even more so. Consider more self-describing formats such as .ods or .xlsx; easy enough to read in C#. – Tom Blodget Jul 07 '19 at 15:13

3 Answers

1

You cannot tell which format it is without parsing part of the file.

You must parse at least the first two lines.

If they yield the same number of columns when split on the "," separator, you can assume it is a CSV file.
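A minimal sketch of that check, assuming a comma delimiter and double-quote quoting. The names `CsvSniffer`, `CountFields`, `LooksLikeCsv`, `sampleSize` and `minFields` are made up for illustration; they are not from any library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CsvSniffer
{
    // Counts the fields on one line, treating delimiters inside double quotes as data.
    public static int CountFields(string line, char delimiter = ',')
    {
        int fields = 1;
        bool inQuotes = false;
        foreach (char c in line)
        {
            if (c == '"') inQuotes = !inQuotes;   // an escaped "" toggles twice, so state is kept
            else if (c == delimiter && !inQuotes) fields++;
        }
        return fields;
    }

    // Assumed heuristic: the first `sampleSize` non-empty lines must all have the
    // same field count, and that count must be at least `minFields`.
    public static bool LooksLikeCsv(IEnumerable<string> lines, int sampleSize = 10, int minFields = 2)
    {
        List<int> counts = lines
            .Where(l => !string.IsNullOrWhiteSpace(l))
            .Take(sampleSize)
            .Select(l => CountFields(l))
            .ToList();

        return counts.Count >= 2
            && counts[0] >= minFields
            && counts.All(n => n == counts[0]);
    }
}
```

It could be called as `CsvSniffer.LooksLikeCsv(File.ReadLines(path))`, where `path` is whatever file you received. Be aware that the sample in the question would fail this strict test: its first line has 13 fields and the next two have 11, so for that particular layout the equal-count rule would have to be relaxed.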

0

In this specific case I would look for paragraphs in the text. If there is at least one, it means it's plain text, otherwise it's CSV.
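One possible way to express "looks like a paragraph" in code is sketched below. The definition used here (a line with at least `minWords` space-separated words that ends in sentence punctuation) is purely an assumption, as are the names `ParagraphSniffer`, `IsParagraphLike` and `LooksLikePlainText`. Because it works line by line, it does not depend on paragraphs being separated by blank lines.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ParagraphSniffer
{
    // Assumed rule: a paragraph-like line has at least `minWords` words
    // (split on spaces and tabs) and ends with '.', '!' or '?'.
    public static bool IsParagraphLike(string line, int minWords = 8)
    {
        string trimmed = line.Trim();
        if (trimmed.Length == 0) return false;

        int words = trimmed
            .Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries)
            .Length;
        char last = trimmed[trimmed.Length - 1];

        return words >= minWords && (last == '.' || last == '!' || last == '?');
    }

    // Following the answer: one paragraph-like line is enough to call it plain text.
    public static bool LooksLikePlainText(IEnumerable<string> lines)
    {
        return lines.Any(l => IsParagraphLike(l));
    }
}
```

The word threshold is arbitrary and would need tuning against real files from the client; short plain-text files with terse lines would slip through as "CSV".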

  • Doesn't work if paragraphs are separated by a single newline. In any case: This is a file-naming issue. Not a guess-the-format issue (and a guess-the-format question is far too broad for Stack Overflow anyway). – David Makogon Jul 05 '19 at 23:03
0

You have a few options. Starting with the client:

  • Agree on a different encoding scheme, or switch to the .csv extension (see "Effective way to find any file's Encoding")
  • Agree that the first line will contain the column names
  • Agree on a different delimiter, so you can check how frequently that character occurs

If none of those are options, you can look for patterns in the file and build an algorithm around them. For the example given, you could check comma frequency, or check for specific column types such as numeric values in specific columns.
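As a rough illustration of the comma-frequency idea, the sketch below calls a line "comma-dense" when it has more commas than spaces, and calls the file CSV when most non-empty lines are dense. Both rules, the 0.8 `threshold`, and the names `CommaFrequency`, `IsCommaDense` and `LooksLikeCsv` are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CommaFrequency
{
    // Assumed rule: prose has far more spaces than commas, CSV rows the opposite.
    public static bool IsCommaDense(string line)
    {
        int commas = line.Count(c => c == ',');
        int spaces = line.Count(c => c == ' ');
        return commas > spaces;
    }

    // The file is treated as CSV when the share of comma-dense lines
    // among non-empty lines reaches `threshold`.
    public static bool LooksLikeCsv(IEnumerable<string> lines, double threshold = 0.8)
    {
        List<string> nonEmpty = lines.Where(l => !string.IsNullOrWhiteSpace(l)).ToList();
        if (nonEmpty.Count == 0) return false;

        double denseShare = (double)nonEmpty.Count(l => IsCommaDense(l)) / nonEmpty.Count;
        return denseShare >= threshold;
    }
}
```

Unlike a strict column-count comparison, this does not care whether every row has the same number of fields, which matters for the sample in the question. It would misfire on CSV whose fields are mostly free text with many spaces.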

Felipe Ramos
  • Different delimiter? If it's a CSV, it's comma-separated. Why all this ceremony to correct a file-naming issue? There's no need to agree to any of this, because CSV is already well-known. There's no need for any such processing. Especially as it can never be 100% accurate. – David Makogon Jul 05 '19 at 23:02
  • 1
    Let's see, oh: "In addition, the term "CSV" also denotes some closely related delimiter-separated formats that use different field delimiters, for example, semicolons. These include tab-separated values and space-separated values. A delimiter that is not present in the field data (such as tab) keeps the format parsing simple". Second, why are you so salty? The guy just created an account and he has a difficult client. Don't like the question or answer? Downvote! Why you mad though? – Felipe Ramos Jul 06 '19 at 03:00