Introduction

Purpose of This Presentation

This presentation demonstrates reading text files using the C programming language and illustrates some of the differences among different compilers and computer platforms. It was prepared as a project for an advanced C class at Valencia Community College in Orlando, Florida, and assumes that the reader has a basic understanding of C.

What is a Text File?

An oversimplified answer is:

  • A text file is a disk file that contains only printable characters.

By "printable characters," we normally mean characters that we can type directly from the keyboard (letters, numbers, punctuation symbols, etc.).

This answer is not complete, though. A better answer is:

  • A text file is a disk file that contains only characters with ASCII values between decimal 32 (hex 20) and decimal 126 (hex 7e) inclusive. It may also contain certain characters less than decimal 32.

The ASCII range 0x20 through 0x7e encompasses the "printable characters" as defined above. Characters less than 0x20 provide formatting control (for example, advancing to the next line). Decimal 127 (0x7f) is the DEL control character and is not printable.

The second answer above probably covers 99% of the text files you'll see. Some text files contain characters greater than decimal 127 (a.k.a. "8-bit text files"), but we won't deal with them here.
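
As a rough illustration of the second definition, the sketch below tests whether a buffer of bytes qualifies as text. The function names are hypothetical, and the set of allowed control characters (tab, line feed, form feed, and carriage return) is an assumption, since the definition only says "certain characters."

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if c is a printable ASCII character (0x20 through 0x7e) or
   one of the formatting control characters this sketch chooses to allow.
   Assumption: the allowed controls are tab, LF, form feed, and CR. */
int is_text_byte(unsigned char c)
{
    if (c >= 0x20 && c <= 0x7e)
        return 1;
    return c == 0x09 || c == 0x0a || c == 0x0c || c == 0x0d;
}

/* Returns 1 if every byte in buf passes is_text_byte(), 0 otherwise.
   An empty buffer counts as text. */
int is_text_buffer(const unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if (!is_text_byte(buf[i]))
            return 0;
    }
    return 1;
}
```

A caller would typically read a file's bytes into a buffer and pass them to is_text_buffer(). Bytes above decimal 127 are rejected here, in keeping with the decision not to deal with 8-bit text files.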

What are Text Files Used For?

Among other uses, the following can all be saved as text files on disk:

  • Source code for programs (in C or any other language).
  • Documentation to be printed or viewed on the screen (e.g., the "readme.txt" file included with many software programs).
  • Data (particularly as used to exchange data between different software programs, computer systems, and organizations).

Text Files Used in this Presentation

The text files used here are examples of delimited data files and contain the following lines:

        "Scott","Chicago",39
        "Amy","Nokomis",74
        "Ray","Mt Olive",78

This is a data file containing three fields (name, city of birth, and age) and three records (each on a separate line in the file). There are other delimited data formats, but this is perhaps the most common.

Note that each line (record) in a delimited data file can have a different length, based on the actual data it contains.
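
To make the record layout concrete, here is a minimal sketch that splits one such line into its three fields with sscanf. The struct person type, the parse_record name, and the 31-character field limit are assumptions made for illustration and are not part of the sample files.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record type: field sizes are an arbitrary choice. */
struct person {
    char name[32];
    char city[32];
    int  age;
};

/* Parses a line of the form "name","city",age into *out.
   Returns 1 if all three fields were read, 0 otherwise. */
int parse_record(const char *line, struct person *out)
{
    assert(line != NULL && out != NULL);
    /* %31[^"] reads up to 31 characters that are not a double quote. */
    return sscanf(line, "\"%31[^\"]\",\"%31[^\"]\",%d",
                  out->name, out->city, &out->age) == 3;
}
```

Because the quoted fields are read up to the closing quote, a value containing a space such as "Mt Olive" is kept whole. This sketch does not handle empty fields or quotes embedded in a field.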

Ten separate text files were used in this demonstration. They all contain the data shown above, but have different file formats.

Text File Formats

The file format of a text file is distinct from the data it contains and relates to the values of the individual characters that make up the file. It has two components:

The "Character Set":

We'll be considering the ASCII character set, which is commonly used on microcomputers (PCs, Unix boxes, and Macs). Two other important character sets, not addressed here, are EBCDIC (often used on minis and mainframes) and Unicode (being popularized by Java).

Line Termination:

In the previous section, we saw that each data record in a delimited text file appears on a separate line. Since the records can have different lengths, how do we know where one line ends and the next begins? The answer is the line terminator.

Text File Line Termination

Each line in a text file ends with a line terminator. In a delimited data file, this signals the end of the current record. So, the delimited text file described above actually contains:

  "Scott","Chicago",39<Line Terminator>
  "Amy","Nokomis",74<Line Terminator>
  "Ray","Mt Olive",78<Line Terminator>

You won't see the line terminators on your screen or a printout, but they're there. They signal the software (whether it's a text editor, a print formatter, or a text file reader that you wrote) that it's time to move down to the next line.

On computers that use the ASCII character set, the line terminator is usually one of (or a combination of) the following characters:

  Decimal   Hex     Character is
   Value   Value    called a...    Abbreviated
  -------  -----  ---------------  -----------
    10      0a    Line Feed            lf
    13      0d    Carriage Return      cr

As it turns out, each of the three major microcomputer platforms (DOS/Windows, Unix, and Macintosh) USES DIFFERENT CHARACTERS TO INDICATE THE END OF A LINE, as described below:

On DOS and Windows Computers:

The line terminator is a CARRIAGE-RETURN / LINE-FEED pair. The sample text files in DOS format contain:
        "Scott","Chicago",39<CR><LF>
        "Amy","Nokomis",74<CR><LF>
        "Ray","Mt Olive",78<CR><LF>

On Unix Computers:

The line terminator is a single LINE-FEED. The sample text files in Unix format contain:
        "Scott","Chicago",39<LF>
        "Amy","Nokomis",74<LF>
        "Ray","Mt Olive",78<LF>

On Macintosh Computers:

The line terminator is a single CARRIAGE-RETURN. This applies to the classic Mac OS (through Mac OS 9); Mac OS X and later use the Unix LINE-FEED. The sample text files in Macintosh format contain:
        "Scott","Chicago",39<CR>
        "Amy","Nokomis",74<CR>
        "Ray","Mt Olive",78<CR>

Line termination normally isn't an issue if you're reading a text file that was created on the same platform that you're using. You may run into problems, though, if the text file came from a different platform; for example, you're on a PC and are attempting to read a text file created on a Macintosh.
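
One way to find out which convention a file follows is to count the terminators in its raw bytes. The sketch below is an illustration only, not the method used later in this presentation; the eol_style enumeration and the detect_eol name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define CR 0x0d  /* carriage return */
#define LF 0x0a  /* line feed */

enum eol_style { EOL_NONE, EOL_DOS, EOL_UNIX, EOL_MAC, EOL_MIXED };

/* Classifies the line terminators found in buf.
   A CR immediately followed by LF counts as one DOS terminator. */
enum eol_style detect_eol(const char *buf, size_t len)
{
    size_t i, crlf = 0, lf = 0, cr = 0;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if (buf[i] == CR) {
            if (i + 1 < len && buf[i + 1] == LF) {
                crlf++;
                i++;            /* skip the LF half of the pair */
            } else {
                cr++;
            }
        } else if (buf[i] == LF) {
            lf++;
        }
    }
    if (crlf + lf + cr == 0) return EOL_NONE;
    if (lf == 0 && cr == 0)  return EOL_DOS;
    if (crlf == 0 && cr == 0) return EOL_UNIX;
    if (crlf == 0 && lf == 0) return EOL_MAC;
    return EOL_MIXED;
}
```

For this to see the bytes as they are stored on disk, the file should be opened in binary mode (for example, fopen with "rb"), since text mode may translate line terminators on some platforms.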

The newline Character

If you've programmed in C, you're probably familiar with the newline character as the "\n" that you insert into output streams (for example, printf("\n"); to produce a blank line on the screen).

The newline character in the C language is not necessarily the same as the text file line terminator. For the C compilers tested in this presentation:

  newline character == LINE-FEED (0x0a)

This is true of most, if not all, C compilers on ASCII-based platforms. Note that C's newline character is the same as the Unix line terminator. This should not be surprising, since C was developed alongside Unix and was used to rewrite the Unix operating system.

Unless you're on a Unix computer, the newline character in the C language will not be the same as the text file line terminator.
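
The comparison can be sketched in a few lines. The newline_matches function below is a hypothetical helper that checks whether a given terminator, passed as a string of raw bytes, consists of exactly C's newline character. The results described afterward assume a compiler where the newline character is LINE-FEED (0x0a).

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if terminator is exactly one character long and that
   character equals C's newline character, 0 otherwise. */
int newline_matches(const char *terminator)
{
    assert(terminator != NULL);
    return terminator[0] == '\n' && terminator[1] == '\0';
}
```

Under that assumption, the Unix terminator (a single 0x0a) matches, while the DOS pair (0x0d 0x0a) and the Macintosh terminator (a single 0x0d) do not.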