The Northeast Blackout: Causes and Cures

Douglas Wilson | Sep 25, 2003

More than a month has passed, and the investigation into how and why the blackout happened is still in its early stages. Nonetheless, a few pieces of the jigsaw are in place, and we can start to consider underlying causes that might have led to this situation. Of course, there is still a great deal of work to be done in piecing together all the relevant events, and understanding the links between them.

Because of the extent of the work remaining, looking at causes of the blackout involves some conjecture at the moment. The whole detailed picture of what happened will take some time to emerge, and after that there could be a great many interpretations of the detail. The need for solutions to avoid future occurrences, however, is urgent. The process of looking for ways forward can start before every detail of the event is filled out.

It is most likely that the cause of the blackout will turn out to involve complex interactions between

  • chance occurrences and coincidences
  • random events made more likely by conditions in the electrical system and the environment
  • cascading chains of events
  • human interventions

There appear to be two distinct periods in the time leading up to the blackout. Up to about 3.40pm,1 there were a number of seemingly unconnected events in the Midwest. Separated in time and geography, these events conspired to create a situation with a high risk of a large-scale cascading outage. Then, from about 3.40pm onwards, incidents occurred with increasing frequency, locally at first, eventually cascading to create the biggest blackout in North American history.

Two questions are important. Firstly, how were the circumstances set up so that an event on such a scale could happen? Secondly, why did the cascade propagate through the network so effectively?

Unfortunate Chance, or a Weakness Exposed?
Considering the events between around noon and the “watershed” around 3.40pm, we see isolated events occurring throughout Indiana, Ohio and Michigan. Some of these were mini-cascades, others apparently isolated occurrences. All were contained, causing no widespread outages. Among these events were:

  • transmission lines tripping in Indiana, at least one case tripping from high voltage to lower voltage levels in a mini-cascade
  • two generating units in Ohio tripping
  • one generating unit in Michigan tripping
  • brush fires under an Ohio transmission line causing the line to trip
  • other lines tripping in Ohio

    Did all these events really happen independently? If so, the blackout was a very unfortunate accident! The probability of coincidence of these events, or of other events leading to a similar likelihood of collapse, is extremely small. The system is designed, built and operated so that such coincidence is vanishingly improbable.

    However, the other possibility is that these events are not all entirely independent. By its nature, the transmission network interconnects a very large number of components, and each component has an influence on the others. In an unstressed system, one component failure (for example, a line tripping or a generator outage) has little influence on other components, but in a system that is highly stressed, failure of one component has a much greater influence on the rest of the system. Thus, in a stressed system, failure of one component can increase the likelihood of other subsequent failures. So the evolution of the high-risk condition arising around 3.40pm may not have been as unlikely as traditional reliability analyses would suggest, because of some form of underlying stress in the system.
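The effect of such dependence on outage probability can be illustrated with a toy Monte Carlo sketch (all numbers are invented for illustration and are not calibrated to any real system): each failure raises the failure probability of the remaining components, as in a stressed grid where one outage loads up its neighbours.

```python
import random

def p_multi_failure(n_components=5, base_p=0.01, stress_boost=10.0,
                    trials=100_000, dependent=True, seed=1):
    """Estimate the probability that at least 3 of n components fail.

    If dependent=True, each failure multiplies the remaining
    components' failure probability by stress_boost, mimicking a
    stressed system. Illustrative numbers only.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        p = base_p
        failures = 0
        for _ in range(n_components):
            if rng.random() < p:
                failures += 1
                if dependent:
                    p = min(1.0, p * stress_boost)
        if failures >= 3:
            hits += 1
    return hits / trials

independent = p_multi_failure(dependent=False)
stressed = p_multi_failure(dependent=True)
print(f"independent: {independent:.5f}  stressed: {stressed:.5f}")
```

With these made-up numbers, the multi-failure condition is orders of magnitude more likely in the dependent case; that is the sense in which the pre-3.40pm "coincidences" may have been far less improbable than traditional independence-based analysis would suggest.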

    The system may have been stressed by one or more factors, including:

    • high loading of transmission and generation facilities
    • depressed voltages
    • high temperature and humidity
    • dynamic interactions between interconnected generators and other machines

    Certainly, there were some heavily loaded lines and some unusual power flows in the Midwest. With the Davis-Besse nuclear plant out of action, Ohio generation and voltage support reserves were not high for summer demand. Loading on West-East lines was relatively high.

    Depressed voltages are a stress to the system because they cause generators to work at the limits of their capability to support voltage. But depressed voltage tends to be a local issue. Voltage was clearly low in areas of Indiana from around noon, but it is not clear that Ohio or Michigan had low voltage until well into the afternoon, after certain lines tripped between 3pm and 3.40pm.
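The danger of depressed voltage can be made concrete with the textbook two-bus voltage-stability model: a source E behind reactance X feeding a load P + jQ over a lossless line (parameters here are illustrative, not drawn from the actual grid). As loading grows, the two voltage solutions converge at the "nose" of the curve, beyond which no operating point exists:

```python
import math

# For a lossless line, the receiving-end voltage V satisfies
#   V^4 + (2*Q*X - E^2)*V^2 + X^2*(P^2 + Q^2) = 0,
# which loses real solutions past the nose (maximum loadability).

def receiving_voltage(p, q, e=1.0, x=0.5):
    """Return (V_high, V_low) in per unit, or None past the nose."""
    b = 2 * q * x - e * e
    disc = b * b - 4 * x * x * (p * p + q * q)
    if disc < 0:
        return None  # no equilibrium: voltage collapse region
    v2_hi = (-b + math.sqrt(disc)) / 2
    v2_lo = (-b - math.sqrt(disc)) / 2
    return math.sqrt(v2_hi), math.sqrt(v2_lo)

# Sweep load at unity power factor to locate the nose numerically.
p = 0.0
while receiving_voltage(p + 0.001, 0.0) is not None:
    p += 0.001
print(f"maximum deliverable power ~ {p:.3f} pu")  # analytic: E^2/(2X)
```

Operating with depressed voltage means operating on the upper branch but closer to the nose, where small disturbances can remove the equilibrium entirely.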

    Temperature and humidity may have contributed to system stress, with increasing air conditioning loads and reducing cooling capability of some generation. But there were no record temperatures that day, nor was the load especially high for that time of year.

    The possibility of dynamic interactions between generators interconnected by the grid is of interest. These interactions can occur over a wide region. Since incidents are rarely traced to this issue, it is often overlooked, particularly in regions like the Northeast where there is no history of known problems. But oscillations between interconnected machines are most likely to cause problems when parts of a system are stressed by loading and voltage issues, power flow patterns are unusual, and bottlenecks exist in the grid. All of these conditions were present, increasingly so as the afternoon progressed. The possibility of system stress induced by this mechanism should not be ignored.

    The way in which the events before the 3.40pm watershed were interrelated is not yet known. But it is most likely that these seemingly isolated events were related, through some form of stress to the grid.

    The Cascade Proper
    The stage was set by around 3.40pm for a large-scale blackout. Then several lines at sub-transmission voltage level tripped in quick succession, along with some events at transmission level. Near Cleveland the lower voltage grid started to separate from the high voltage transmission grid, resulting in overloads in the lower voltage network and further trips. Once voltage could not be maintained in the region with the line outages, generation began to trip, causing a slide that resulted in the blackout of not only eastern Michigan and northern Ohio, but also New York and Ontario.

    The final event that blacked out East Michigan was the tripping of the interconnection between Michigan and Ontario. Power that had been flowing into East Michigan from the north suddenly shifted to flow south of Lake Erie through New York into Ontario. New York operators noticed flows into Canada increasing by 500-600MW.

    Within seconds, however, the New York operators saw the power flow reverse, as a huge surge of power swept back from Ontario into New York, with an associated frequency spike to 63Hz. This enormous power swing appears to be a dynamic effect associated with the changing shape of the grid around the Great Lakes, including the loss of a vital power route through Michigan to Ontario. What happened in that power surge is unlikely to be explained as simple redistribution of power flows. More likely, it was the system’s response in trying to maintain synchronism between the interconnected areas of Ontario and New York through a weak link, following a huge, sudden change in the shape of the grid.
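The kind of swing described can be sketched with the classical single-machine swing equation, with the transfer reactance suddenly increased to represent loss of a parallel path. This is a highly simplified stand-in for the real multi-machine event, and all parameters are illustrative, not calibrated to August 14th:

```python
import math

H = 4.0          # inertia constant, seconds (illustrative)
f0 = 60.0        # nominal frequency, Hz
pm = 0.8         # mechanical power, per unit
e, v = 1.0, 1.0  # internal and bus voltage magnitudes, per unit

def simulate(x_before=0.3, x_after=0.9, t_switch=1.0, t_end=10.0, dt=0.001):
    """Integrate the swing equation; return the maximum rotor angle (rad)."""
    delta = math.asin(pm * x_before / (e * v))  # pre-event equilibrium angle
    omega = 0.0                                  # speed deviation, per unit
    max_swing = delta
    t = 0.0
    while t < t_end:
        x = x_before if t < t_switch else x_after  # reactance jumps at t_switch
        pe = e * v / x * math.sin(delta)           # electrical power transfer
        omega += (pm - pe) / (2 * H) * dt          # swing equation (no damping)
        delta += omega * 2 * math.pi * f0 * dt
        max_swing = max(max_swing, delta)
        t += dt
    return max_swing

print(f"peak angle swing: {math.degrees(simulate()):.1f} degrees")
```

Even this toy model shows the qualitative behaviour: a sudden weakening of the transfer path launches a large power and angle swing as the machine hunts for a new equilibrium, which is the mechanism suggested above for the Ontario-New York surge.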

    Managing Blackout Risks
    From an engineering perspective, there are two ways to reduce the risks of future blackouts. One is to strengthen the electricity infrastructure, and the other is to address the methods used to control and manage the existing facilities.

    There are clear needs for more investment in electricity infrastructure. Strengthening the transmission grid and adding generation appropriately located and specified would certainly improve the situation. But it is not the whole solution, since:

    • Major infrastructure investments have long lead times. In the meantime, there could be another blackout.
    • Strengthening the grid without addressing the methods and rules to manage it could lead to the same risks recurring as load growth fills up the spare capacity.

    The tried and tested methods to ensure secure operation of the system broadly involve the use of system models to determine boundaries, or constraints, within which the system must be operated. Then the control operators (and also automatic protection mechanisms) ensure that the system stays within these defined boundaries.
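In software terms, the operational half of this scheme reduces to comparing real-time measurements against limits computed offline from the system model. A minimal sketch follows; the line names echo lines involved on August 14th, but the limits and readings are entirely invented for illustration:

```python
# Operating limits would come from offline studies on the system model.
LINE_LIMITS_MW = {"Harding-Chamberlin": 500, "Hanna-Juniper": 600}  # hypothetical
VOLTAGE_BAND_PU = (0.95, 1.05)  # a typical per-unit operating band

def check_limits(line_flows_mw, bus_voltages_pu):
    """Return a list of human-readable alarms for out-of-bounds readings."""
    alarms = []
    for line, flow in line_flows_mw.items():
        limit = LINE_LIMITS_MW.get(line)
        if limit is not None and abs(flow) > limit:
            alarms.append(f"{line}: {flow} MW exceeds {limit} MW limit")
    lo, hi = VOLTAGE_BAND_PU
    for bus, volt in bus_voltages_pu.items():
        if not lo <= volt <= hi:
            alarms.append(f"{bus}: voltage {volt:.3f} pu outside [{lo}, {hi}]")
    return alarms

alarms = check_limits({"Harding-Chamberlin": 540, "Hanna-Juniper": 300},
                      {"Cleveland": 0.92, "Toledo": 1.00})
print(alarms)
```

The hard part, as discussed below, is not this comparison but keeping the model-derived limits valid and the measurements visible to operators.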

    It is worth noting that on 14th August there were issues in both of the above areas. The IT system in the Midwest ISO used to define secure constraints was out of action for an hour and a half. Later, FirstEnergy’s monitoring system, through which control room operators view what is happening on the grid, gave problems. Aside from these failures, the systems normally available to operators in the region would not have made it easy for them to take in the big picture of the state of the whole system quickly.

    These standard industry practices must be rigorously applied. In particular, the necessity to maintain detailed and validated models of the network is a vital piece of good practice that is certainly not a trivial task. In particular, accurate prediction of dynamic behaviour of the grid is a big issue.

    By improving the management and operation of these security mechanisms, the risk of outages can certainly be reduced. But this does not address the core issue in large-scale cascading blackouts, which is that underlying system conditions and interactions, in certain circumstances, weaken the grid in such a way that the probability of catastrophic failures is increased.

    In the discussion above, some potential linking factors were listed that could contribute to the risk of a condition in which a blackout could occur. There is a need for intelligent grid monitoring tools capable of giving early warning of conditions in which there is a heightened risk of instability and cascading failures. Such tools could empower operators across a wide interconnected region to detect and mitigate these risk conditions. Monitoring systems have been developed to address aspects of the blackout risk with early warning and risk mitigation facilities. These are not yet in widespread use, but could be deployed in a short timeframe in response to the urgent need to prevent recurrences.

    While the need for grid reinforcement is clear, and a sound regulatory environment is vital, we must ensure that the core issues of large-blackout risk are addressed. The engineering management and oversight of the grid must provide appropriate, enforceable rules that constrain the system in a way that effectively reduces large blackout risks. In addition, the best possible tools to enable system planners and operators to follow these rules effectively must be made available.

    1 All times in Eastern Daylight Time



    Very nice summary of what is currently known, the difficulty of putting it together and the viable solutions.
    Bill Hannaman

    I agree with all the concepts stated in the article. It’s a valuable summary, and a very didactic piece. I would only have pointed out, besides the measures to detect and mitigate the risk conditions, the need to implement Defense Plans to prevent reaching a blackout condition in case all the preventive procedures and automatic systems fail. Those plans may assure the existence of at least some active islands capable of participating in the restoration of the whole system. Starting from a total blackout is not the same as initiating the restoration when the system has been able to maintain several energized islands. It is also necessary to have standard procedures to perform the restoration, which means trained personnel and operating guides written on the basis of studies of the electrical system. Obviously I agree that the goal is to absolutely prevent the occurrence of a blackout. But if there is a succession of improbable events that leads to an inevitable collapse, all the actors must be prepared to conduct the restoration as soon as possible, and without any doubts about what must be done.

    The author is right, but he's dreaming. Does anyone really believe that utility holding companies will pay for the needed tools? In business as usual, at regulators' behest, they will throw some money at the lowest bidder, say "Look, we're at the industry standard," and turn their backs on us once again.

    I think a couple of observations are in order. Communications will always play a critical role in system operation. System operators can successfully control and monitor the system in a steady-state (or nearly steady-state) condition; however, when system events dictate the need for rapid, clear and concise communication between utilities and control areas, it has been shown that we can fail. I submit that in addition to bolstering our transmission and control systems, we pay equal attention to the systems used to communicate, and to how those systems are used under emergency conditions. We can escalate the process of modernizing the electric system, but if we don't look at the rest of the puzzle, we will have gained less than adequate ground. There is far too much to explore in this area to do it justice in this commentary, but suffice it to say that communication is key to our future success in power system operations and in handling unusual events. All of our major technological disasters have been strongly linked to poor communications, a lack of adequate contingency plans, and a lack of people who understand them. In short, our problem is only in part an electrical one.

    Very interesting article, and the comments are also enlightening. It would be interesting to know just what this outage cost the utilities. They certainly lost revenue, and starting plants from black isn't easy. Perhaps with that in mind they will want to spend more than the minimum to keep another such occurrence from happening.

    This author has presented a coherent and manageable set of ideas, which can be very helpful in analyzing the causes of the blackout.

    Everything I've read about this blackout so far points in the direction of a classic voltage collapse scenario rather than a generation deficiency scenario. You can have sufficient generation, but due to transmission outages during peak load periods the transmission system is so heavily loaded that it is forced to operate near the "nose" of the Q-V curve, putting the system at high risk of voltage instability and collapse with cascading consequences.

    The author has listed four system stress factors:

    -high loading of transmission and generation facilities
    -depressed voltages
    -high temperature and humidity
    -dynamic interactions between interconnected generators and other machines

    Of these, the first three are related and were present during the time between noon and 3.40 pm. The fact that "there were no record temperatures that day" may not be so important because the cumulative effect of prior "degree-days" can cause near peak load conditions.

    Let me add that following the two 1996 western blackouts, the Western Electricity Coordinating Council (WECC) adopted reactive margin criteria which, to my knowledge, have not yet been implemented in the eastern systems. The 2003 eastern blackout was a case of transmission deficiency, not a case of power deficiency. I am quite confident that when everything has been analyzed, the eastern systems will adopt the same reactive margin criteria that we have here in the west.

    Mr. Wilson states:

    "These standard industry practices must be rigorously applied. In particular, the necessity to maintain detailed and validated models of the network is a vital piece of good practice that is certainly not a trivial task. In particular, accurate prediction of dynamic behaviour of the grid is a big issue.

    By improving the management and operation of these security mechanisms, the risk of outages can certainly be reduced. But this does not address the core issue in large-scale cascading blackouts, which is that underlying system conditions and interactions, in certain circumstances, weaken the grid in such a way that the probability of catastrophic failures is increased."

    Just when Mr. Wilson makes a crucial point, he backs away. How in the world can intelligent grid monitoring tools give operators the ability to detect and mitigate instability and cascading failures if the monitoring and control actions required have not been validated on a model that accurately predicts the dynamic behaviour of the grid? It would seem nothing more important addresses the underlying system conditions and interactions that lead to large-scale cascading failures.
    I liked this article; you bring up some good points.

    Thanks, all, for your comments so far. I would like to respond to a few:

    I thoroughly agree that communications are a vital link, and August 14th exposed some major problems. Reading the control center transcripts, I sympathized with operators trying to grasp the picture of what was happening over the whole system using telephones, without even having a real-time view of which generating units and lines had tripped! Without any rocket science, comprehensive real-time information on voltage, MW/MVAr flows and status would be most valuable. But beyond that, tools using real-time information to recognise potentially significant problems are vital. It is not wise to swamp operators with data without drawing out and presenting the important information!

    Perhaps we need to dream! If, as you suggest, traditional industry standard tools and practices are the best that we can hope for, then I submit there is some regulatory problem to be addressed. Regulation is not working properly if there are tools just round the corner that would break new ground in reliability, but there is no incentive for utilities to take them up.

    Thank you for making the connections between loading, temperature and voltage issues. I submit that the dynamic interaction issue can also be related, as demonstrated in the WSCC, where oscillatory instability along with voltage instability caused the system separation. The existence or absence of oscillatory behavior on August 14th has not been established, and the system data needed to investigate this was either not collected or has not been made available.

    I would be interested in comments on operational experience of the use of the reactive margin criteria in WECC. Has this made a significant impact on reducing risk of voltage collapse? There could be value in real-time estimates of reactive margins on the basis of on-line monitoring.

    The most publicized on-line grid monitoring tools to date come under the umbrella term "Dynamic Security Assessment". All of these tools, as you point out, rely on accurate, validated dynamic models. But this is not the only possibility. In my company, we addressed the oscillatory stability problem by using direct measurements from the grid to provide continuous measurements of system damping. As well as determining how far the system is from oscillatory instability, it is also possible to provide guidance to operators and analysts on how to deal with problems as they come up, without using the system model. It would also be possible to implement a similar direct method to the voltage stability problem.
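As a toy illustration of the measurement-based idea (not the commercial method, which is far more sophisticated and works on ambient data), the damping of a decaying oscillation can be recovered directly from measured peaks via the classical log decrement:

```python
import math

def damped_signal(t, freq_hz=0.5, zeta=0.05):
    """A synthetic 0.5 Hz mode (typical inter-area range) with 5% damping."""
    wn = 2 * math.pi * freq_hz
    wd = wn * math.sqrt(1 - zeta ** 2)
    return math.exp(-zeta * wn * t) * math.cos(wd * t)

dt = 0.01
samples = [damped_signal(i * dt) for i in range(2000)]

# Find local maxima (the oscillation peaks) in the sampled record.
peaks = [samples[i] for i in range(1, len(samples) - 1)
         if samples[i - 1] < samples[i] > samples[i + 1]]

# Log decrement between consecutive peaks -> damping ratio estimate.
delta = math.log(peaks[0] / peaks[1])
zeta_est = delta / math.sqrt(4 * math.pi ** 2 + delta ** 2)
print(f"estimated damping ratio: {zeta_est:.3f}")
```

The point is only that damping is, in principle, observable directly from grid measurements, without first building and validating a full dynamic model.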

    I maintain that accurate, validated system dynamic models are vitally important, and I do not get the impression that there is enough attention paid to validating system-wide dynamic models. But the possibilities for direct-measurement based stability tools should also be part of the solution.

    Douglas, pardon my obsolescence. On-line methods must have passed me by! In the old days oscillatory stability problems were analyzed on validated system dynamic models. In particular, algorithms that might use direct measurement of system damping to determine how far a system was from oscillatory instability were subject to extensive testing on system dynamic models. Use of model reference control systems also required constant wringing out with such models. The same was true with adaptive and stochastic control schemes. Operator guides also used to depend on the knowledge of system behavior gained from exercising validated dynamic models. Perhaps all of this is going on so much faster than real time that direct-measurement stability tools no longer appear to depend on dynamic models. The simulation and modelling RFP released by E2I on 9/2/03 indicates that EPRI feels that better system dynamic models are required. Presumably, these would be used to determine the effects of, among other things, voltage control commands/actions similar to the ones prompted by the direct-measurement stability tools you describe. If your argument is that direct measurements from an operating grid are necessary to validate an undertaking like the EPRI FSM system, I agree.

    Is there even a model of HOW to model voltage collapse scenarios?

    Your recommendation in the article as follows, "There is a need for intelligent grid monitoring tools capable of giving early warning of conditions in which there is a heightened risk of instability and cascading failures. Such tools could empower operators across a wide interconnected region to detect and mitigate these risk conditions."

    While you didn't actually say it, this recommendation seems to strongly advocate the use of intelligent systems (or expert systems) as tools to help operators deal with unusual events that could lead to islanding and blackout conditions. I wonder if any future articles will help bring all of us up to date on how these systems might be employed to defend against blackouts. I think we would all be interested in knowing which systems have proven most successful in the transmission environment as opposed to power plants where a number of them have already been employed.

    My request sounds like one directed to someone from EPRI, but perhaps your readers with direct experience in expert systems or neural networks would like to respond. I would certainly enjoy reading the comments.